PHPCon West 2003

I will be speaking at PHPCon West 2003 on October 23 in Santa Clara, CA.

I’ll be giving an updated version of my One Year of PHP at Yahoo! talk. If you didn’t make it to Portland this summer, you can hear me live in the Bay Area this fall.

Here’s the abstract:

Running a high-performance dynamic website is a daunting task. The short development cycles needed to stay ahead of the competition demand a web-centric scripting language that is easy to maintain and update. After a year of using PHP, Yahoo! will discuss its findings about PHP’s strengths and weaknesses.

We will present 5 general techniques for optimal PHP performance in an enterprise environment, 6 ways to harden your PHP applications, and 4 techniques for managing a diverse PHP installation on thousands of web servers.

We’ll also look at some open problems, such as the difficulty in maintaining clean separation of content, presentation, and business logic.

From the perspective of a PHP developer, this talk will be more interesting than my PHPCon 2002 talk because this one gives some concrete suggestions on how to do large-scale PHP. My “Making the Case” talk was very introspective, which was interesting to the Slashdot crowd because they got to learn about Yahoo!, but it didn’t teach PHP folks anything new.

I also went about 10 minutes over my 45 minute budget at OSCON, so the fact that PHPCon is giving me a 60-minute block of time means I don’t need to cut anything out. :-)

The Cathedral, The Bazaar, and Apache

A couple of weeks ago I read Eric Raymond’s The Cathedral and The Bazaar, a collection of essays about Open Source software. Raymond writes quite well for a techie (either that or he has a superb editor), and the book is coherent. I didn’t agree with most of the book, but I think it’s important to keep abreast of what other folks are writing about the space.

Despite my general disappointment in the book, Homesteading the Noosphere was quite good. In an essay describing how “ownership” of Open Source projects works, Raymond accurately states the previously unwritten code of behavior. Projects have owners. Contributions are welcome, especially when they’re written well. Project ownership can be transferred. Forking is strongly discouraged, although sometimes necessary as a last resort when the owner won’t accept changes and refuses to relinquish control of the project.

The Homesteading the Noosphere essay has actually prompted me to think a little bit about what’s going to happen with the Apache HTTP Server. The Apache Software Foundation is currently maintaining two separate versions of this product, 1.3.x and 2.0.x (and is also working on 2.1.x). Although the 2.0 server has been stable and “recommended” for over a year now, lots of organizations are still using the 1.3 platform. The ASF would like folks to move to 2.0, but the fact that they’re still making 1.3.x releases indicates that they recognize that migrating to 2.0 is no small undertaking. Security fixes (and sometimes features) always land in 2.0 first and then need to get “backported” to 1.3.

But what if maintaining two separate products became too cumbersome and the ASF decided to stop making 1.3.x releases? I’ve wondered privately if any of the organizations that have a substantial investment in Apache/1.3 would want to take over the codebase (i.e. fork it). What would happen to the Apache community if someone decided to make an Apache/1.4 release? If the development was split across two projects, would both lose momentum (and therefore market share)? Would the vast majority of folks stand by the ASF and swallow the complexity of the 2.x server, while a “rogue” bunch of hackers simply caused social turmoil with 1.4 but never really made it successfully as a project? Or vice-versa?

Technical and social considerations aside, something called “Apache/1.4” couldn’t really happen without the ASF’s blessing. Although the code is Open Source so you could re-use it for another project, the Apache License is written in such a way that derivative products aren’t allowed to use the name “Apache”. But maybe there could be a Hopi/1.4 or a Mohican/1.4 HTTP server…

As Raymond writes in Homesteading the Noosphere, the natural motivation is to avoid forking unless absolutely necessary. In the case of the Apache HTTP Server, there are decent technical and social alternatives to this last resort. So I’d hazard a guess that we’ll never see Apache/1.4.

Instead, we’ll probably see at most two more Apache/1.3 releases before the code is officially declared deprecated (which will probably happen right around the time that Apache/2.1 is released). Folks who have put off the 1.3-to-2.0 migration effort will take a serious look at a 1.3-to-2.1 jump, and the vast majority of them will make the move over the next two years. Sure, there will always be some laggards who are stuck using Apache/1.3.31, but by the end of 2005 their numbers will be so small that they’re not worth mentioning.

Apache 1.3.28 next week?

I don’t have time to read most of the Apache mailing lists, but I do keep an eye on the low-traffic cvs commit list.

There’s been a lot of discussion over the past month or so about the upcoming 1.3.28 release, and even a couple of dates proposed. The most recent message suggests that we’ll see a 1.3.28 release next week.

Taking a look at the CHANGES file, there’s not too much that I really need in this release. The past year has been pretty slow for Apache 1.3 development, in large part because folks are starting to move to 2.0.

PHP libcurl example

In one of the sections of the “One Year of PHP at Yahoo!” talk I’m giving next week, I mention the security implications of the allow_url_fopen config setting.

I recommend that people set allow_url_fopen to Off and instead use the libcurl extension to do server-side HTTP fetches.
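
The setting lives in php.ini; with URL wrappers disabled, fopen() and friends only touch the local filesystem:

; php.ini
allow_url_fopen = Off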

Here’s a comparison of a simple HTTP fetch using both techniques.

allow_url_fopen = On

<?php
$str = file_get_contents("http://www.example.com/");
if ($str !== false) {
    // do something with the content
    $str = preg_replace("/apples/", "oranges", $str);
    // avoid Cross-Site Scripting attacks
    $str = strip_tags($str);
    echo $str;
}
?>

allow_url_fopen = Off

<?php
$ch = curl_init("http://www.example.com/");
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$str = curl_exec($ch);
if ($str !== false) {
    // do something with the content
    $str = preg_replace("/apples/", "oranges", $str);
    // avoid Cross-Site Scripting attacks
    $str = strip_tags($str);
    echo $str;
}
curl_close($ch);
?>

It’s not that much additional work to use the curl extension, and you shield all of your regular file I/O against the possibility of accidentally acting as an open proxy. You avoid having to scrutinize every usage of fopen(), readfile(), file_get_contents(), include(), require() and related functions for the possibility that they might be used with a URL.
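
To make the risk concrete, here’s a hypothetical snippet (not from my talk) of the kind of code that becomes dangerous when URL wrappers are enabled:

<?php
// DANGEROUS when allow_url_fopen is On: a request like
//   page.php?file=http://evil.example.com/
// makes readfile() fetch and serve the remote URL, so this
// script acts as an open proxy. include($_GET['file']) would
// be even worse, since the fetched content gets executed as PHP.
readfile($_GET['file']);
?>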

Logfile analyzers

I’ve just started using The Webalizer to do logfile analysis for radwin.org and hebcal.com.

Back in 1998 when I first started hosting my own domain name, I wanted to see where people were coming from and what they were viewing, so I set up the wwwstat script as a cron job to generate statistics.

I’ve never really liked the reports it gives, so last month I downloaded Analog, which claims to be “the most popular logfile analyser in the world.” It has every feature you could possibly imagine, including graphs and charts, search referrer statistics, and even 31 different language output options. Perhaps because it’s so feature-rich, it is very difficult to compile and configure. I never got around to fixing my cron jobs to use it, in part because I couldn’t figure out how to send it data from stdin.

Last night Dave Jeske asked about log statistics for his blog (since I’m hosting it on my site for the time being) and I told him the URL to the crappy wwwstat page that I generate daily from cron. I warned him that my ISP rotates logfiles daily, so the page never shows more than the past 24 hours of statistics.

He pointed out that webalizer has a slick -p option that lets you preserve state, so you can run it multiple times and it incrementally updates the statistics. Neat.

So I downloaded the source code, ran ./configure --prefix=/home/mradwin/local --with-etcdir=/home/mradwin/local/etc and make all install and I was up and running in 15 minutes!
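
A hypothetical crontab entry for an incremental run might look like this (the binary and config paths come from my ./configure flags above; the logfile path is made up, so adjust to taste):

# Run webalizer daily at 4am with -p to preserve state between
# runs, so daily logfile rotation doesn't lose history.
0 4 * * * /home/mradwin/local/bin/webalizer -p -c /home/mradwin/local/etc/webalizer.conf /var/log/httpd/access_log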

It’s not as slick as Analog, but I don’t care. It does exactly what I want, and nothing more.

Thanks for the tip, Dave. Enjoy your new stats URL.

Efficient Apache 1.3 setup for port 80 and 443

If you need to run both SSL and non-SSL Apache 1.3 on the same host, the most efficient way is to run two separate server instances rather than using <VirtualHost>s and multiple Listen directives.

If you use multiple Listen statements to listen on either multiple ports or multiple addresses, Apache needs to use select() in order to test each socket to see if a connection is ready.

If you only use a single Listen statement, Apache uses accept() instead of select(). All children can just block in accept() until a connection arrives.

There’s a long discussion about the inefficiencies and synchronization difficulties of using a select() loop rather than an accept() loop on the Apache 1.3 performance tuning page.

Excerpt from that document:

“Ideally you should run servers without multiple Listen statements if you want the highest performance.”

We’ve been doing this for years at Yahoo! No, it’s not Rocket Science; it’s right there on Apache 1.3’s perf-tuning web page.

But there are many examples of SSL config files floating around out there with multiple Listen statements. If the rest of the world’s engineers are anything like me, there is a strong temptation to find a conf file that works and just use it. The copy-and-modify approach is great when all you want is functionality. But when performance matters, you’ve gotta read the docs.
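
As a sketch, here’s what the two-instance approach might look like (file names, paths, and certificate locations here are hypothetical). Each instance gets its own config file with a single Listen directive, plus its own PidFile and logs:

# httpd-80.conf -- the plain-HTTP instance
Listen 80
PidFile /var/run/httpd-80.pid
ErrorLog /var/log/httpd/error-80.log
# ... the rest of the shared configuration ...

# httpd-443.conf -- the SSL instance (mod_ssl)
Listen 443
PidFile /var/run/httpd-443.pid
ErrorLog /var/log/httpd/error-443.log
SSLEngine on
SSLCertificateFile /etc/ssl/server.crt
SSLCertificateKeyFile /etc/ssl/server.key
# ... the rest of the shared configuration ...

Then start each one as a separate daemon:

httpd -f /etc/httpd/httpd-80.conf
httpd -f /etc/httpd/httpd-443.conf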

PHPCon East 2003

I’ve been invited to speak at PHPCon East 2003 in April:

PHPCon East 2003 – (April 23-25, 2003). PHPCon announces PHPCon East 2003 in New York City. This conference features two days of technical learning with speakers such as Rasmus Lerdorf, Michael Radwin, and Jeremy Zawodny. PHPCon East also adds a third, full day of tutorials offering practical, cogent PHP solutions and ideas, including: MySQL and PHP; Building and Consuming Web Services with SOAP; Getting Started with PHP; High Performance PHP: Profiling and Benchmarking; and more. PHPCon East has discounts for early registration, students, non-profits, and Tutorial/Conference packages. Early Bird Deadline is March 31st. For more program information, visit the PHPCon website. [PHP: Hypertext Preprocessor]

Unfortunately, the first two days of the conference also happen to be the last two days of Passover. So I’m not sure I’ll be able to make it. :-(

ApacheCon: LinkRot

Sander van Zoest started off by describing three common causes of link rot:

  • Redesign/reorganize your website
  • Switch dynamic page language (for example, from JSP to PHP)
  • Typos (user hand-edits URL and makes a mistake)

Consequences? Link rot can be distilled down to one thing: 404 == bad user experience.

van Zoest spoke about some ways of detecting and discovering link rot in an automated manner, and some Apache directives you can use to avoid the problem: Redirect, the mod_rewrite module, and a PHP or CGI page wired up as the ErrorDocument 404 handler that tries to dynamically redirect the user to the content’s new location.
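
Here’s a minimal sketch of that last technique, assuming a handler registered in httpd.conf with ErrorDocument 404 /notfound.php and a hand-maintained table of moved URLs (the entries below are made up):

<?php
// Apache passes the originally-requested path to an
// ErrorDocument handler in the REDIRECT_URL variable.
$url = isset($_SERVER['REDIRECT_URL']) ? $_SERVER['REDIRECT_URL'] : '';

// Hypothetical map of old locations to their new homes.
$moved = array(
    '/old/index.jsp' => '/index.php',
    '/staff.html'    => '/about/team.html',
);

if (isset($moved[$url])) {
    // Known move: send a permanent redirect to the new location.
    header('HTTP/1.1 301 Moved Permanently');
    header('Location: http://www.example.com' . $moved[$url]);
    exit;
}

// Unknown URL: fall through to a regular 404 page.
header('HTTP/1.1 404 Not Found');
echo '<html><body><h1>Not Found</h1></body></html>';
?>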

The HTTP Content-Location header (not to be confused with the HTTP Location header) can be used to specify the permanent archive location of the current content. Useful for time-sensitive information, but user agents don’t really take advantage of this metadata.
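
For example, a dynamic front page might emit something like this (the archive URL is hypothetical) so the day’s content can be cited at its permanent address:

<?php
// Tell user agents where the permanent copy of this
// time-sensitive page will live.
header('Content-Location: http://www.example.com/archive/2002/11/21/');
?>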

van Zoest spent a few slides on things one should avoid putting in URLs. For example, query strings (the key=value pairs after the question mark) make your pages less index-able by search engines, and you can often use Path Info instead. In addition, you can avoid extensions such as .php in URLs using techniques like Options +MultiViews, DefaultType, and ForceType.
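
A hypothetical .htaccess sketch of the extension-hiding idea (the file names are made up):

# With MultiViews, a request for /about can be satisfied by
# about.php via content negotiation, so the extension never
# appears in the URL.
Options +MultiViews

# Alternatively, serve a specific extensionless file as PHP.
<Files "about">
    ForceType application/x-httpd-php
</Files>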

In the future, Apache 2.0 could provide a map_to_storage hook, which should help make the URL-to-filesystem mapping less tightly coupled.

ApacheCon: Thursday Lunch

I got together with Ze’ev Suraski for lunch at the Hard Rock Cafe (just across the street from the Alexis Park Hotel).

We spoke about the matzav (“the situation” in Israel), how difficult it is to be a vegetarian during Pesach, Israeli politics, and our respective businesses. I got to practice a little bit of my Hebrew, but before I could embarrass myself too much, we switched back to English.

I headed over to Stipe Tolj’s 1:30pm talk about using Apache as a WAP server, but I slipped into a post-lunch coma. I think I was awake for the last 20 minutes, so I got to hear a little about the Kannel server. Sounds interesting.