Tomorrow, I'll be giving a talk entitled Hacking Apache HTTP Server at Yahoo!
Since 1996, Yahoo has been running Apache HTTP Server on thousands of servers and serving billions of requests a day. This session reveals the secrets of how Yahoo gets maximum performance out of minimal hardware by tweaking configuration directives and hacking the source code. Radwin will cover topics such as reducing bandwidth costs, extensible logfile format and rotation schemes, dumping core gracefully, and how to avoid the dreaded MaxClients, Max/MinSpareServers, StartServers configuration nightmare.
I love this topic. This is by far the most fun I have ever had preparing for a presentation. It's a privilege to be able to speak to such a savvy audience. I only wish I had more than 60 minutes. :-)
I'm at ApacheCon.
The Alexis Park Hotel is under construction. Charming.
Check out this l33t speaker button for ApacheCon 2004:
My HTTP Caching and Cache-busting for Content Publishers talk has nothing to do with Yahoo and nothing to do with PHP. 100% pure Apache, baby.
I'll be speaking about HTTP Caching and Cache-busting for Content Publishers at ApacheCon 2004 on November 17, 2004. This will be a revised version of the talk I gave at OSCON this summer with some new content and a better overall flow.
Slides are now online (HTML, PPT) for today's talk on HTTP Caching and Cache-busting for Content Publishers.
Abstract: A user's web experience can often be improved by the proper use of HTTP caches. Radwin discusses when to use and when to avoid caching, and how to employ cache-busting techniques most effectively. Radwin also explains the top 5 caching and cache-busting techniques for large content publishers.
If you work in the Internet biz, read George's post Why PHP Scales - A Cranky, Snarky Answer. It's worth 10 minutes, and the crankiness level is a lot lower than the title would lead you to expect.
He's right that Java is actually faster than PHP because Java is a compiled language and PHP is not. And he's right that it really doesn't matter 99.9% of the time.
I just got a copy of Advanced PHP Programming by George Schlossnagle. It's the first good book published for PHP5, and an excellent read even for folks who are still using PHP4.
The book isn't just about PHP. It covers many aspects of the development process used to produce a robust, fast, maintainable website. George covers a range of topics you won't frequently find in a typical PHP book. For example, in Chapter 7 he spends a couple of pages discussing the different techniques for distributing files from your development environment into your production environment. He spends a large portion of the book discussing regression and unit testing, load testing and profiling/benchmarking. This isn't an ordinary PHP book.
The last hundred pages of the book are for really advanced users. George covers the PHP extension APIs in more detail than the online documentation at php.net. You've gotta be a C/C++ hacker to appreciate this stuff.
My only possible complaint about the book is that it's a little OO-centric. Most of the examples George presents use classes to provide some organization of data and grouping of functionality. His use of OO is a lot more palatable to me than the huge object hierarchies you find in some projects. I've never understood why people want something like log4php which adds 10k LOC to your application and adds little value over the built-in syslog().
I have been invited to speak about HTTP caching and cache-busting at the O'Reilly Open Source Convention in July 2004.
Abstract of my talk:
A user's web experience can often be improved by the proper use of HTTP caches. This talk discusses when to use and when to avoid caching, how to employ cache-busting techniques most effectively, and how to diagnose problems with caches.
In particular, this talk will cover:
Hope to see you there.
Zawodny thinks he needs mod_gzip but in fact he'd probably like mod_deflate for Apache/1.3 better.
I'm at PHPCon West 2003 this week.
I didn't make it to the Code Sprint today, but Andrei was leading one of the sections, so I'll get the skinny from him. I really dig the idea -- get programmers to pay for the opportunity to contribute their brainpower to an Open Source project.
I will be speaking at PHPCon West 2003 on October 23 in Santa Clara, CA.
I'll be giving an updated version of my One Year of PHP at Yahoo! talk. If you didn't make it to Portland this summer, you can hear me live in the Bay Area this fall.
Here's the abstract:
Running a high-performance dynamic website is a daunting task. The short development cycles needed to stay ahead of the competition demand a web-centric scripting language that is easy to maintain and update. After a year of using PHP, Yahoo! will discuss its findings about PHP's strengths and weaknesses.
We will present 5 general techniques for optimal performance PHP in an enterprise environment, 6 ways to harden your PHP applications, and 4 techniques for managing a diverse PHP installation on thousands of web servers.
We'll also look at some open problems, such as the difficulty in maintaining clean separation of content, presentation, and business logic.
From the perspective of a PHP developer, this talk will be more interesting than my PHPCon 2002 talk because this one gives some concrete suggestions on how to do large-scale PHP. My "Making the Case" talk was very introspective, which was interesting to the slashdot crowd because they got to learn about Yahoo!, but didn't teach PHP folks anything new.
I also went about 10 minutes over my 45 minute budget at OSCON, so the fact that PHPCon is giving me a 60-minute block of time means I don't need to cut anything out. :-)
A couple of weeks ago I read Eric Raymond's The Cathedral and The Bazaar, a collection of essays about Open Source software. Raymond writes quite well for a techie (either that or he has a superb editor), and the book is coherent. I didn't agree with most of the book, but I think it's important to keep abreast of what other folks are writing about the space.
Despite my general disappointment in the book, Homesteading the Noosphere was quite good. In an essay describing how "ownership" of Open Source projects works, Raymond accurately states the previously unwritten code of behavior. Projects have owners. Contributions are welcome, especially when they're written well. Project ownership can be transferred. Forking is strongly discouraged, although sometimes necessary as a last resort when the owner won't accept changes and refuses to relinquish control of the project.
The Homesteading the Noosphere essay has actually prompted me to think a little bit about what's going to happen with the Apache HTTP Server. The Apache Software Foundation is currently maintaining two separate versions of this product, 1.3.x and 2.0.x (and is also working on 2.1.x). Although the 2.0 server has been stable and "recommended" for over a year now, there are lots of organizations that are still using the 1.3 platform. The ASF would like folks to move to 2.0, but the fact that they're still making 1.3.x releases indicates that they recognize that migrating to 2.0 is no small undertaking. When there are security problems (and sometimes features) these changes are always made in 2.0 first, but need to get "backported" to 1.3.
But what if maintaining two separate products became too cumbersome and the ASF decided to stop making 1.3.x releases? I've wondered privately if any of the organizations that have a substantial investment in Apache/1.3 would want to take over the codebase (i.e. fork it). What would happen to the Apache community if someone decided to make an Apache/1.4 release? If the development was split across two projects, would both lose momentum (and therefore market share)? Would the vast majority of folks stand by the ASF and swallow the complexity of the 2.x server, while a "rogue" bunch of hackers simply caused social turmoil with 1.4 but never really made it successfully as a project? Or vice-versa?
Regardless of technical or social reasons, something called "Apache/1.4" couldn't really happen without the ASF's blessing. Although the code is Open Source so you could re-use it for another project, the Apache License is written in such a way that derivative products aren't allowed to use the name "Apache". But maybe there could be a Hopi/1.4 or a Mohican/1.4 HTTP server...
As Raymond writes in Homesteading the Noosphere, the natural motivation is to avoid forking unless absolutely necessary. In the case of Apache HTTP Server, there are decent technical and social alternatives to this last resort. So I'd hazard to guess that we'll never see Apache/1.4.
Instead, we'll probably see at most two more Apache/1.3 releases before the code is officially declared deprecated (which will probably happen right around the time that Apache/2.1 is released). Folks who have put off the 1.3-to-2.0 migration effort will take a serious look at a 1.3-to-2.1 jump, and the vast majority of them will make the move over the next two years. Sure, there will always be some laggards who are stuck using Apache/1.3.31, but by the end of 2005 their numbers will be so small that they're not worth mentioning.
I don't have time to read most of the Apache mailing lists, but I do keep an eye on the low-traffic cvs commit list.
There's been a lot of discussion over the past month or so about the upcoming 1.3.28 release, and even a couple of dates proposed. The most recent message suggests that we'll see a 1.3.28 release next week.
Taking a look at the CHANGES file, there's not too much that I really need in this release. The past year has been pretty slow for Apache 1.3 development, in large part because folks are starting to move to 2.0.
In one of the sections on my "One Year of PHP at Yahoo!" talk I'm giving next week, I mention the security implications of the allow_url_fopen config setting.
I recommend that people set allow_url_fopen off, and instead use the libcurl extension to do server-side HTTP fetches.
Here's a comparison of a simple HTTP fetch using both techniques.
<?php
// Technique 1: URL-aware file functions (requires allow_url_fopen = On)
$str = file_get_contents("http://www.example.com/");
if ($str !== false) {
    // do something with the content
    $str = preg_replace("/apples/", "oranges", $str);
    // avoid Cross-Site Scripting attacks
    $str = strip_tags($str);
    echo $str;
}
?>

<?php
// Technique 2: the curl extension (works with allow_url_fopen = Off)
$ch = curl_init("http://www.example.com/");
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
// return the response as a string instead of printing it directly
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$str = curl_exec($ch);
if ($str !== false) {
    // do something with the content
    $str = preg_replace("/apples/", "oranges", $str);
    // avoid Cross-Site Scripting attacks
    $str = strip_tags($str);
    echo $str;
}
curl_close($ch);
?>
It's not that much additional work to use the curl extension, and you shield all of your regular file I/O against the possibility of accidentally acting as an open proxy. You avoid having to scrutinize every usage of fopen(), readfile(), file_get_contents(), include(), require() and related functions for the possibility that they might be used with a URL.
I've just started using The Webalizer to do logfile analysis for radwin.org and hebcal.com.
Back in 1998 when I first started hosting my own domain name, I wanted to see where people were coming from and what they were viewing, so I set up the wwwstat script as a cron job to generate statistics.
I've never really liked the reports it gives, so last month I downloaded Analog, which claims to be "the most popular logfile analyser in the world." It has every feature you could possibly imagine including graphs and charts, search referrer statistics, and even 31 different language output options. Perhaps because it's so feature-rich, it is very difficult to compile and configure. I never got around to fixing my cron jobs to use it, in part because I couldn't figure out how to send it data from stdin.
Last night Dave Jeske asked about log statistics for his blog (since I'm hosting on my site for the time being) and I told him the URL to the crappy wwwstat page that I generate daily from cron. I warned him that my ISP rotates logfiles daily, so the page never shows more than the past 24 hours of statistics.
He pointed out that webalizer has a slick -p option that lets you preserve state so you can run it multiple times and it incrementally adjusts the statistics. Neat.
So I downloaded the source code, ran ./configure --prefix=/home/mradwin/local --with-etcdir=/home/mradwin/local/etc and make all install and I was up and running in 15 minutes!
It's not as slick as Analog, but I don't care. It does exactly what I want, and nothing more.
Thanks for the tip, Dave. Enjoy your new stats URL.
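For the curious, here is roughly what "preserving state" buys you when logfiles rotate daily. This is a minimal Python sketch of the general idea, not how Webalizer itself works: the function name `update_hit_counts`, the JSON state file, and the assumption of Common Log Format input are all mine.

```python
import json
from collections import Counter
from pathlib import Path


def update_hit_counts(state_file, log_lines):
    """Merge hits from one batch of log lines into totals saved in state_file."""
    path = Path(state_file)
    # Load the totals left behind by earlier runs, if there are any.
    totals = Counter(json.loads(path.read_text())) if path.exists() else Counter()
    for line in log_lines:
        parts = line.split('"')
        if len(parts) < 2:
            continue  # no quoted request field; skip the line
        request = parts[1].split()  # expected shape: METHOD /path PROTOCOL
        if len(request) >= 2:
            totals[request[1]] += 1
    # Save the merged totals so the next run can pick up where this one stopped.
    path.write_text(json.dumps(totals))
    return totals
```

Run it once per rotated logfile and the totals keep growing across days, instead of resetting every 24 hours.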
If you need to run both SSL and non-SSL Apache 1.3 on the same host, the most efficient way is to run two separate server instances rather than using <VirtualHost>s and multiple Listen directives.
If you use multiple Listen statements to listen on either multiple ports or multiple addresses, Apache needs to use select() in order to test each socket to see if a connection is ready.
If you only use a single Listen statement, Apache uses accept() instead of select(). All children can just block in accept() until a connection arrives.
There's a long discussion about the inefficiencies and synchronization difficulties of using a select() loop rather than an accept() loop on the Apache 1.3 performance tuning page.
Excerpt from that document:
"Ideally you should run servers without multiple Listen statements if you want the highest performance."
We've been doing this for years at Yahoo! No, it's not Rocket Science; it's right there on Apache 1.3's perf-tuning web page.
But there are many examples of SSL config files floating around out there with multiple Listen statements. If the rest of the world's engineers are anything like me, there is a strong temptation to find a conf file that works and just use it. The copy-and-modify approach is great when all you want is functionality. But when performance matters, you've gotta read the docs.
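To make the accept() versus select() difference concrete, here is a rough sketch in Python rather than Apache's C. The helper names `open_listener` and `next_connection` are invented for this example, and it deliberately ignores the cross-process serialization that the perf-tuning page spends most of its time on. It only shows the extra system call you pay per connection once there is more than one listening socket.

```python
import select
import socket


def open_listener(port=0):
    """Open a TCP listening socket on the loopback interface."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("127.0.0.1", port))  # port 0 lets the OS pick a free port
    sock.listen(5)
    return sock


def next_connection(listeners):
    """Wait for one incoming connection and return (connection, address)."""
    if len(listeners) == 1:
        # One listening socket: simply block in accept() until a client arrives.
        return listeners[0].accept()
    # Several listening sockets: first ask select() which of them are ready,
    # then accept() on one of those. That is two system calls per connection.
    ready, _, _ = select.select(listeners, [], [])
    return ready[0].accept()
```

With one server instance per port, every child stays on the first branch.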
I'm trying to get PHP 5.0.0-dev to work under Cygwin. It hasn't been much fun so far.
I'm working with George on a non-Yahoo! project, so I've gotta get PHP installed on a computer that I can use at home. So I figured I'd just build it on our Windows 2000 laptop under Cygwin. Should be a piece of cake, right?
Last week I got PHP-4.3.1 running without much trouble. The only snag I ran into involved some mangling of the symbols in the xml (expat) extension. The workaround was to use the Cygwin installer to install a system (shared object) version of expat, then build PHP4 with ./configure --with-expat-dir=/usr instead of statically linking in the bundled expat. It worked correctly as both a command-line binary and also when patched into the thttpd web server.
However, it turns out that George is using PHP 5, so I've gotta use the same thing. After a busy week last week, I finally got around to fetching the PHP5 source from CVS and started building it tonight.
First I had to install the Cygwin ports of the GNU build toolchain (autoconf, automake, libtool). Piece of cake. Then, much to my surprise, I got the thing to build with a simple ./buildconf && ./configure --with-expat-dir=/usr && make. But that's when my luck ended:
radwin@radwin /usr/src/php-dev/php5
$ ./sapi/cli/php.exe -v
Segmentation fault (core dumped)

radwin@radwin /usr/src/php-dev/php5
$

Investigating in gdb (after I installed the Cygwin port), I found this:

(gdb) bt
#0  0x610ab674 in memcpy () from /usr/bin/cygwin1.dll
#1  0x004d6981 in ini_parse () at Zend/zend_ini_parser.c:1040
#2  0x004d6631 in zend_parse_ini_file (fh=0x22fb30, unbuffered_errors=1, ini_parser_cb=0x4a9ad0, arg=0x51dd00) at /usr/src/php-dev/php5/Zend/zend_ini_parser.y:156
#3  0x004a9f27 in php_init_config () at /usr/src/php-dev/php5/main/php_ini.c:416
#4  0x004a625a in php_module_startup (sf=0x51b590, additional_modules=0x0, num_additional_modules=0) at /usr/src/php-dev/php5/main/main.c:1270
#5  0x005075a3 in main (argc=2, argv=0x10041800) at /usr/src/php-dev/php5/sapi/cli/php_cli.c:563
(gdb)
Bleh. A core in memcpy()? Something's messed up. I hate it when this happens.
For kicks, I tried checking out the tree with the PHP_5_0_dev_before_13561_fix tag, on the hunch that if someone bothered to tag the tree at some point, it probably worked that day. No luck; it dumps core in the same place.
I'm sure this would all be soooooo easy if I had a Linux box.
I've been invited to speak at PHPCon East 2003 in April:
PHPCon East 2003 - (April 23-25, 2003). PHPCon announces PHPCon East 2003 in New York City. This conference features two days of technical learning with speakers such as Rasmus Lerdorf, Michael Radwin, and Jeremy Zawodny. PHPCon East also adds a third, full day of tutorials offering practical, cogent PHP solutions and ideas including: MySQL and PHP; Building and Consuming Web Services with SOAP; Getting Started with PHP; High Performance PHP: Profiling and Benchmarking; and more PHPCon East has discounts for early registration, students, non-profits, and Tutorial/Conference packages. Early Bird Deadline is March 31st. For more program information, visit the PHPCon website. [PHP: Hypertext Preprocessor]
Unfortunately, the first two days of the conference also happen to be the last two days of Passover. So I'm not sure I'll be able to make it. :-(
Sander van Zoest started off by describing three common causes of link rot:
Consequences? Link rot can be distilled down to one thing: 404 == bad user experience.
van Zoest spoke about some ways of detecting and discovering link rot in an automated manner, and about some Apache directives you can use to avoid the problem: Redirect, the mod_rewrite module, and a PHP or CGI page for ErrorDocument 404 that tries to dynamically redirect the URL to the new location.
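The "smart 404 page" idea boils down to a table lookup. Here is a minimal sketch of that logic in Python, not anything shown in the talk: the `MOVED` table, its paths, and the function name `handle_missing` are all hypothetical.

```python
# Hypothetical table of pages that have moved: old path -> new path.
MOVED = {
    "/talks/oscon2004.html": "/talks/2004/oscon/",
    "/resume.html": "/about/resume.html",
}


def handle_missing(path, moved=MOVED):
    """Return (status, location) for a path the server could not find."""
    target = moved.get(path)
    hops = 0
    # A page may have moved more than once, so follow the chain of moves,
    # but give up on loops or unreasonably long chains.
    while target in moved and hops < 10:
        target = moved[target]
        hops += 1
    if target is None or target in moved:
        return 404, None  # unknown path, or a chain that never settles
    return 301, target  # permanent redirect to the final location
```

Anything not in the table still gets an honest 404 rather than a guess.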
The HTTP Content-Location header (not to be confused with the HTTP Location header) can be used to specify the permanent archive location of the current content. Useful for time-sensitive information, but user agents don't really take advantage of this metadata.
van Zoest spent a few slides discussing things to avoid in URLs. For example, any query strings (the key=value pairs after the question-mark) make your pages less index-able by search engines, and you can often use Path Info instead. In addition, you can avoid extensions such as .php in URLs using techniques like Options +MultiViews, DefaultType, and ForceType.
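One way to trade a query string for Path Info is to encode the same pairs as alternating path segments. This is a sketch under that assumption: the alternating key/value convention and the name `parse_path_info` are mine, not something from the talk.

```python
def parse_path_info(path_info):
    """Turn '/year/2004/topic/caching' into {'year': '2004', 'topic': 'caching'}."""
    # Drop empty segments caused by leading, trailing, or doubled slashes.
    segments = [segment for segment in path_info.split("/") if segment]
    if len(segments) % 2 != 0:
        raise ValueError("expected key/value pairs in %r" % path_info)
    # Even positions are keys, odd positions are the matching values.
    return dict(zip(segments[0::2], segments[1::2]))
```

So a URL ending in ?year=2004&topic=caching could instead end in /year/2004/topic/caching and carry the same information.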
In the future, Apache 2.0 could provide a map_to_storage hook which should help to make the URL-to-file system mapping less tightly coupled.
I got together with Ze'ev Suraski for lunch at the Hard Rock Cafe (just across the street from the Alexis Park Hotel).
We spoke about the matzav, how difficult it is to be a vegetarian during Pesach, Israeli politics, and our respective businesses. I got to practice a little bit of my Hebrew, but before I could embarrass myself too much, we switched back to English.
I headed over to Stipe Tolj's 1:30pm talk about using Apache as a WAP server, but I slipped into a post-lunch coma. I think I was awake for the last 20 minutes, so I got to hear a little about the Kannel server. Sounds interesting.
In his keynote address "New Ways of Thinking About Security: Open Source Thinking in a Bunged-up World", Richard Thieme spoke about the contrast between linear thinking and network thinking in society. He posits that the Open Source movement represents a new kind of freedom that is chaotic and continually evolving.
Thieme spoke about how members of the CIA and the KGB had more in common with each other than they did with their respective political environments. Even though we think of free-market and communist countries as being opposites, the suppositions and the schemas to understand and categorize the world used by the intelligence community set them apart from the rest of the communities. He made a parallel to Open Source networks of programmers.
He claimed that writing code is a form of leadership, because leadership is saying what you think of the world in a clear and visceral way. It doesn't require structural authority. Rather, writing code is functional leadership. Since leadership has two components (saying and doing), coding is in fact a true expression of leadership because it both expresses ideas and it performs a function.
He also spoke about authorship and intellectual property rights, and how these concepts were completely foreign before the invention of the printing press. Centuries later, Open Source and distributed networking are working to undermine those concepts. How do you define property when you share the information back and forth?
Security, identity, borders, and intellectual property rights are a function of clear boundaries. But, Thieme says, boundaries are not clear (and they're getting less clear). We are moving towards a collective identity, away from the nation-state.
He wrapped up by describing Richard Stallman as a saint (saying that all saints are a little crazy) and arguing that it takes someone of an obsessive-compulsive mind to make truly amazing things happen.
Rob McCool, continuing in the spirit of the easy to understand but never-adopted Meta Content Framework and the standard but substantially harder to grok Resource Description Framework, presented TAP.
The overall problem is that there is a ton of data out there on the web, but it's not in machine-understandable form. McCool is looking at addressing key problems of supporting a true web of data: query languages, canonical names, caching, and a system of trust (to avoid spammers).
On the query languages front, McCool believes that SQL and XQL are overkill, but HTTP GET is not specific enough. So TAP defines a GetData protocol. It follows in the spirit of the DNS system, where you can use a gethostbyname() function to access the service. TAP uses RDF schemas to describe graphs of data, and SOAP as the over-the-wire protocol for querying.
McCool described a module called TAPache to implement the GetData protocol. In the same way that Apache provides an htdocs directory, it provides an RDF repository. His stated goal for TAPache is to be the "BIND" application for data.
Since Amazon.com and CDnow might have different identifiers for the same album, TAP doesn't require using globally unique identifiers. But how do you tell the difference between "Michael Jackson" the musician and "Michael Jackson" your next-door neighbor? TAP addresses this using reference by description: when you want to do a query for "Michael Jackson", you ask for someone whose firstName="Michael" and lastName="Jackson" and profession="Musician" and who is the author of an album with title="Thriller".
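Stripped of RDF and SOAP, reference by description is just "find the things whose properties all match". A toy Python sketch of that idea follows. It is not TAP's actual API: the `PEOPLE` records, the flattened `albumTitle` property, and the `find_by_description` helper are invented for illustration.

```python
# Hypothetical records; a real TAP repository would hold an RDF graph instead.
PEOPLE = [
    {"firstName": "Michael", "lastName": "Jackson",
     "profession": "Musician", "albumTitle": "Thriller"},
    {"firstName": "Michael", "lastName": "Jackson",
     "profession": "Accountant"},
]


def find_by_description(records, **description):
    """Return the records whose properties match every key/value given."""
    return [
        record for record in records
        if all(record.get(key) == value for key, value in description.items())
    ]
```

Asking only for firstName="Michael" and lastName="Jackson" returns both records. Adding profession="Musician" and albumTitle="Thriller" narrows it to one, with no shared identifier needed.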
When asked about the problem of matching "Donald Rumsfeld" and "Donald H. Rumsfeld" or "al Qaeda" and "al-Qaida", McCool said that there are some decent algorithms for matching names that go beyond simple string comparisons. Sounds like a substantially difficult project to me. Is my laptop an "IBM 390X" or is it "390-X by IBM"?
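To get a feel for why this is hard, here is a small experiment with Python's standard difflib module. It is only an illustration of fuzzy string scoring, not the algorithm McCool had in mind. The `normalize` and `similarity` helpers are my own.

```python
import re
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase a name and drop spaces and punctuation."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def similarity(a, b):
    """Return a score between 0.0 and 1.0 for how alike two names look."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()
```

The two Rumsfeld spellings and the two al Qaeda spellings score high here, but the two laptop names score much lower because the words appear in a different order. Simple character-level scoring only gets you part of the way.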
An interesting sample application was a "related items" sidebar for news stories. In addition to doing simple Capitalized Words extraction from the document, you could envision something that used the RDF graphs to discover that Brett Favre was a football player and match that with eBay auctions for tickets for the Green Bay Packers.
Ze'ev Suraski began by giving a brief technical history of the PHP language. PHP 1 and 2 were developed around 1995. PHP 3, released in June 1998, started using lex/yacc more efficiently and followed an "execute while parsing" model. PHP 4 (May 2000) greatly improved performance (it switched to a "compile first/execute later" paradigm and added reference counting) and improved the web server and extension APIs.
Zend Engine 2 uses a Java-like object model. All objects will be passed by reference, not by value. Other improvements include $obj->foo()->bar() dereferencing, __clone(), destructors (objects may define a __destruct() function), unified constructors, and static class variables. It also adds some support for namespaces and exceptions (try, catch, and throw). In the future, Zend Engine 2 might add multiple inheritance and private member variables.
Suraski went on to show some demos of the Zend Engine 2, pointing out how much easier it is to do things like design patterns when you have an objects-are-references system. He showed an example of a factory method that just didn't work without adding 4 ampersands in strategic locations. We also saw demos of the clone, destructor, and debug backtrace features.
All of his demos used the Zend Studio IDE which seems to have pretty good syntax highlighting, integrated debugger, and a nice help system.
Although the static class variables are a great idea, I didn't love the self::$foo syntax you had to use to access them. Coming from a C++ and Java background, I would expect the $this->$foo syntax to work on the class variable (instead of dynamically creating an instance variable of the same name).
Sander, Jade, Zak, Shane and I headed off to the Mirage for their buffet lunch. Conversation included dead end jobs at dental labs, LindowsOS, differences between Israeli and Diaspora Jews, the conference scene, and handling traffic surges on websites (Fifa World Cup, 9/11, slashdot).
After lunch, I had a drink with Randy Terbush. We talked about Apache 2.0, Tribal Knowledge, where Yahoo! is going with the Open Source movement, and a little about the ASF in general.
I caught the tail end of Stas Bekman's talk on mod_perl 2.0. I own a copy of the classic Writing Apache Modules in Perl and C, but I've only read the C parts of the book. In short, mod_perl 2.0 looks like it's going to be really useful when it's done. I got the impression that it's still not ready for Prime Time, but it's closer than it was 4 months ago. All new technologies need some soak time to work out all of the bugs.
Aaron Bannert's talk was about thread-safe Apache code and how to use the APR when writing modules that need to do fancy synchronization and locking. I didn't have the energy to take good notes and the room was a little too warm which induced some degree of sleepiness. So I'll just summarize the session as follows: writing portable thread-safe code is a pain in the neck, and the APR makes it slightly easier. But it's still a pain.
John Fowler, Sun's Software CTO, spoke about Sun's commitment to the Open Source movement.
Nothing earth-shattering. Usual corporate pitch about how we love the O-S movement, and look how much wonderful stuff we've opened up. Java Community Process 2.5 sounds kinda interesting.
What was more interesting was what Fowler didn't say. I didn't hear the words Sparc or Solaris mentioned once.
According to Craig McClanahan, writing a web application is more difficult than writing a traditional application for a couple of pretty simple reasons:
To complicate matters more, building any large scale application requires a large set of skills: presentation, application (business logic), persistence (files, databases), and application deployment (networks, firewalls, PKI). McClanahan claims we need Model-View-Controller as a fundamental organizing principle in designing and developing these applications.
McClanahan gave a brief overview of the MVC design paradigm in general, and how to use it in the web context in particular.
Overall, the presentation was right on the money, but McClanahan missed the mark when he was talking about the Back button. His claim was that web applications are fundamentally different than surfing around visiting web pages, and that we need to train users not to use the Back button on their web browsers when they're using a web application. I disagree wholeheartedly. The Back button is a very powerful metaphor for several reasons:
Okay, enough ranting. Maybe I'll buy McClanahan a drink tonight and try to convince him to spend some time reading Jakob Nielsen. I'll get back to taking notes on the presentation now.
Sometime while I was composing my rant, McClanahan started talking about Struts. Struts focuses mostly on the Controller aspect of MVC, delegating the Model aspect to other Sun technologies like Enterprise Java Beans and JDBC, and letting you use JSP for the View component.
He used a web logon application as a motivating example to show each of the MVC layers. We saw some slides of JSP syntax for the View layer (login.jsp), and an XML config file that Struts uses for the Controller layer which maps URI paths and "actions" to Java class names and parameters. One of the XML config files essentially tells Struts what classes it will need to instantiate. Another encodes rules about what the data should look like (username and password being non-empty, etc.)
It appears that Struts uses many levels of indirection to make your application as reusable as possible. For example, your business logic Java code shouldn't import anything from org.apache.struts because that would bind it too tightly to the Struts framework. Similarly, you need to avoid importing javax.servlet into your business logic because that binds too closely to a web application. This level of abstraction aids in reusability, but it takes a lot of work to keep it clean.
Sander van Zoest, formerly of MP3.com, gave a great introduction to serving audio via Apache and other servers.
van Zoest described several different ways to deliver audio: HTTP downloading, HTTP streaming, Real Time Streaming Protocol (both on-demand streaming and live broadcast) and Windows Media Player's MMS protocol. Since audio players seem to have pretty dumb HTTP implementations, it's important to get things like MIME types exactly right.
He listed off a bunch of common audio formats and how to configure their extensions and MIME types correctly (pointing out that the correct MIME type for .mp3 files is audio/mpeg, but most websites get this wrong). He also spent a bit of time on the audio meta-formats such as M3U, SDP, ASX, and SMIL, which describe playlists. These formats are the bridge between your Apache server and your audio player. They are described in more detail in Sander's online notes.
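As a small illustration of both halves of that paragraph, here is a Python sketch of an extension-to-MIME-type table plus a bare-bones M3U playlist generator. Only the audio/mpeg entry comes from the talk. The other MIME types are values I believe are in common use, and the helper names are made up, so check them against Sander's notes and your target players before relying on them.

```python
# Extension -> MIME type. audio/mpeg for .mp3 is from the talk; the other
# entries are assumptions based on commonly used values.
AUDIO_TYPES = {
    ".mp3": "audio/mpeg",
    ".m3u": "audio/x-mpegurl",
    ".sdp": "application/sdp",
    ".smil": "application/smil",
}


def content_type_for(filename, default="application/octet-stream"):
    """Pick the Content-Type header value for a file based on its extension."""
    dot = filename.rfind(".")
    extension = filename[dot:].lower() if dot != -1 else ""
    return AUDIO_TYPES.get(extension, default)


def make_m3u(track_urls):
    """Build a basic M3U playlist: one track URL per line."""
    return "".join(url + "\n" for url in track_urls)
```

The web server hands the player the playlist with the playlist's MIME type, and the player then fetches each URL listed inside it.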
Next, we looked at configuring Apache to serve large audio files. In short, you usually need to set your MaxClients and TimeOut directives to large values. Also, at the operating system level, you probably need to increase the size of your TCP listen queues to avoid clients seeing "connection refused" messages.
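Some back-of-the-envelope arithmetic shows why MaxClients has to be large. This sketch assumes each in-flight download ties up one server slot for its whole duration, and the request rate, file size, and modem speed in the example are made-up numbers rather than figures from the talk.

```python
import math


def concurrent_downloads(requests_per_second, file_bytes, client_bits_per_second):
    """Estimate how many downloads are in flight at the same time.

    Uses Little's law: average concurrency = arrival rate * time in system.
    """
    seconds_per_download = file_bytes * 8 / client_bits_per_second
    return math.ceil(requests_per_second * seconds_per_download)
```

For example, `concurrent_downloads(0.5, 4_000_000, 56_000)` comes to 286: a 4 MB file takes almost ten minutes over a 56 kbit/s modem, so even one new request every two seconds keeps nearly 300 connections open at once.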
We also examined the Shoutcast and Icecast protocols for streaming audio.
A couple of questions from the audience asked if anyone was writing Apache 2.0 Protocol modules for RTSP and RTP, or whether RealNetworks' new Helix initiative was working with the ASF. van Zoest answered that it seemed like there could be some real synergy there (using the same Apache server and configuration files to serve both HTTP and non-HTTP streaming audio) but that there hadn't been much interaction between the two groups so far. He also mentioned licensing incompatibilities as a potential barrier to integration.
Theo challenged the audience to recognize the difference between replicateable data and non-replicateable data. Again the theme of the right tool for the job came up. Replicateable data needs marginal protection so you can use commodity hardware. Non-replicateable data needs single-point reliability, so you should consider "Enterprise" hardware.
He then put up a picture of the typical two tier model of load-balancers, content web servers, image web servers and Master/Slave DBs. I'm glad to see that he pointed out the idea of splitting images out onto a separate web server; simple trick which can help you scale. The typical three tier architecture looks mostly the same, but adds Application Servers and some more load balancers. This picture looks a lot more like the Yahoo! architecture.
Theo described hardware- and software-based load-balancing alternatives, and the tradeoffs. I was tickled to see DNS Round Robin as one of the software load balancing choices. OmniTI seems to really like wackamole and mod_backhand. In general, he seemed to prefer free software solutions over the expensive black box hardware solutions. I guess this is probably because he makes money off his consulting practice by customizing all of that complicated software.
Diving into the software load-balancing alternatives, he described a project called Walrus. Walrus tries to pick the right server cluster by taking advantage of something in the DNS RFC which says that clients are supposed to measure DNS latency and pick the "closest" DNS server. Eventually users migrate towards one DNS server, and those servers (east coast vs. west coast) return disparate sets of IP addresses. Walrus is great in theory, but DNS isn't implemented consistently on all clients, so it doesn't work universally.
Theo proposes using Shared IP for DNS servers (but not for Web clusters) and assigning the same IP address to your DNS servers in different locations. This only really works well if your network provider is the same in both places and willing to work with you to make it happen.
George felt that the rsync/scp/ftp method of collecting logs was terrible. He doesn't like the fact that it uses unicast, so if you need to copy the logs to more than one place you need to do it multiple times, and he really disliked the fact that you can't run real-time analysis on the logs.
He examined using syslog as an alternative to support real-time logging to a loghost, but due to the fact that it's built on top of UDP, it's unreliable (which might not work well with your business requirements). Also, the syslog implementations on many hosts are inefficient.
Database logging solves the reliability, real-time and centralization problems, but all of that relational database overhead makes it substantially slower than writing to a file. And all of those rows start to add up quickly. Imagine a website like Yahoo! with over 1.5 billion pageviews a day.
mod_log_spread takes a reliable multicast approach, which allows for realtime processing of log data. George pointed out that realtime processing is fantastic for helping to notify you of things like 500 Internal Server Errors so you know when to do some on-the-fly debugging in your production environment.
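To make the "notify me about 500s" idea concrete, here's a minimal sketch of the kind of consumer you might hang off a realtime log stream. It is not mod_log_spread's API; it just assumes the log lines arrive in Common Log Format from some iterable. The function name server_errors and the sample log lines are made up for illustration.

```python
import re
from typing import Iterable, Iterator, Tuple

# Common Log Format: host ident user [time] "request" status bytes
CLF_PATTERN = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+)'
)


def server_errors(lines: Iterable[str]) -> Iterator[Tuple[str, str, int]]:
    """Yield (host, request, status) for every 5xx response in the stream."""
    for line in lines:
        match = CLF_PATTERN.match(line)
        if match is None:
            continue  # skip lines that are not in Common Log Format
        status = int(match.group("status"))
        if 500 <= status <= 599:
            yield match.group("host"), match.group("request"), status


# Hypothetical sample data standing in for a live log feed.
sample = [
    '10.0.0.1 - - [18/Nov/2002:10:15:01 -0800] "GET /index.html HTTP/1.0" 200 2326',
    '10.0.0.2 - - [18/Nov/2002:10:15:02 -0800] "GET /cart.php HTTP/1.0" 500 612',
]
alerts = list(server_errors(sample))
```

Because server_errors is a generator, it works equally well on a finite list or on an endless stream of lines as they arrive.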
Finally, Theo demo'd a cool Cocoa app for seeing real-time web statistics using the Spread Daemon.
Theo and George Schlossnagle gave a 2 hour talk on a hodge-podge of a few topics for scaling large websites. I'll split this into two blogs.
First, George pointed out one of the easiest tricks to optimize a large website: turn KeepAlive Off. No surprise here; Yahoo! has been doing this for a long time. It's very resource-intensive to keep Apache children around for each of your clients, and heavy-traffic sites can't afford to do this (even if it makes the client experience marginally faster).
[Editorial comment: proponents of KeepAlive often point out that the overhead of establishing 13 TCP connections to fetch an HTML page plus a dozen images really sucks when you could simply have a single TCP connection. However, on really large sites, images are often hosted on a completely different host (e.g. images.amazon.com or us.yimg.com), so running KeepAlive on the dynamic HTML machine is pointless. Users usually spend several seconds reading a page before clicking on another link.]
Next, he pointed out that you should set SendBufferSize to your maximal page size (something like 40K or however large your HTML pages tend to be). This way you effectively send each page with a single write() call so your web app (running dynamic stuff like PHP or mod_perl) never needs to block waiting for the client to consume the data.
Use gzip compression as a transfer encoding. You spend more CPU cycles but save on bandwidth. For a large website, bandwidth is far more expensive than buying more CPU power.
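As a rough illustration of the CPU-for-bandwidth trade, here's a small sketch using Python's standard gzip module. The page content is invented and highly repetitive, so the savings it shows are far better than you'd see on real HTML; compress_page is a hypothetical helper, not anything from the talk.

```python
import gzip


def compress_page(html: bytes, level: int = 6) -> bytes:
    """Gzip a response body; higher levels cost more CPU for smaller output."""
    return gzip.compress(html, compresslevel=level)


# Made-up page body, deliberately repetitive so it compresses well.
page = b"<html><body>" + b"<p>Hello, ApacheCon!</p>" * 500 + b"</body></html>"
compressed = compress_page(page)

# Fraction of the original size that actually goes over the wire.
ratio = len(compressed) / len(page)
print(f"{len(page)} bytes -> {len(compressed)} bytes ({ratio:.1%} of original)")
```

The compresslevel knob is where the tradeoff lives: a lower level burns fewer CPU cycles per request but sends more bytes.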
Don't use Apache 1.3 for static content: use Apache 2.0, thttpd, or tux. In other words, use the right tool for the job.
George went into more detail about setting up reverse proxies (also known as HTTP accelerators) to handle clients with slow connections in front of your dynamic content servers. He discussed a couple of different approaches using mod_proxy in conjunction with mod_rewrite and mod_backhand.
George made a compelling argument that commercial caching appliances can never do as good a job of caching your data as you can do yourself using tools like Squid. He pointed out tradeoffs between black-box caching products (don't require any changes to your application) and application-integrated caching (highly efficient, but requires rewriting your app).
Application-integrated caching can use a convenient shared storage system (like an NFS-mounted disk) or can write to more efficient local storage on each host and use some sort of messaging system to communicate to your server pool when they need to dirty or invalidate their caches. This is not terribly difficult to do with PHP or Perl using Spread and the XML-RPC hooks (we saw about 4 or 5 slides on how to implement all that).
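The slides used PHP and Perl with Spread, which I won't try to reproduce from memory. Here's a minimal in-process sketch of the same idea instead: each host keeps its own local cache and drops an entry when an invalidation message arrives. InvalidationBus is a hypothetical stand-in for the messaging system, and LocalCache and the backend dictionary are likewise made up for illustration.

```python
from typing import Callable, Dict, List


class InvalidationBus:
    """Stand-in for a group messaging system: delivers each key to all subscribers."""

    def __init__(self) -> None:
        self._subscribers: List[Callable[[str], None]] = []

    def subscribe(self, callback: Callable[[str], None]) -> None:
        self._subscribers.append(callback)

    def publish(self, key: str) -> None:
        for callback in self._subscribers:
            callback(key)


class LocalCache:
    """Per-host cache that forgets a key whenever the bus announces it is dirty."""

    def __init__(self, bus: InvalidationBus) -> None:
        self._data: Dict[str, str] = {}
        self._bus = bus
        bus.subscribe(self._drop)

    def _drop(self, key: str) -> None:
        self._data.pop(key, None)

    def get(self, key: str, loader: Callable[[str], str]) -> str:
        if key not in self._data:
            self._data[key] = loader(key)  # cache miss: go to the backend
        return self._data[key]

    def invalidate(self, key: str) -> None:
        self._bus.publish(key)  # tell every host, including this one


bus = InvalidationBus()
web1, web2 = LocalCache(bus), LocalCache(bus)
backend = {"headline": "v1"}

web1.get("headline", backend.__getitem__)
web2.get("headline", backend.__getitem__)
backend["headline"] = "v2"
stale = web2.get("headline", backend.__getitem__)  # still served from local cache
web1.invalidate("headline")
fresh = web2.get("headline", backend.__getitem__)  # reloaded after the message
```

In a real pool the bus would be a network messaging layer, so delivery order and lost messages become the hard part; this sketch ignores both.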
Ok, I was a little misguided when I wrote earlier that XML/I18N was off-topic for an Apache conference. While not about the Apache server itself, these technologies are in fact widely used in today's environment of HTTP and Apache.
But how about something completely different: a brand new protocol to replace HTTP? Whoa. (This is why it's great to go to conferences. You get exposed to all of these neat ideas that you don't always have time to think about as you're doing your day job.)
Fielding covered a lot of ground in his 60 minutes. He started off discussing how Web Services is yet another example of how to solve the general problem of Enterprise Application Integration. What's important about Web Services, Fielding points out, is that it helps to solve the integration problem. Ideally, once you have web services, you don't have to do N^2 integrations for N corporate applications.
(Editorial comment: even though Web Services lets a bunch of different applications speak the same protocol, it doesn't mean they understand the same stuff. PeopleSoft's concept of what an Employee means is going to be different from what SAP thinks it means. Web Services can get the two applications in the same room as each other, but it can't get them speaking the same language.)
Fielding went on to explain what's great about HTTP as a protocol, and also pointed out some of the difficulties it presents. He described the REST architecture's influence on the HTTP/1.1 spec, and then gave some further background on the HTTP protocol. He pointed out a few important limitations of HTTP/1.1:
Fielding suggests that a new protocol standard could solve HTTP's current problems in a generic way. "It's not like we're all going to keel over and die if we don't get a replacement for HTTP, but it would be really valuable in some communities." He proposes waka, which one could envision as a sort of HTTP/2.0. In fact, it takes advantage of the HTTP protocol upgrade feature that was implemented in HTTP/1.1. Waka is designed to match the efficiency of the REST architectural style.
Waka adds a handful of new verbs (methods) to HTTP:
Fielding went on in much more detail about all of these cool features of waka. One of my favorites was the ability for clients to define macros. You could use this feature to define a macro for a User-Agent string, then avoid sending all of those bytes on future interactions. I also like the fact that due to the asynchronicity of the protocol, you can interleave data and metadata. This could be pretty handy if you realized that you needed to issue a Set-Cookie header in the middle of a response.
The transactional support is also pretty important. Imagine that you're trying to make an online payment of a non-trivial amount of money, and your flaky internet connection drops in the middle of the request. Did the payment go through? These days we get things like email confirmation messages, or you can log back on when your internet connection comes back up later and see if the transaction is mentioned in the transaction history. Waka could provide more protocol-level support.
Lastly, Fielding points out that waka is very much a work in progress. It hasn't been fully spec'd or implemented, but he's working actively on it. Expect to hear more about it in the coming year or so.
[Update: Apparently I misunderstood Fielding's comments about the N vs. N^2 integration problem. According to Jeff Bone, "The basic argument is that type-specific interfaces lead to O(N^2) integrations, while generic interfaces lead to O(N) integrations. Web Services as cast today (SOAP as RPC, etc.) have type-specific interfaces and O(N^2) integration complexity; truly RESTful Web Services would in contrast use generic interfaces, and have integration complexity O(N)."]
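For a feel of how quickly those two growth rates diverge, here's a small arithmetic sketch. It assumes that "type-specific" means one custom integration per pair of applications, and "generic" means one adapter per application; both function names are made up for illustration.

```python
def pairwise_integrations(n: int) -> int:
    """Custom integrations needed if every pair of N apps is wired up directly."""
    return n * (n - 1) // 2


def generic_integrations(n: int) -> int:
    """Adapters needed if each of N apps implements one shared generic interface."""
    return n


for apps in (2, 5, 10, 50):
    print(apps, pairwise_integrations(apps), generic_integrations(apps))
```

With 10 applications that's 45 pairwise integrations versus 10 adapters, and the gap keeps widening as N grows.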
After lunch, I headed off to see a slightly-off-topic presentation on XML and Internationalization by Yahoo!'s own Sander van Zoest.
van Zoest began by discussing Unicode overall, covering the UTF variants such as UTF-8 and the BOM (Byte Order Mark). Moving into XML itself, he gave an overview of the intended use of the xml:lang attribute and the ISO-639-2 language codes and ISO-3166 country codes.
XML also supports Numeric Character References such as &#x20AC; for the EURO SIGN (€). These may be expressed in hex (&#xHHHH;) or decimal (&#DDDD;) and always contain the Unicode code point value (regardless of which UTF scheme the document is encoded in). NCRs can be handy when you need to represent a character that does not exist in the document's encoding scheme, but they can't be used in element or attribute names, or in CDATA and PIs.
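Here's a quick sketch of both NCR forms in Python, plus the standard library's built-in way of falling back to NCRs when a character doesn't fit the target encoding. The helper names to_ncr and encode_with_ncrs are made up for illustration; the xmlcharrefreplace error handler is part of Python's codec machinery and emits the decimal form.

```python
import html


def to_ncr(ch: str, hex_form: bool = True) -> str:
    """Return the numeric character reference for a single character."""
    code_point = ord(ch)
    return f"&#x{code_point:X};" if hex_form else f"&#{code_point};"


def encode_with_ncrs(text: str, encoding: str) -> bytes:
    """Encode text, replacing anything the encoding can't represent with an NCR."""
    return text.encode(encoding, errors="xmlcharrefreplace")


euro_hex = to_ncr("€")             # hex form
euro_dec = to_ncr("€", False)      # decimal form
latin1 = encode_with_ncrs("Price: 10 €", "latin-1")  # ISO-8859-1 has no euro sign
round_trip = html.unescape(euro_hex)
```

Both forms carry the same code point (U+20AC, decimal 8364), which is why the reference stays valid no matter how the document itself is encoded.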
The presentation gave examples of how to do character set transformations using XSLT, Perl 5.8, and Java.
We got the usual pitch to use tags with semantic value such as <important> instead of <b> and to use stylesheets to do presentation instead of cluttering up the markup.
van Zoest failed to mention the perennial question on this subject: why the heck does "i18n" stand for "internationalization"? The answer is that there are 18 letters between the beginning "i" and ending "n" in the word "internationalization".
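The counting rule is simple enough to write down. Here's a tiny sketch; numeronym is just a name I picked for the helper.

```python
def numeronym(word: str) -> str:
    """Abbreviate a word as first letter + count of inner letters + last letter."""
    if len(word) < 3:
        return word  # nothing between the first and last letter to count
    return f"{word[0]}{len(word) - 2}{word[-1]}"


print(numeronym("internationalization"))
print(numeronym("localization"))
```

The same rule is where "l10n" for "localization" comes from.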
I went to the talk on Apache 2.0 Filters by Greg Ames. I already knew a little bit about the Apache 2.0 Filtered I/O model from a session at the O'Reilly Open Source conference, but since I'm going to sit down and write one Real Soon Now, I oughta learn a little more about 'em.
Ames gave an example of the bucket brigade API by showing snippets of the mod_case_filter code. Looks pretty elegant and simple.
He then went into details about why the naive approach fails miserably from a performance perspective, and showed some examples of how to do a filter the Right Way. It turns out that this is really complex.
Aside from all sorts of error conditions, there are lots of things to worry about with resource management. Do you allocate too much virtual memory? How often do you flush? Looks like you need to regularly flip between blocking and non-blocking I/O to do this right.
Filters that need to examine every single byte of the input (such as things that parse HTML or other tags) are even more complicated because you need to allocate private memory when a tag spans more than one bucket. Bleh. My mod_highlight_filter idea is going to be difficult to implement.
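To show why the spanning problem forces you to keep private state, here's a minimal Python sketch of the same issue outside Apache. It is not the bucket brigade API. It highlights a search term in a stream of text chunks, where the term may be split across two chunks. The class name StreamingHighlighter and the <b> markers are assumptions for illustration, and the sketch makes no attempt to skip matches inside HTML tags.

```python
class StreamingHighlighter:
    """Wrap each occurrence of a term in markers, even across chunk boundaries."""

    def __init__(self, term: str, open_mark: str = "<b>", close_mark: str = "</b>") -> None:
        if not term:
            raise ValueError("term must be non-empty")
        self.term = term
        self.open_mark = open_mark
        self.close_mark = close_mark
        self._carry = ""  # private buffer: tail of the previous chunk

    def feed(self, chunk: str) -> str:
        data = self._carry + chunk
        out = []
        pos = 0
        while True:
            idx = data.find(self.term, pos)
            if idx == -1:
                break
            out.append(data[pos:idx])
            out.append(self.open_mark + self.term + self.close_mark)
            pos = idx + len(self.term)
        # Hold back the last len(term) - 1 characters: they might be the
        # start of a match that only completes in the next chunk.
        tail_start = max(pos, len(data) - (len(self.term) - 1))
        out.append(data[pos:tail_start])
        self._carry = data[tail_start:]
        return "".join(out)

    def flush(self) -> str:
        """Emit whatever is still buffered once the stream ends."""
        out, self._carry = self._carry, ""
        return out


highlighter = StreamingHighlighter("apache")
pieces = [highlighter.feed("I love apa"), highlighter.feed("che 2.0"), highlighter.flush()]
result = "".join(pieces)
```

The held-back tail is exactly the kind of private memory the talk warned about: it has to outlive the chunk it came from, and somebody has to remember to flush it at end of stream.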
Ames then talked about the mod_ext_filter module from the Apache Directive perspective. I would've rather seen some slides about the implementation of this rather complex filter, but perhaps that would have been too technical for the audience.
He also discussed some tricks about how to debug Apache more easily with gdb and using 2 Listen statements (as a way to avoid starting with the -X option), and some useful gdb macros for your ~/.gdbinit file which make examining the bucket brigade easier. Cool tips. I guess I misjudged the technical level here; he probably skipped the implementation of mod_ext_filter because it would've taken too much time.
Tim O'Reilly gave this morning's keynote address. (Actually, what's bizarre is that he's giving the keynote address right now and I'm blogging via an 802.11b WLAN.)
O'Reilly spoke about early adopters being a good predictor for technology trends. He compared the models of Napster and MP3.com (distributed vs. client-server models) and how it often takes someone to look at technology in a completely different way in order to make progress -- cheap local storage and always-on networking are changing the computing landscape. He says the killer apps of today are all network applications: web, mail, chat, music sharing.
The best laugh came at the moment when he said that he thinks the phrase "Paradigm Shift" gets overused so much that it is starting to generate groans the way the phrase "The Knights Who Say Nee!" has done for years.
O'Reilly also spoke about applications migrating towards platforms. For example, instant messaging is an application (AIM, Y! Messenger, MSN Messenger) but it is becoming a platform (Jabber, AIM-iChat integration).
Before the talk, I actually had breakfast with O'Reilly (the restaurant was packed and we both grabbed seats at the same table) and we talked about the world of free software. He suggested writing an article for the O'Reilly newsletter about Y! moving away from yapache (our Apache web server variant) towards a more standard Apache server. (I mentioned our weird mod_yahoo_ccgi thing which is like a crippled version of mod_so, but we invented our own because back in 1996 we had a need for DSOs before they were directly supported in the Apache server.) After the PHP news "debacle" last month, we'll see if I can get permission to write openly about the subject.
I'm off to Vegas tonight for the ApacheCon conference. I'm looking forward to learning about what people are doing with Apache 2.0.
I've got this idea for a cool Apache 2.0 filter which I'm planning to work on in January. Basically, I'd like to be able to highlight search terms on a web page. You could tell what the user searched for by looking at the HTTP Referer header for patterns like http://www.google.com/search?q=search+terms+here and then highlight them as they appear in the page.
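The Referer-parsing half of that idea is easy to sketch in Python with the standard urllib.parse module. The function name search_terms_from_referer is made up, and the sketch assumes the search engine puts the query in a parameter named q, as in the URL pattern above; it doesn't check which site the referer came from.

```python
from typing import List
from urllib.parse import parse_qs, urlsplit


def search_terms_from_referer(referer: str, param: str = "q") -> List[str]:
    """Pull the individual search terms out of a search-engine referer URL."""
    query = urlsplit(referer).query
    values = parse_qs(query).get(param, [])
    if not values:
        return []
    # parse_qs already decodes '+' and %XX escapes, so just split on whitespace.
    return values[0].split()


terms = search_terms_from_referer("http://www.google.com/search?q=search+terms+here")
```

A real filter would still need a list of known search engines and their parameter names, since not every site uses q.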
We could certainly use such a feature at work internally, and it would be yet another incentive for folks to make the switch from Apache 1.3 to Apache 2.0.
Here's a neat little trick. If you want to serve out PHP scripts without showing the .php extension, you can add something like this to your httpd.conf file:
DefaultType application/x-httpd-php
DirectoryIndex index index.html
Those directives will tell Apache that if there is no extension on a file, it should run the file through the PHP interpreter. On the filesystem itself, any PHP scripts can be called foo.php or simply foo (i.e. have no extension at all).
In a standard Apache configuration, DefaultType is set to text/plain. This may have made sense in 1996, but these days pretty much everything is HTML.
The DefaultType approach is substantially more efficient than Options MultiViews because there is no need to do readdir() calls to figure out what file to serve out. It also gives you added flexibility: if you ever rewrite part of your site to use a different technology (switch to mod_perl or whatever), the links won't rot. And it's 4 fewer bytes to send for each GET request!
Hiding the .php extension doesn't really make your site any safer, because anyone who wants to hack your site can simply guess that you're running PHP behind the scenes and attempt well-known exploits. This could be best described as "security through obscurity" which gives engineers a warm and fuzzy feeling, but isn't really any more secure.