ApacheCon: Closing Keynote

In his keynote address “New Ways of Thinking About Security: Open Source Thinking in a Bunged-up World”, Richard Thieme spoke about the contrast between linear thinking and network thinking in society. He posits that the Open Source movement represents a new kind of freedom, one that is chaotic and continually evolving.

Thieme spoke about how members of the CIA and the KGB had more in common with each other than they did with their respective political environments. Even though we think of free-market and communist countries as being opposites, the assumptions and schemas that the intelligence community used to understand and categorize the world set its members apart from everyone else in their own countries. He drew a parallel to Open Source networks of programmers.

He claimed that writing code is a form of leadership, because leadership is saying what you think of the world in a clear and visceral way. It doesn’t require structural authority. Rather, writing code is functional leadership. Since leadership has two components (saying and doing), coding is in fact a true expression of leadership because it both expresses ideas and it performs a function.

He also spoke about authorship and intellectual property rights, and how these concepts were completely foreign before the invention of the printing press. Centuries later, Open Source and distributed networking are working to undermine those concepts. How do you define property when you share the information back and forth?

Security, identity, borders, and intellectual property rights are a function of clear boundaries. But, Thieme says, boundaries are not clear (and they’re getting less clear). We are moving towards a collective identity, away from the nation-state.

He wrapped up by describing Richard Stallman as a saint (saying that all saints are a little crazy) and arguing that it takes someone of an obsessive-compulsive mind to make truly amazing things happen.

ApacheCon: TAP and the Semantic Web

Rob McCool, continuing in the spirit of the easy to understand but never-adopted Meta Content Framework and the standard but substantially harder to grok Resource Description Framework, presented TAP.

The overall problem is that there is a ton of data out there on the web, but it’s not in machine-understandable form. McCool is looking at addressing key problems of supporting a true web of data: query languages, canonical names, caching, and a system of trust (to avoid spammers).

On the query languages front, McCool believes that SQL and XQL are overkill, but HTTP GET is not specific enough. So TAP defines a GetData protocol. It follows in the spirit of DNS, where you can use a gethostbyname() function to access the service. TAP uses RDF schemas to describe graphs of data, and SOAP as the over-the-wire protocol for querying.

McCool described a module called TAPache to implement the GetData protocol. In the same way that Apache provides an htdocs directory, it provides an RDF repository. His stated goal for TAPache is to be the “BIND” application for data.

Since Amazon.com and CDnow might have different identifiers for the same album, TAP doesn’t require using globally unique identifiers. But how do you tell the difference between “Michael Jackson” the musician and “Michael Jackson” your next-door neighbor? TAP addresses this using reference by description. When you want to do a query for “Michael Jackson”, you ask for someone whose firstName=”Michael” and lastName=”Jackson” and profession=”Musician” and who is the author of an album with title=”Thriller”.
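
To make reference by description a little more concrete, here is a minimal Python sketch. It is not TAP’s GetData API or its RDF model; the dictionaries, the find_by_description() helper, and the second “Michael Jackson” record are all made up for illustration.

```python
# Toy data: each node is a dict of properties; "authorOf" links to albums.
# All ids and records here are invented for illustration.
ALBUMS = {
    "album1": {"title": "Thriller"},
    "album2": {"title": "Garage Demos"},
}

PEOPLE = {
    "person1": {"firstName": "Michael", "lastName": "Jackson",
                "profession": "Musician", "authorOf": ["album1"]},
    "person2": {"firstName": "Michael", "lastName": "Jackson",
                "profession": "Accountant", "authorOf": []},
}


def find_by_description(people, albums, description, album_title=None):
    """Return ids of people whose properties match every item in description.

    If album_title is given, the person must also be the author of an
    album with that title.
    """
    matches = []
    for person_id, props in people.items():
        if any(props.get(key) != value for key, value in description.items()):
            continue
        if album_title is not None:
            titles = {albums[a]["title"] for a in props.get("authorOf", [])}
            if album_title not in titles:
                continue
        matches.append(person_id)
    return matches


result = find_by_description(
    PEOPLE, ALBUMS,
    {"firstName": "Michael", "lastName": "Jackson", "profession": "Musician"},
    album_title="Thriller",
)
print(result)
```

Neither record needs a globally unique identifier to be found; the description alone is specific enough to single out the musician.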

When asked about the problem of matching “Donald Rumsfeld” and “Donald H. Rumsfeld” or “al Qaeda” and “al-Qaida”, McCool said that there are some decent algorithms for matching names that go beyond simple string comparisons. Sounds like a substantially difficult project to me. Is my laptop an “IBM 390X” or is it “390-X by IBM”?
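
McCool didn’t say which algorithms he had in mind, so the following is only a rough sketch of what “beyond simple string comparisons” could look like, using Python’s standard difflib module. The normalization rules (lowercasing, stripping punctuation, dropping one-letter initials) are my own assumptions.

```python
import re
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase, turn punctuation into spaces, and drop one-letter initials."""
    words = re.sub(r"[^a-z0-9]+", " ", name.lower()).split()
    return " ".join(w for w in words if len(w) > 1)


def similarity(a, b):
    """Return a ratio between 0.0 and 1.0 for two normalized names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


print(similarity("Donald Rumsfeld", "Donald H. Rumsfeld"))
print(similarity("al Qaeda", "al-Qaida"))
print(similarity("IBM 390X", "390-X by IBM"))
```

The first two pairs score high, but the laptop example still scores poorly because the words appear in a different order, which is exactly why this seems like a hard problem.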

An interesting sample application was a “related items” sidebar for news stories. In addition to doing simple Capitalized Words extraction from the document, you could envision something that used the RDF graphs to discover that Brett Favre was a football player and match that with eBay auctions for tickets for the Green Bay Packers.
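
Here is a small Python sketch of how the sidebar idea might hang together. The regular expression is a naive take on Capitalized Words extraction, and the FACTS dictionary is a hypothetical stand-in for whatever the RDF graphs would actually return; the sample sentence is made up.

```python
import re

# Runs of two or more capitalized words, e.g. "Brett Favre".
CAPITALIZED_RUN = re.compile(r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)+\b")

# Hypothetical stand-in for facts that would come from RDF graphs.
FACTS = {
    "Brett Favre": {"type": "FootballPlayer", "playsFor": "Green Bay Packers"},
}


def related_queries(text, facts):
    """Map each recognized name in text to a suggested search phrase."""
    suggestions = {}
    for name in CAPITALIZED_RUN.findall(text):
        info = facts.get(name)
        if info and "playsFor" in info:
            suggestions[name] = info["playsFor"] + " tickets"
    return suggestions


story = "The quarterback Brett Favre threw three touchdowns on Sunday."
print(related_queries(story, FACTS))
```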

ApacheCon: Zend Engine 2 and PHP 5

Ze’ev Suraski began by giving a brief technical history of the PHP language. PHP 1 and 2 were developed around 1995. PHP 3, which used lex/yacc more efficiently but still executed code while parsing, was released in June 1998. PHP 4 (May 2000) greatly improved performance (switched to a “compile first/execute later” paradigm, added reference counting) and improved the web server and extension APIs.

Zend Engine 2 uses a Java-like object model. All objects will be passed by reference, not by value. Other improvements include $obj->foo()->bar() dereferencing, __clone(), destructors (objects may define a __destruct() function), unified constructors, and static class variables. It also adds some support for namespaces and exceptions (try, catch, and throw). In the future, Zend Engine 2 might add multiple inheritance and private member variables.

Suraski went on to show some demos of the Zend Engine 2, pointing out how much easier it is to do things like design patterns when you have an objects-are-references system. He showed an example of a factory method that just didn’t work without adding 4 ampersands in strategic locations. We also saw demos of the clone, destructor, and debug backtrace features.

All of his demos used the Zend Studio IDE which seems to have pretty good syntax highlighting, integrated debugger, and a nice help system.

Although the static class variables are a great idea, I didn’t love the self::$foo syntax you had to use to access them. Coming from a C++ and Java background, I would expect the $this->$foo syntax to work on the class variable (instead of dynamically creating an instance variable of the same name).

ApacheCon: Wednesday Afternoon



Sander, Jade, Zak, Shane and I headed off to the Mirage for their buffet lunch. Conversation included dead end jobs at dental labs, LindowsOS, differences between Israeli and Diaspora Jews, the conference scene, and handling traffic surges on websites (FIFA World Cup, 9/11, Slashdot).

After lunch, I had a drink with Randy Terbush. We talked about Apache 2.0, Tribal Knowledge, where Yahoo! is going with the Open Source movement, and a little about the ASF in general.

mod_perl 2.0

I caught the tail end of Stas Bekman’s talk on mod_perl 2.0. I own a copy of the classic Writing Apache Modules in Perl and C, but I’ve only read the C parts of the book. In short, mod_perl 2.0 looks like it’s going to be really useful when it’s done. I got the impression that it’s still not ready for Prime Time, but it’s closer than it was 4 months ago. All new technologies need some soak time to work out all of the bugs.


Advanced Topics in Module Design: Threadsafety and Portability

Aaron Bannert’s talk was about thread-safe Apache code and how to use the APR when writing modules that need to do fancy synchronization and locking. I didn’t have the energy to take good notes and the room was a little too warm which induced some degree of sleepiness. So I’ll just summarize the session as follows: writing portable thread-safe code is a pain in the neck, and the APR makes it slightly easier. But it’s still a pain.

ApacheCon: Sun and Open Source

John Fowler, Sun’s Software CTO, spoke about Sun’s commitment to the Open Source movement.

Nothing earth-shattering. Usual corporate pitch about how we love the O-S movement, and look how much wonderful stuff we’ve opened up. Java Community Process 2.5 sounds kinda interesting.

What was more interesting was what Fowler didn’t say. I didn’t hear the words SPARC or Solaris mentioned once.

ApacheCon: Struts for MVC Web Apps

According to Craig McClanahan, writing a web application is more difficult than writing a traditional application for a couple of pretty simple reasons:

  1. HTTP is a stateless protocol, and applications need to maintain state

  2. Since the web is world-wide, applications are expected to be internationalized

To complicate matters more, building any large scale application requires a large set of skills: presentation, application (business logic), persistence (files, databases), and application deployment (networks, firewalls, PKI). McClanahan claims we need Model-View-Controller as a fundamental organizing principle in designing and developing these applications.

McClanahan gave a brief overview of the MVC design paradigm in general, and how to use it in the web context in particular.

Overall, the presentation was right on the money, but McClanahan missed the mark when he was talking about the Back button. His claim was that web applications are fundamentally different from surfing around visiting web pages, and that we need to train users not to use the Back button on their web browsers when they’re using a web application. I disagree wholeheartedly. The Back button is a very powerful metaphor for several reasons:

  • The web is all about simplicity and a uniform interface. If it’s blue and underlined, you can click on it. If it looks like a GUI Widget (a button, text box, pull-down, etc.) you can use it the way you use other operating system Widgets. You can Search. And you can hit the Back button when you want to stop what you’re doing and go back to what you were doing before. In other words, web apps are not really that different from web content.

  • Users are accustomed to using the Back button. They’ve already invested a bunch of energy learning how to use their browsers, and the Back button is probably the 3rd-most commonly used UI feature (right up there with clicking on links and scrolling, probably used more often than even typing strings into text boxes).
  • When a user hits the Back button, they are in control of the experience. If you try to force users to move through your web app using a specific flow because you want to be in control, you’ll fail. You’ve got to respect your users; they’re not slaves to your web application. To be successful on the Internet you must recognize that the user is always free to walk away at any moment, and hitting the Back button is just one way they can do it. Would you rather that the user close their web browser completely and start over by visiting one of your competitors?

Okay, enough ranting. Maybe I’ll buy McClanahan a drink tonight and try to convince him to spend some time reading Jakob Nielsen. I’ll get back to taking notes on the presentation now.

Sometime while I was composing my rant, McClanahan started talking about Struts. Struts focuses mostly on the Controller aspect of MVC, delegating the Model aspect to other Sun technologies like Enterprise Java Beans and JDBC, and letting you use JSP for the View component.

He used a web logon application as a motivating example to show each of the MVC layers. We saw some slides of JSP syntax for the View layer (login.jsp), and the XML config file that Struts uses for the Controller layer, which maps URI paths and “actions” to Java class names and parameters. One of the XML config files essentially tells Struts what classes it will need to instantiate. Another encodes rules about what the data should look like (username and password being non-empty, etc.).
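
To picture what the Controller layer is doing, here is a rough Python sketch of the same idea: a table that maps URI paths to actions plus simple validation rules. This is not Struts code or its config format; the names ACTION_MAPPINGS, FORWARDS, dispatch(), and /mainMenu.jsp are all hypothetical, and the hard-coded password check is only there to keep the example short.

```python
def logon_action(form):
    """Business logic: knows nothing about HTTP or the web framework."""
    return "success" if form["password"] == "secret" else "failure"


# Hypothetical controller config: URI path -> handler and required fields.
ACTION_MAPPINGS = {
    "/logon": {"handler": logon_action, "required": ["username", "password"]},
}

# Hypothetical mapping from action outcomes to the next view.
FORWARDS = {
    "success": "/mainMenu.jsp",
    "failure": "/login.jsp",
    "invalid": "/login.jsp",
}


def dispatch(path, form):
    """Validate the form, run the mapped action, and return the next view."""
    mapping = ACTION_MAPPINGS[path]
    if any(not form.get(field) for field in mapping["required"]):
        return FORWARDS["invalid"]
    return FORWARDS[mapping["handler"](form)]


print(dispatch("/logon", {"username": "craig", "password": "secret"}))
print(dispatch("/logon", {"username": "craig", "password": ""}))
```

Note that logon_action() takes a plain dictionary and returns a plain string, which is the kind of separation between business logic and framework that the talk was advocating.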

It appears that Struts uses many levels of indirection to make your application as reusable as possible. For example, your business logic Java code shouldn’t import anything from org.apache.struts because that would bind it too tightly to the Struts framework. Similarly, you need to avoid importing javax.servlet into your business logic because that binds it too closely to a web application. This level of abstraction aids in reusability, but it takes a lot of work to keep it clean.

ApacheCon: Audio and Apache

Sander van Zoest, formerly of MP3.com, gave a great introduction to serving audio via Apache and other servers.

Van Zoest described several different ways to deliver audio: HTTP downloading, HTTP streaming, Real Time Streaming Protocol (both on-demand streaming and live broadcast) and Windows Media Player’s MMS protocol. Since audio players seem to have pretty dumb HTTP implementations, it’s important to get things like MIME types exactly right.

He listed off a bunch of common audio formats and how to configure their extensions and MIME types correctly (pointing out that the correct MIME type for .mp3 files is audio/mpeg, but most websites get this wrong). He also spent a bit of time on the audio meta-formats such as M3U, SDP, ASX, and SMIL, which describe playlists. These formats are the bridge between your Apache server and your audio player. They are described in more detail in Sander’s online notes.
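
As a small illustration of the playlist “bridge”, here is a Python sketch that looks up a MIME type by extension and builds a bare-bones M3U playlist (one URL per line). The audio/mpeg mapping is the one from the talk; the audio/x-mpegurl value for .m3u is a commonly used one that I’m assuming here, and the helper names and example.com URLs are made up.

```python
# Extension -> MIME type. The .mp3 entry comes from the talk; the .m3u
# entry is a commonly used value and should be treated as an assumption.
AUDIO_MIME_TYPES = {
    ".mp3": "audio/mpeg",
    ".m3u": "audio/x-mpegurl",
}


def mime_type_for(filename):
    """Look up the MIME type by file extension, or None if unknown."""
    dot = filename.rfind(".")
    if dot == -1:
        return None
    return AUDIO_MIME_TYPES.get(filename[dot:].lower())


def build_m3u(base_url, filenames):
    """Return a simple M3U playlist: one absolute URL per line."""
    base = base_url.rstrip("/")
    return "".join(f"{base}/{name}\n" for name in filenames)


print(mime_type_for("track01.mp3"))
print(build_m3u("http://example.com/audio/", ["track01.mp3", "track02.mp3"]))
```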

Next, we looked at configuring Apache to serve large audio files. In short, you usually need to set your MaxClients and TimeOut directives to large values. Also, at the operating system level, you probably need to increase the size of your TCP listen queues to avoid clients seeing “connection refused” messages.

We also examined the Shoutcast and Icecast protocols for streaming audio.

A couple of questions from the audience asked if anyone was writing Apache 2.0 Protocol modules for RTSP and RTP, or whether RealNetworks’ new Helix initiative was working with the ASF. Van Zoest answered that it seemed like there could be some real synergy there (using the same Apache server and configuration files to serve both HTTP and non-HTTP streaming audio) but that there hadn’t been much interaction between the two groups so far. He also mentioned licensing incompatibilities as a potential barrier to integration.

ApacheCon: Scalable Internet Architectures 2


High Availability and Load Balancing

Theo challenged the audience to recognize the difference between replicateable data and non-replicateable data. Again the theme of the right tool for the job came up. Replicateable data needs marginal protection so you can use commodity hardware. Non-replicateable data needs single-point reliability, so you should consider “Enterprise” hardware.

He then put up a picture of the typical two tier model of load-balancers, content web servers, image web servers and Master/Slave DBs. I’m glad to see that he pointed out the idea of splitting images out onto a separate web server; it’s a simple trick which can help you scale. The typical three tier architecture looks mostly the same, but adds Application Servers and some more load balancers. This picture looks a lot more like the Yahoo! architecture.

Theo described hardware- and software-based load-balancing alternatives, and the tradeoffs. I was tickled to see DNS Round Robin as one of the software load balancing choices. OmniTI seems to really like wackamole and mod_backhand. In general, he seemed to prefer free software solutions over the expensive black box hardware solutions. I guess this is probably because he makes money off his consulting practice by customizing all of that complicated software.
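
For anyone who hasn’t seen DNS Round Robin in action, the behavior can be pictured with a few lines of Python. This is only a toy model of a name server rotating its answer list, not anything shown in the talk; the class name is made up and the addresses come from a documentation-only range.

```python
from collections import deque


class RoundRobinRecords:
    """Rotate a list of addresses by one position on every lookup."""

    def __init__(self, addresses):
        self._addresses = deque(addresses)

    def resolve(self):
        answer = list(self._addresses)
        self._addresses.rotate(-1)  # next lookup starts with the next address
        return answer


records = RoundRobinRecords(["192.0.2.1", "192.0.2.2", "192.0.2.3"])
first_choices = [records.resolve()[0] for _ in range(4)]
print(first_choices)
```

Clients that simply take the first address end up spread across the servers, with no awareness of whether any given server is actually healthy or busy.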

Diving into the software load-balancing alternatives, he described a project called Walrus. Walrus tries to pick the right server cluster by taking advantage of something in the DNS RFC which says that clients are supposed to measure DNS latency and pick the “closest” DNS server. Eventually users migrate towards one DNS server, and those servers (east coast vs. west coast) return disparate sets of IP addresses. Walrus is great in theory, but DNS isn’t implemented consistently on all clients, so it doesn’t work universally.

Theo proposes using Shared IP for DNS servers (but not for Web clusters) and assigning the same IP address to your DNS servers in different locations. This only really works well if your network provider is the same in both places and willing to work with you to make it happen.

Log collection

George felt that the rsync/scp/ftp method of collecting logs was terrible. He doesn’t like the fact that it uses unicast, so if you need to copy the logs to more than one place you need to do it multiple times, and he really disliked the fact that you can’t run real-time analysis on the logs.

He examined using syslog as an alternative to support real-time logging to a loghost, but because it’s built on top of UDP, it’s unreliable (which might not meet your business requirements). Also, the syslog implementations on many hosts are inefficient.

Database logging solves the reliability, real-time and centralization problems, but all of that relational database overhead makes it substantially slower than writing to a file. And all of those rows start to add up quickly. Imagine a website like Yahoo! with over 1.5 billion pageviews a day.

mod_log_spread does a reliable multicast approach which allows for realtime processing of log data. George pointed out that realtime processing is fantastic for helping to notify you of things like 500 Internal Server Errors so you know when to do some on-the-fly debugging in your production environment.
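
The kind of realtime check George described might look something like this Python sketch. It leaves out Spread entirely and just filters an iterable of log lines for 5xx responses; in a real setup the lines would arrive from the log stream rather than a list. The regular expression assumes Common Log Format and the sample lines are invented.

```python
import re

# Matches the request, status, and size fields of a Common Log Format line.
LOG_LINE = re.compile(r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+)')


def server_errors(lines):
    """Yield (status, request) for each line whose status is 5xx."""
    for line in lines:
        match = LOG_LINE.search(line)
        if match and match.group("status").startswith("5"):
            yield int(match.group("status")), match.group("request")


sample = [
    '192.0.2.7 - - [20/Nov/2002:10:00:00 -0800] "GET /index.php HTTP/1.0" 200 5120',
    '192.0.2.9 - - [20/Nov/2002:10:00:01 -0800] "GET /cart.php HTTP/1.0" 500 312',
]
alerts = list(server_errors(sample))
print(alerts)
```

Because server_errors() is a generator, it can consume lines as they arrive instead of waiting for a complete log file to be copied over.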

Finally, Theo demo’d a cool Cocoa app for seeing real-time web statistics using the Spread Daemon.

ApacheCon: Scalable Internet Architectures 1

Theo and George Schlossnagle gave a 2 hour talk on a hodge-podge of a few topics for scaling large websites. I’ll split this into two blogs.

Low-hanging fruit: Apache 1.3 optimizations

First, George pointed out one of the easiest tricks to optimize a large website: turn KeepAlive Off. No surprise here; Yahoo! has been doing this for a long time. It’s very resource-intensive to keep Apache children around for each of your clients, and heavy-traffic sites can’t afford to do this (even if it makes the client experience marginally faster).

[Editorial comment: proponents of KeepAlive often point out that the overhead of establishing 13 TCP connections to fetch an HTML page plus a dozen images really sucks when you could simply have a single TCP connection. However, on really large sites, images are often hosted on a completely different host (e.g. images.amazon.com or us.yimg.com) so running KeepAlive on the dynamic HTML machine is pointless. Users usually spend several seconds reading a page before clicking on another link.]

Next, he pointed out that you should set SendBufferSize to your maximal page size (something like 40K or however large your HTML pages tend to be). This way you effectively send each page with a single write() call so your web app (running dynamic stuff like PHP or mod_perl) never needs to block waiting for the client to consume the data.

Use gzip compression as a transfer encoding. You spend more CPU cycles but save on bandwidth. For a large website, bandwidth is far more expensive than buying more CPU power.
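
To get a feel for the CPU-for-bandwidth tradeoff, here is a quick Python sketch using the standard gzip module. The page is a made-up, highly repetitive chunk of HTML, so the savings it reports are not representative of any real site.

```python
import gzip

# A made-up, repetitive HTML page; real pages will compress differently.
html = (
    "<html><body>" + "<p>Hello, ApacheCon!</p>\n" * 500 + "</body></html>"
).encode("utf-8")

compressed = gzip.compress(html)
saved = 1 - len(compressed) / len(html)
print(f"{len(html)} bytes -> {len(compressed)} bytes ({saved:.0%} saved)")

# The client gets the original bytes back after decompressing.
restored = gzip.decompress(compressed)
```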

Don’t use Apache 1.3 for static content: use Apache 2.0, thttpd, or tux. In other words, use the right tool for the job.

George went into more detail about setting up reverse proxies (also known as HTTP accelerators) to handle clients with slow connections in front of your dynamic content servers. He discussed a couple of different approaches using mod_proxy in conjunction with mod_rewrite and mod_backhand.

Application caching

George made a compelling argument that commercial caching appliances can never do as good a job of caching your data as you can do yourself using tools like Squid. He pointed out tradeoffs between black-box caching products (don’t require any changes to your application) and application-integrated caching (highly efficient, but requires rewriting your app).

Application-integrated caching can use a convenient shared storage system (like an NFS-mounted disk) or can write to more efficient local storage on each host and use some sort of messaging system to communicate to your server pool when they need to dirty or invalidate their caches. This is not terribly difficult to do with PHP or Perl using Spread and the XML-RPC hooks (we saw about 4 or 5 slides on how to implement all that).
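
The slides used PHP and Perl with Spread, which I won’t try to reproduce from memory. Instead, here is a minimal Python sketch of the local-cache-plus-invalidation-message pattern. MessageBus is an in-process stand-in for a real group messaging system, and every class and variable name here is hypothetical.

```python
class MessageBus:
    """In-process stand-in for a group messaging system such as Spread."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self, callback):
        self._subscribers.append(callback)

    def publish(self, key):
        for callback in self._subscribers:
            callback(key)


class HostCache:
    """A per-host local cache that drops entries when told they are stale."""

    def __init__(self, bus, load):
        self._data = {}
        self._load = load  # function that fetches fresh data on a miss
        bus.subscribe(self.invalidate)

    def get(self, key):
        if key not in self._data:
            self._data[key] = self._load(key)
        return self._data[key]

    def invalidate(self, key):
        self._data.pop(key, None)


database = {"headline": "old"}
bus = MessageBus()
web1 = HostCache(bus, database.get)
web2 = HostCache(bus, database.get)

web1.get("headline")              # both hosts now cache "old"
web2.get("headline")
database["headline"] = "new"
stale = web1.get("headline")      # still "old": nobody has been told yet
bus.publish("headline")           # tell every host to drop the key
print(stale, web1.get("headline"), web2.get("headline"))
```

The interesting design question is who publishes the invalidation: in this sketch it is whatever code updates the data, which is why the approach requires changes to the application itself.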

ApacheCon: Waka: a replacement for HTTP

Ok, I was a little misguided when I wrote earlier that XML/I18N was off-topic for an Apache conference. While not about the Apache server itself, these technologies are in fact widely used in today’s environment of HTTP and Apache.

But how about something completely different: a brand new protocol to replace HTTP? Whoa. (This is why it’s great to go to conferences. You get exposed to all of these neat ideas that you don’t always have time to think about as you’re doing your day job.)

Roy Fielding covered a lot of ground in his 60 minutes. He started off discussing how Web Services is yet another example of how to solve the general problem of Enterprise Application Integration. What’s important about Web Services, Fielding points out, is that it helps to solve the integration problem. Ideally, once you have web services, you don’t have to do N^2 integrations for N corporate applications.

(Editorial comment: even though Web Services lets a bunch of different applications speak the same protocol, it doesn’t mean they understand the same stuff. PeopleSoft’s concept of what an Employee means is going to be different from what SAP thinks it means. Web Services can get the two applications in the same room as each other, but it can’t get them speaking the same language.)

Fielding went on to explain what’s great about HTTP as a protocol, and also pointed out some of the difficulties it presents. He described the REST architecture’s influence on the HTTP/1.1 spec, and then gave some further background on the HTTP protocol. He pointed out a few important limitations of HTTP/1.1:

  1. Overhead of MIME-style message syntax

  2. Head-of-line blocking on interactions
  3. Metadata unable to come after data
  4. Server can’t send unsolicited responses
  5. Low-power and bandwidth sensitive devices more severely impacted by verbosity

Fielding suggests that a new protocol standard could solve HTTP’s current problems in a generic way. “It’s not like we’re all going to keel over and die if we don’t get a replacement for HTTP, but it would be really valuable in some communities.” He proposes waka, which one could envision as a sort of HTTP/2.0. In fact, it takes advantage of the HTTP protocol upgrade feature that was implemented in HTTP/1.1. Waka is designed to match the efficiency of the REST architectural style.

Waka adds a handful of new verbs (methods) to HTTP:

  • RENDER – explicit support for transcoding content (display/print/speak)

  • MONITOR – notify me when resource state changes
  • Authoring methods (a la DAV)
  • Request control data – for asynchronicity, transactional support, and QoS

Fielding went on in much more detail about all of these cool features of waka. One of my favorites was the ability for clients to define macros. You could use this feature to define a macro for a User-Agent string, then avoid sending all of those bytes on future interactions. I also like the fact that due to the asynchronicity of the protocol, you can interleave data and metadata. This could be pretty handy if you realized that you needed to issue a Set-Cookie header in the middle of a response.
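
Since waka hasn’t been spec’d yet, I can only guess at what macros would look like on the wire. The following Python sketch just illustrates the byte-saving idea; the “$n” reference syntax and the MacroTable class are entirely invented and have nothing to do with any real waka syntax.

```python
class MacroTable:
    """Replace long, repeated header values with short references.

    Purely illustrative: the "$n" reference syntax is invented here.
    """

    def __init__(self):
        self._ids = {}

    def encode(self, value):
        """Return (wire_text, is_definition) for a header value."""
        if value in self._ids:
            return f"${self._ids[value]}", False
        self._ids[value] = len(self._ids) + 1
        return f"${self._ids[value]}={value}", True


user_agent = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
table = MacroTable()
first = table.encode(user_agent)    # defines the macro and sends the full value
second = table.encode(user_agent)   # later requests send only the reference
print(first)
print(second)
```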

The transactional support is also pretty important. Imagine that you’re trying to make an online payment of a non-trivial amount of money, and your flaky internet connection drops in the middle of the request. Did the payment go through? These days we get things like email confirmation messages, or you can log back on when your internet connection comes back up later and see if the transaction is mentioned in the transaction history. Waka could provide more protocol-level support.

Lastly, Fielding points out that waka is very much a work in progress. It hasn’t been fully spec’d or implemented, but he’s working actively on it. Expect to hear more about it in the coming year or so.

[Update: Apparently I misunderstood Fielding’s comments about the N vs. N^2 integration problem. According to Jeff Bone, “The basic argument is that type-specific interfaces lead to O(N^2) integrations, while generic interfaces lead to O(N) integrations. Web Services as cast today (SOAP as RPC, etc.) have type-specific interfaces and O(N^2) integration complexity; truly RESTful Web Services would in contrast use generic interfaces, and have integration complexity O(N).”]
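
To see how quickly the two curves in Jeff’s explanation diverge, here is a tiny Python sketch. It assumes one integration per pair of applications in the type-specific case (N(N-1)/2, which is O(N^2)) and one adapter per application in the generic case; the function names are my own.

```python
def pairwise_integrations(n):
    """Integrations needed when every pair of applications needs its own."""
    return n * (n - 1) // 2


def generic_integrations(n):
    """Integrations needed when each application implements one shared interface."""
    return n


for n in (5, 10, 50):
    print(n, pairwise_integrations(n), generic_integrations(n))
```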