ApacheCon: Struts for MVC Web Apps

According to Craig McClanahan, writing a web application is more difficult than writing a traditional application for a couple of pretty simple reasons:

  1. HTTP is a stateless protocol, and applications need to maintain state

  2. Since the web is world-wide, applications are expected to be internationalized

To complicate matters more, building any large scale application requires a large set of skills: presentation, application (business logic), persistence (files, databases), and application deployment (networks, firewalls, PKI). McClanahan claims we need Model-View-Controller as a fundamental organizing principle in designing and developing these applications.

McClanahan gave a brief overview of the MVC design paradigm in general, and how to use it in the web context in particular.

Overall, the presentation was right on the money, but McClanahan missed the mark when he was talking about the Back button. His claim was that web applications are fundamentally different from surfing around visiting web pages, and that we need to train users not to use the Back button on their web browsers when they’re using a web application. I disagree wholeheartedly. The Back button is a very powerful metaphor for several reasons:

  • The web is all about simplicity and a uniform interface. If it’s blue and underlined, you can click on it. If it looks like a GUI widget (a button, text box, pull-down, etc.), you can use it the way you use other operating system widgets. You can search. And you can hit the Back button when you want to stop what you’re doing and go back to what you were doing before. In other words, web apps are not really that different from web content.

  • Users are accustomed to using the Back button. They’ve already invested a bunch of energy learning how to use their browsers, and the Back button is probably the third-most commonly used UI feature (right up there with clicking on links and scrolling, and probably used more often than typing strings into text boxes).
  • When a user hits the Back button, they are in control of the experience. If you try to force users to move through your web app using a specific flow because you want to be in control, you’ll fail. You’ve got to respect your users; they’re not slaves to your web application. To be successful on the Internet you must recognize that the user is always free to walk away at any moment, and hitting the Back button is just one way to do it. Would you rather the user close their web browser completely and start over by visiting one of your competitors?

Okay, enough ranting. Maybe I’ll buy McClanahan a drink tonight and try to convince him to spend some time reading Jakob Nielsen. I’ll get back to taking notes on the presentation now.

Sometime while I was composing my rant, McClanahan started talking about Struts. Struts focuses mostly on the Controller aspect of MVC, delegating the Model aspect to other Sun technologies like Enterprise JavaBeans and JDBC, and letting you use JSP for the View component.

He used a web logon application as a motivating example to show each of the MVC layers. We saw some slides of JSP syntax for the View layer (login.jsp), and the XML config files that Struts uses for the Controller layer, which map URI paths and “actions” to Java class names and parameters. One of the XML config files essentially tells Struts what classes it will need to instantiate. Another encodes rules about what the data should look like (username and password being non-empty, etc.)
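
To make that concrete, here’s roughly what such a config looks like in Struts 1.x, reconstructed from memory rather than copied from his slides (the class names and paths are made up):

```xml
<!-- Hypothetical struts-config.xml sketch for the logon example:
     the form-bean and action tell Struts which classes to instantiate,
     and where to forward after the LogonAction runs. -->
<struts-config>
  <form-beans>
    <form-bean name="logonForm" type="com.example.logon.LogonForm"/>
  </form-beans>
  <action-mappings>
    <action path="/logon"
            type="com.example.logon.LogonAction"
            name="logonForm"
            scope="request"
            input="/login.jsp"
            validate="true">
      <forward name="success" path="/mainMenu.jsp"/>
      <forward name="failure" path="/login.jsp"/>
    </action>
  </action-mappings>
</struts-config>
```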

It appears that Struts uses many levels of indirection to make your application as reusable as possible. For example, your business logic Java code shouldn’t import anything from org.apache.struts because that would bind it too tightly to the Struts framework. Similarly, you need to avoid importing javax.servlet into your business logic because that binds it too closely to a web application. This level of abstraction aids reusability, but it takes a lot of work to keep it clean.

ApacheCon: Audio and Apache

Sander van Zoest, formerly of MP3.com, gave a great introduction to serving audio via Apache and other servers.

van Zoest described several different ways to deliver audio: HTTP downloading, HTTP streaming, Real Time Streaming Protocol (both on-demand streaming and live broadcast) and Windows Media Player’s MMS protocol. Since audio players seem to have pretty dumb HTTP implementations, it’s important to get things like MIME types exactly right.

He listed off a bunch of common audio formats and how to configure their extensions and MIME types correctly (pointing out that the correct MIME type for .mp3 files is audio/mpeg, but most websites get this wrong). He also spent a bit of time on the audio meta-formats such as M3U, SDP, ASX, and SMIL, which describe playlists. These formats are the bridge between your Apache server and your audio player. They are described in more detail in Sander’s online notes.
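
On the Apache side this mostly boils down to AddType directives; a sketch like the following (the .mp3 line is the one he called out as the most commonly botched; the playlist/metafile types are from memory, so double-check them against his notes):

```apache
# Audio content and playlist/metafile MIME types
AddType audio/mpeg        .mp3    # the correct type (not audio/x-mp3 etc.)
AddType audio/x-mpegurl   .m3u    # M3U playlists
AddType application/sdp   .sdp    # SDP session descriptions (RTSP)
AddType video/x-ms-asf    .asx    # ASX metafiles for Windows Media
AddType application/smil  .smil   # SMIL presentations
```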

Next, we looked at configuring Apache to serve large audio files. In short, you usually need to set your MaxClients and TimeOut directives to large values. Also, at the operating system level, you probably need to increase the size of your TCP listen queues to avoid clients seeing “connection refused” messages.
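
A hedged sketch of the kind of settings he was talking about (the numbers here are illustrative, not his recommendations):

```apache
# Audio clients hold connections open far longer than web browsers do.
MaxClients    256     # lots of simultaneous, long-lived downloads
Timeout       1200    # don't drop a slow client in the middle of a song
ListenBacklog 1024    # deeper TCP listen queue; the kernel's own limit
                      # (e.g. SOMAXCONN) may need raising as well
```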

We also examined the Shoutcast and Icecast protocols for streaming audio.

A couple of audience members asked whether anyone was writing Apache 2.0 protocol modules for RTSP and RTP, or whether RealNetworks’ new Helix initiative was working with the ASF. van Zoest answered that it seemed like there could be some real synergy there (using the same Apache server and configuration files to serve both HTTP and non-HTTP streaming audio), but that there hadn’t been much interaction between the two groups so far. He also mentioned licensing incompatibilities as a potential barrier to integration.

Dinner with Uncle Steve

My uncle Steve is in Las Vegas for Comdex, so we met up for dinner. He lives in Chicago so we don’t get to see him too often. Tonight was a real treat.

We enjoyed some pretty decent sushi at the San Remo hotel. Afterwards, we headed downtown to see the light show at the pedestrian mall. It was totally cheesy, but highly entertaining and free. The thrill was over 5 minutes after it started. I guess that’s classic Las Vegas.

Tomorrow night I’m hoping to catch a real show (like Blue Man Group or “O” or Siegfried and Roy).

ApacheCon: Scalable Internet Architectures 2

High Availability and Load Balancing

Theo challenged the audience to recognize the difference between replicable data and non-replicable data. Again the theme of the right tool for the job came up. Replicable data needs only marginal protection, so you can use commodity hardware. Non-replicable data needs single-point reliability, so you should consider “Enterprise” hardware.

He then put up a picture of the typical two-tier model of load balancers, content web servers, image web servers, and master/slave DBs. I’m glad to see that he pointed out the idea of splitting images out onto a separate web server; it’s a simple trick which can help you scale. The typical three-tier architecture looks mostly the same, but adds application servers and some more load balancers. This picture looks a lot more like the Yahoo! architecture.

Theo described hardware- and software-based load-balancing alternatives, and the tradeoffs. I was tickled to see DNS round robin as one of the software load-balancing choices. OmniTI seems to really like Wackamole and mod_backhand. In general, he seemed to prefer free software solutions over the expensive black-box hardware solutions. I guess this is probably because he makes money off his consulting practice by customizing all of that complicated software.

Diving into the software load-balancing alternatives, he described a project called Walrus. Walrus tries to pick the right server cluster by taking advantage of something in the DNS RFC which says that clients are supposed to measure DNS latency and pick the “closest” DNS server. Eventually users migrate towards one DNS server, and those servers (east coast vs. west coast) return disparate sets of IP addresses. Walrus is great in theory, but DNS isn’t implemented consistently on all clients, so it doesn’t work universally.

Theo proposes using a shared IP for DNS servers (but not for web clusters): assign the same IP address to your DNS servers in different locations. This only really works well if your network provider is the same in both places and is willing to work with you to make it happen.

Log collection

George felt that the rsync/scp/ftp method of collecting logs was terrible. He didn’t like that it uses unicast, so if you need to copy the logs to more than one place you have to do it multiple times, and he really disliked that you can’t run real-time analysis on the logs.

He examined syslog as an alternative that supports real-time logging to a loghost, but because it’s built on top of UDP, it’s unreliable (which might not sit well with your business requirements). Also, the syslog implementations on many hosts are inefficient.

Database logging solves the reliability, real-time, and centralization problems, but all of that relational database overhead makes it substantially slower than writing to a file. And all of those rows start to add up quickly. Imagine a website like Yahoo! with over 1.5 billion pageviews a day.

mod_log_spread takes a reliable multicast approach which allows for real-time processing of log data. George pointed out that real-time processing is fantastic for helping to notify you of things like 500 Internal Server Errors so you know when to do some on-the-fly debugging in your production environment.

Finally, Theo demo’d a cool Cocoa app for seeing real-time web statistics using the Spread Daemon.

ApacheCon: Scalable Internet Architectures 1

Theo and George Schlossnagle gave a two-hour talk on a hodge-podge of topics for scaling large websites. I’ll split this into two blog entries.

Low-hanging fruit: Apache 1.3 optimizations

First, George pointed out one of the easiest tricks to optimize a large website: turn KeepAlive Off. No surprise here; Yahoo! has been doing this for a long time. It’s very resource-intensive to keep Apache children around for each of your clients, and heavy-traffic sites can’t afford to do this (even if it makes the client experience marginally faster).
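
For reference, the whole trick is a single directive:

```apache
# Don't hold Apache children hostage waiting for idle clients to send
# another request; free them up for the next connection instead.
KeepAlive Off
```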

[Editorial comment: proponents of KeepAlive often point out that the overhead of establishing 13 TCP connections to fetch an HTML page plus a dozen images really sucks when you could simply have a single TCP connection. However, on really large sites, images are often hosted on a completely different host (e.g. images.amazon.com or us.yimg.com), so running KeepAlive on the dynamic HTML machine is pointless. Users usually spend several seconds reading a page before clicking on another link.]

Next, he pointed out that you should set SendBufferSize to your maximal page size (something like 40K or however large your HTML pages tend to be). This way you effectively send each page with a single write() call so your web app (running dynamic stuff like PHP or mod_perl) never needs to block waiting for the client to consume the data.
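
Something along these lines, with the buffer sized a bit bigger than your biggest typical page (40K here is just illustrative):

```apache
# Let an entire generated page fit in the kernel's TCP send buffer, so the
# PHP/mod_perl child can finish and move on instead of blocking on the client.
SendBufferSize 40960
```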

Use gzip compression as a transfer encoding. You spend more CPU cycles but save on bandwidth. For a large website, bandwidth is far more expensive than buying more CPU power.
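
On Apache 2.0 the usual way to get this is mod_deflate (on 1.3, people typically used the third-party mod_gzip); a minimal sketch, assuming mod_deflate is loaded:

```apache
# Compress text responses on the fly; images and other already-compressed
# content types are left alone.
AddOutputFilterByType DEFLATE text/html text/plain text/css
```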

Don’t use Apache 1.3 for static content: use Apache 2.0, thttpd, or tux. In other words, use the right tool for the job.

George went into more detail about setting up reverse proxies (also known as HTTP accelerators) to handle clients with slow connections in front of your dynamic content servers. He discussed a couple of different approaches using mod_proxy in conjunction with mod_rewrite and mod_backhand.
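
A stripped-down sketch of the mod_rewrite + mod_proxy flavor of this (the back-end hostname is a placeholder; the mod_backhand variants are more involved):

```apache
# Lightweight front-end "HTTP accelerator": buffer responses for slow clients
# and relay requests to the dynamic content servers via mod_proxy.
RewriteEngine On
RewriteRule ^/(.*)$ http://app-servers.internal.example.com/$1 [P,L]
ProxyPassReverse / http://app-servers.internal.example.com/
```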

Application caching

George made a compelling argument that commercial caching appliances can never do as good a job of caching your data as you can do yourself using tools like Squid. He pointed out the tradeoffs between black-box caching products (which don’t require any changes to your application) and application-integrated caching (highly efficient, but requires rewriting your app).

Application-integrated caching can use a convenient shared storage system (like an NFS-mounted disk) or can write to more efficient local storage on each host and use some sort of messaging system to communicate to your server pool when they need to dirty or invalidate their caches. This is not terribly difficult to do with PHP or Perl using Spread and the XML-RPC hooks (we saw about 4 or 5 slides on how to implement all that).

ApacheCon: Waka: a replacement for HTTP

Ok, I was a little misguided when I wrote earlier that XML/I18N was off-topic for an Apache conference. While not about the Apache server itself, these technologies are in fact widely used in today’s environment of HTTP and Apache.

But how about something completely different: a brand new protocol to replace HTTP? Whoa. (This is why it’s great to go to conferences. You get exposed to all of these neat ideas that you don’t always have time to think about as you’re doing your day job.)

Fielding covered a lot of ground in his 60 minutes. He started off discussing how Web Services is yet another example of how to solve the general problem of Enterprise Application Integration. What’s important about Web Services, Fielding points out, is that it helps to solve the integration problem. Ideally, once you have web services, you don’t have to do N^2 integrations for N corporate applications.

(Editorial comment: even though Web Services lets a bunch of different applications speak the same protocol, it doesn’t mean they understand the same stuff. PeopleSoft’s concept of what an Employee means is going to be different from what SAP thinks it means. Web Services can get the two applications in the same room as each other, but it can’t get them speaking the same language.)

Fielding went on to explain what’s great about HTTP as a protocol, and also pointed out some of the difficulties it presents. He described the REST architecture’s influence on the HTTP/1.1 spec, and then gave some further background on the HTTP protocol. He pointed out a few important limitations of HTTP/1.1:

  1. Overhead of MIME-style message syntax

  2. Head-of-line blocking on interactions
  3. Metadata unable to come after data
  4. Server can’t send unsolicited responses
  5. Low-power and bandwidth-sensitive devices are more severely impacted by verbosity

Fielding suggests that a new protocol standard could solve HTTP’s current problems in a generic way. “It’s not like we’re all going to keel over and die if we don’t get a replacement for HTTP, but it would be really valuable in some communities.” He proposes waka, which one could envision as a sort of HTTP/2.0. In fact, it takes advantage of the HTTP protocol upgrade feature that was implemented in HTTP/1.1. Waka is designed to match the efficiency of the REST architectural style.
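
That upgrade mechanism is the standard HTTP/1.1 one; a hypothetical exchange might look like this (the “waka” token is my guess, since the protocol isn’t finished, let alone registered):

```http
GET /index.html HTTP/1.1
Host: www.example.com
Upgrade: waka
Connection: Upgrade

HTTP/1.1 101 Switching Protocols
Upgrade: waka
Connection: Upgrade
```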

Waka adds a handful of new verbs (methods) to HTTP:

  • RENDER – explicit support for transcoding content (display/print/speak)

  • MONITOR – notify me when resource state changes
  • Authoring methods (a la DAV)
  • Request control data – for asynchronicity, transactional support, and QoS

Fielding went on in much more detail about all of these cool features of waka. One of my favorites was the ability for clients to define macros. You could use this feature to define a macro for a User-Agent string, then avoid sending all of those bytes on future interactions. I also like the fact that due to the asynchronicity of the protocol, you can interleave data and metadata. This could be pretty handy if you realized that you needed to issue a Set-Cookie header in the middle of a response.

The transactional support is also pretty important. Imagine that you’re trying to make an online payment of a non-trivial amount of money, and your flaky internet connection drops in the middle of the request. Did the payment go through? These days we get things like email confirmation messages, or you can log back on when your internet connection comes back up later and see if the transaction is mentioned in the transaction history. Waka could provide more protocol-level support.

Lastly, Fielding points out that waka is very much a work in progress. It hasn’t been fully spec’d or implemented, but he’s working actively on it. Expect to hear more about it in the coming year or so.

[Update: Apparently I misunderstood Fielding’s comments about the N vs. N^2 integration problem. According to Jeff Bone, “The basic argument is that type-specific interfaces lead to O(N^2) integrations, while generic interfaces lead to O(N) integrations. Web Services as cast today (SOAP as RPC, etc.) have type-specific interfaces and O(N^2) integration complexity; truly RESTful Web Services would in contrast use generic interfaces, and have integration complexity O(N).”]

ApacheCon: XML and I18N

After lunch, I headed off to see a slightly-off-topic presentation on XML and Internationalization by Yahoo!’s own Sander van Zoest.

van Zoest began by discussing Unicode in general, covering the UTF variants such as UTF-8 and the BOM (Byte Order Mark). Moving into XML itself, he gave an overview of the intended use of the xml:lang attribute and the ISO-639-2 language codes and ISO-3166 country codes.
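
For example, xml:lang lets a single document scope languages per element (a trivial made-up snippet):

```xml
<greetings>
  <greeting xml:lang="en-US">Hello</greeting>
  <greeting xml:lang="fr">Bonjour</greeting>
  <greeting xml:lang="ja">こんにちは</greeting>
</greetings>
```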

XML also supports Numeric Character References (NCRs) such as &#x20AC; for the EURO SIGN (€). These may be expressed in hex (&#xHHHH;) or decimal (&#DDDD;) and always contain the Unicode code point value (regardless of which UTF scheme the document is encoded in). NCRs can be handy when you need to represent a character that does not exist in the document’s encoding scheme, but they can’t be used in element or attribute names, or in CDATA sections and PIs.
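
So both of these refer to the same EURO SIGN (U+20AC), whether the file itself is encoded as UTF-8, UTF-16, or ISO-8859-1:

```xml
<!-- hex and decimal NCRs for U+20AC -->
<price currency="EUR">&#x20AC;100</price>
<price currency="EUR">&#8364;100</price>
```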

The presentation gave examples of how to do character set transformations using XSLT, Perl 5.8, and Java.

We got the usual pitch to use tags with semantic value such as <important> instead of <b>, and to use stylesheets to do presentation instead of cluttering up the markup.

van Zoest failed to mention the perennial question on this subject: why the heck does “i18n” stand for “internationalization”? The answer is that there are 18 letters between the beginning “i” and ending “n” in the word “internationalization”.

ApacheCon: Apache 2.0 Filters

I went to the talk on Apache 2.0 Filters by Greg Ames. I already knew a little bit about the Apache 2.0 Filtered I/O model from a session at the O’Reilly Open Source conference, but since I’m going to sit down and write one Real Soon Now, I oughta learn a little more about ’em.

Ames gave an example of the bucket brigade API by showing snippets of the mod_case_filter code. Looks pretty elegant and simple.
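
For my own notes, the bucket-brigade loop looks roughly like this; I’m paraphrasing mod_case_filter from memory rather than copying the code on his slides (error handling and the filter registration boilerplate omitted):

```c
#include "httpd.h"
#include "http_config.h"
#include "util_filter.h"
#include "apr_buckets.h"
#include "apr_lib.h"   /* apr_toupper() */

/* Naive output filter: walk the incoming brigade, read each bucket,
 * and append an upper-cased copy to an output brigade.  A real module
 * would register this with ap_register_output_filter(). */
static apr_status_t case_out_filter(ap_filter_t *f, apr_bucket_brigade *bb_in)
{
    conn_rec *c = f->r->connection;
    apr_bucket_brigade *bb_out = apr_brigade_create(f->r->pool,
                                                    c->bucket_alloc);
    apr_bucket *b;

    for (b = APR_BRIGADE_FIRST(bb_in);
         b != APR_BRIGADE_SENTINEL(bb_in);
         b = APR_BUCKET_NEXT(b)) {
        const char *data;
        apr_size_t len, i;
        char *buf;

        if (APR_BUCKET_IS_EOS(b)) {
            /* pass the end-of-stream marker along */
            APR_BRIGADE_INSERT_TAIL(bb_out,
                                    apr_bucket_eos_create(c->bucket_alloc));
            continue;
        }

        apr_bucket_read(b, &data, &len, APR_BLOCK_READ);

        buf = apr_bucket_alloc(len, c->bucket_alloc);
        for (i = 0; i < len; i++)
            buf[i] = (char)apr_toupper(data[i]);

        APR_BRIGADE_INSERT_TAIL(bb_out,
            apr_bucket_heap_create(buf, len, apr_bucket_free,
                                   c->bucket_alloc));
    }

    apr_brigade_cleanup(bb_in);
    return ap_pass_brigade(f->next, bb_out);
}
```

Note how the filter never sees the whole response at once; it only ever looks at one bucket at a time.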

He then went into details about why the naive approach fails miserably from a performance perspective, and showed some examples of how to do a filter the Right Way. It turns out that this is really complex.

Aside from all sorts of error conditions, there are lots of things to worry about with resource management. Do you allocate too much virtual memory? How often do you flush? Looks like you need to regularly flip between blocking and non-blocking I/O to do this right.

Filters that need to examine every single byte of the input (such as things that parse HTML or other tags) are even more complicated because you need to allocate private memory when a tag spans more than one bucket. Bleh. My mod_highlight_filter idea is going to be difficult to implement.

Ames then talked about the mod_ext_filter module from the Apache Directive perspective. I would’ve rather seen some slides about the implementation of this rather complex filter, but perhaps that would have been too technical for the audience.

He also discussed some tricks about how to debug Apache more easily with gdb and using 2 Listen statements (as a way to avoid starting with the -X option), and some useful gdb macros for your ~/.gdbinit file which make examining the bucket brigade easier. Cool tips. I guess I misjudged the technical level here; he probably skipped the implementation of mod_ext_filter because it would’ve taken too much time.

ApacheCon: Watching the Alpha Geeks

Tim O’Reilly gave this morning’s keynote address. (What’s bizarre is that he’s actually giving the keynote right now and I’m blogging via an 802.11b WLAN.)

O’Reilly spoke about early adopters being a good predictor for technology trends. He compared the models of Napster and MP3.com (distributed vs. client-server) and noted how it often takes someone looking at technology in a completely different way to make progress: cheap local storage and always-on networking are changing the computing landscape. He says the killer apps of today are all network applications: web, mail, chat, music sharing.

The best laugh came when he said that he thinks the phrase “Paradigm Shift” gets overused so much that it is starting to generate groans the way the phrase “The Knights Who Say Ni!” has done for years.

O’Reilly also spoke about applications migrating towards platforms. For example, instant messaging is an application (AIM, Y! Messenger, MSN Messenger) but it is becoming a platform (Jabber, AIM-iChat integration).

Before the talk, I actually had breakfast with O’Reilly (the restaurant was packed and we both grabbed seats at the same table) and we talked about the world of free software. He suggested writing an article for the O’Reilly newsletter about Y! moving away from yapache (our Apache web server variant) towards a more standard Apache server. (I mentioned our weird mod_yahoo_ccgi thing which is like a crippled version of mod_so, but we invented our own because back in 1996 we had a need for DSOs before they were directly supported in the Apache server.) After the PHP news “debacle” last month, we’ll see if I can get permission to write openly about the subject.

Heading to ApacheCon

I’m off to Vegas tonight for the ApacheCon conference. I’m looking forward to learning about what people are doing with Apache 2.0.

I’ve got this idea for a cool Apache 2.0 filter which I’m planning to work on in January. Basically, I’d like to be able to highlight search terms on a web page. You could tell what the user searched for by looking at the HTTP Referer header for patterns like http://www.google.com/search?q=search+terms+here and then highlight the terms as they appear in the page.
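
Just to convince myself that the Referer-parsing half is the easy part, here’s a back-of-the-envelope sketch in plain C (no %XX decoding or splitting on ‘+’ yet; the filter plumbing is where the real work will be):

```c
#include <stdio.h>
#include <string.h>

/* Pull the value of the q= parameter out of a Google-style Referer URL.
 * Copies at most outlen-1 bytes into out; returns 1 on success, 0 otherwise. */
static int referer_search_terms(const char *referer, char *out, size_t outlen)
{
    const char *q = strstr(referer, "?q=");
    size_t n;

    if (q == NULL)
        q = strstr(referer, "&q=");
    if (q == NULL)
        return 0;
    q += 3;                      /* skip past "?q=" or "&q=" */

    n = strcspn(q, "&");         /* value ends at the next parameter */
    if (n >= outlen)
        n = outlen - 1;
    memcpy(out, q, n);
    out[n] = '\0';
    return 1;
}

int main(void)
{
    char terms[256];
    const char *ref = "http://www.google.com/search?q=bucket+brigades&hl=en";

    if (referer_search_terms(ref, terms, sizeof(terms)))
        printf("search terms: %s\n", terms);   /* prints "bucket+brigades" */
    return 0;
}
```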

We could certainly use such a feature at work internally, and it would be yet another incentive for folks to make the switch from Apache 1.3 to Apache 2.0.