ApacheCon: XML and I18N

After lunch, I headed off to see a slightly off-topic presentation on XML and Internationalization by Yahoo!’s own Sander van Zoest.

van Zoest began by discussing Unicode in general, then covered the UTF variants such as UTF-8 and the BOM (Byte Order Mark). Moving into XML itself, he gave an overview of the intended use of the xml:lang attribute and the ISO-639-2 language codes and ISO-3166 country codes.

XML also supports Numeric Character References (NCRs) such as &#x20AC; for the EURO SIGN (€). These may be expressed in hex (&#xHHHH;) or decimal (&#DDDD;) and always contain the Unicode code point value (regardless of which UTF scheme the document is encoded in). NCRs can be handy when you need to represent a character that does not exist in the document’s encoding scheme, but they can’t be used in element or attribute names, or in CDATA sections and PIs.
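To make the NCR behaviour concrete, here is a small sketch in Python (my choice of language, not something from the talk). The helper names ncr_text and to_latin1_xml_text are made up for illustration. It shows that the hex and decimal forms resolve to the same code point even in an ISO-8859-1 document, which has no euro sign of its own.

```python
import xml.etree.ElementTree as ET


def ncr_text(xml_bytes: bytes) -> str:
    """Parse a small XML document and return the root element's text."""
    return ET.fromstring(xml_bytes).text or ""


def to_latin1_xml_text(text: str) -> bytes:
    """Encode text as ISO-8859-1, writing characters that the encoding
    cannot represent as decimal NCRs."""
    return text.encode("iso-8859-1", errors="xmlcharrefreplace")


# Hex and decimal references both name code point U+20AC.
doc = (b'<?xml version="1.0" encoding="ISO-8859-1"?>'
       b'<price>&#x20AC;10 and &#8364;10</price>')
print(ncr_text(doc) == "\u20ac10 and \u20ac10")  # True

# Going the other way: the euro sign is not in ISO-8859-1, so it becomes an NCR.
print(to_latin1_xml_text("\u20ac10"))  # b'&#8364;10'
```

Note that the xmlcharrefreplace handler only deals with unencodable characters; markup characters such as & and < still need escaping separately.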

The presentation gave examples of how to do character set transformations using XSLT, Perl 5.8, and Java.
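I didn't copy down the XSLT, Perl, or Java examples from the slides, so as a stand-in here is roughly what a character set transformation looks like in Python. The transcode helper is a hypothetical name, and the sketch assumes you already know the source encoding.

```python
def transcode(data: bytes, source: str, target: str) -> bytes:
    """Decode bytes from the source encoding to Unicode text,
    then re-encode that text in the target encoding."""
    return data.decode(source).encode(target)


# "café" in ISO-8859-1 uses a single byte (0xE9) for the e-acute.
latin1_bytes = b"caf\xe9"
utf8_bytes = transcode(latin1_bytes, "iso-8859-1", "utf-8")
print(utf8_bytes)  # b'caf\xc3\xa9'
```

The intermediate step is always Unicode text; the source and target encodings never talk to each other directly.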

We got the usual pitch to use tags with semantic value such as <important> instead of <b> and to use stylesheets to do presentation instead of cluttering up the markup.

van Zoest failed to mention the perennial question on this subject: why the heck does “i18n” stand for “internationalization”? The answer is that there are 18 letters between the beginning “i” and ending “n” in the word “internationalization”.
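The same counting rule works for any word, so here is a tiny Python sketch of it (numeronym is just a name I picked):

```python
def numeronym(word: str) -> str:
    """Abbreviate a word as first letter + count of inner letters + last letter."""
    if len(word) < 4:
        return word  # too short for the abbreviation to save anything
    return f"{word[0]}{len(word) - 2}{word[-1]}"


print(numeronym("internationalization"))  # i18n
print(numeronym("localization"))          # l10n
```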

ApacheCon: Apache 2.0 Filters

I went to the talk on Apache 2.0 Filters by Greg Ames. I already knew a little bit about the Apache 2.0 Filtered I/O model from a session at the O’Reilly Open Source conference, but since I’m going to sit down and write one Real Soon Now, I oughta learn a little more about ’em.

Ames gave an example of the bucket brigade API by showing snippets of the mod_case_filter code. Looks pretty elegant and simple.

He then went into details about why the naive approach fails miserably from a performance perspective, and showed some examples of how to do a filter the Right Way. It turns out that this is really complex.

Aside from all sorts of error conditions, there are lots of things to worry about with resource management. Do you allocate too much virtual memory? How often do you flush? Looks like you need to regularly flip between blocking and non-blocking I/O to do this right.

Filters that need to examine every single byte of the input (such as things that parse HTML or other tags) are even more complicated because you need to allocate private memory when a tag spans more than one bucket. Bleh. My mod_highlight_filter idea is going to be difficult to implement.
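To convince myself of the boundary problem, here is a rough sketch in Python rather than C. It does not use the real bucket brigade API at all: plain strings stand in for buckets, and highlight_stream is a hypothetical name. The point is only the carry-over: the filter has to hold back a private copy of the tail of each chunk in case a match straddles the boundary.

```python
from typing import Iterable, Iterator


def highlight_stream(chunks: Iterable[str], term: str) -> Iterator[str]:
    """Wrap each occurrence of term in <b>...</b>, even when an
    occurrence is split across two or more chunks."""
    if not term:
        raise ValueError("term must not be empty")
    keep = len(term) - 1  # longest partial match that can end a chunk
    pending = ""          # private copy of data carried between chunks
    for chunk in chunks:
        pending += chunk
        out = []
        pos = 0
        while True:
            hit = pending.find(term, pos)
            if hit < 0:
                break
            out.append(pending[pos:hit])
            out.append(f"<b>{term}</b>")
            pos = hit + len(term)
        # Emit everything except a tail that might begin a match
        # completed by the next chunk.
        safe_end = max(pos, len(pending) - keep)
        out.append(pending[pos:safe_end])
        pending = pending[safe_end:]
        yield "".join(out)
    if pending:
        yield pending  # end of stream: nothing left to wait for


# The term is split across two "buckets" here.
pieces = list(highlight_stream(["I like apa", "che filters"], "apache"))
print("".join(pieces))  # I like <b>apache</b> filters
```

Even this toy version ignores the hard parts (HTML tags, case folding, flushing, error handling), which is probably why the real thing is so much work.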

Ames then talked about the mod_ext_filter module from the perspective of its Apache configuration directives. I would’ve rather seen some slides about the implementation of this rather complex filter, but perhaps that would have been too technical for the audience.

He also discussed some tricks about how to debug Apache more easily with gdb and using 2 Listen statements (as a way to avoid starting with the -X option), and some useful gdb macros for your ~/.gdbinit file which make examining the bucket brigade easier. Cool tips. I guess I misjudged the technical level here; he probably skipped the implementation of mod_ext_filter because it would’ve taken too much time.

ApacheCon: Watching the Alpha Geeks

Tim O’Reilly gave this morning’s keynote address. (Actually, what’s bizarre is that he’s giving the keynote address right now and I’m blogging via an 802.11b WLAN.)

O’Reilly spoke about early adopters being a good predictor for technology trends. He compared the models of Napster and MP3.com (distributed vs. client-server models) and how it often takes someone to look at technology in a completely different way in order to make progress — cheap local storage and always-on networking are changing the computing landscape. He says the killer apps of today are all network applications: web, mail, chat, music sharing.

The best laugh came at the moment when he said that he thinks the phrase “Paradigm Shift” gets overused so much that it is starting to generate groans the way the phrase “The Knights Who Say Ni!” has done for years.

O’Reilly also spoke about applications migrating towards platforms. For example, instant messaging is an application (AIM, Y! Messenger, MSN Messenger) but it is becoming a platform (Jabber, AIM-iChat integration).

Before the talk, I actually had breakfast with O’Reilly (the restaurant was packed and we both grabbed seats at the same table) and we talked about the world of free software. He suggested writing an article for the O’Reilly newsletter about Y! moving away from yapache (our Apache web server variant) towards a more standard Apache server. (I mentioned our weird mod_yahoo_ccgi thing which is like a crippled version of mod_so, but we invented our own because back in 1996 we had a need for DSOs before they were directly supported in the Apache server.) After the PHP news “debacle” last month, we’ll see if I can get permission to write openly about the subject.

Heading to ApacheCon

I’m off to Vegas tonight for the ApacheCon conference. I’m looking forward to learning about what people are doing with Apache 2.0.

I’ve got this idea for a cool Apache 2.0 filter which I’m planning to work on in January. Basically, I’d like to be able to highlight search terms on a web page. You could tell what the user searched for by looking at the HTTP Referer header for patterns like http://www.google.com/search?q=search+terms+here and then highlight them as they appear in the page.
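The Referer-parsing half of the idea is easy to prototype. Here is a minimal sketch in Python (the real filter would be written in C against the Apache API). The search_terms name and the list of recognised hosts are my own assumptions; only the www.google.com/search?q=... pattern comes from the example above.

```python
from typing import List
from urllib.parse import parse_qs, urlparse

# Assumption: only this host is recognised; a real filter would need a longer list.
SEARCH_HOSTS = {"www.google.com"}


def search_terms(referer: str) -> List[str]:
    """Return the search terms found in a search-engine Referer URL,
    or an empty list if the URL does not look like a search."""
    parsed = urlparse(referer)
    if parsed.hostname not in SEARCH_HOSTS or parsed.path != "/search":
        return []
    # parse_qs decodes both %XX escapes and "+" (as a space).
    query = parse_qs(parsed.query)
    return query.get("q", [""])[0].split()


print(search_terms("http://www.google.com/search?q=search+terms+here"))
# ['search', 'terms', 'here']
```

Anything that does not match the pattern yields an empty list, so the filter could simply pass the page through untouched.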

We could certainly use such a feature at work internally, and it would be yet another incentive for folks to make the switch from Apache 1.3 to Apache 2.0.

Hiding .php extensions in Apache

Here’s a neat little trick. If you want to serve out PHP scripts without showing the .php extension, you can add something like this to your httpd.conf file:

DefaultType application/x-httpd-php

DirectoryIndex index index.html

Those directives will tell Apache that if there is no extension on a file, it should run the file through the PHP interpreter. On the filesystem itself, any PHP scripts can be called foo.php or simply foo (i.e. have no extension at all).

In a standard Apache configuration, DefaultType is set to text/plain. This may have made sense in 1996, but these days pretty much everything is HTML.

The DefaultType approach is substantially more efficient than Options MultiViews because there is no need to do readdir() calls to figure out what file to serve out. It also gives you added flexibility: if you ever rewrite part of your site to use a different technology (switch to mod_perl or whatever), the links won’t rot. And it’s 4 fewer bytes to send for each GET request!

Hiding the .php extension doesn’t really make your site any safer, because anyone who wants to hack your site can simply guess that you’re running PHP behind the scenes and attempt well-known exploits. This could best be described as “security through obscurity”, which gives engineers a warm and fuzzy feeling but isn’t really any more secure.