ApacheCon: XML and I18N

After lunch, I headed off to see a slightly-off-topic presentation on XML and Internationalization by Yahoo!’s own Sander van Zoest.

van Zoest began with an overview of Unicode, covering the UTF variants such as UTF-8 and BOMs (Byte Order Marks). Moving into XML itself, he explained the intended use of the xml:lang attribute and the ISO 639-2 language codes and ISO 3166 country codes.

XML also supports Numeric Character References such as &#x20AC; for the EURO SIGN (€). These may be expressed in hex (&#xHHHH;) or decimal (&#DDDD;) and always contain the Unicode code point value (regardless of which UTF scheme the document is encoded in). NCRs come in handy when you need to represent a character that does not exist in the document’s encoding scheme, but they can’t be used in element or attribute names, or inside CDATA sections and PIs.
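
For example, a document encoded in plain US-ASCII can still carry the euro sign via an NCR, and xml:lang marks the language of each element. A made-up snippet (not from the slides):

    <?xml version="1.0" encoding="US-ASCII"?>
    <menu>
      <item xml:lang="en-US">Coffee: &#x20AC;2.50</item>
      <item xml:lang="de-DE">Kaffee: &#x20AC;2,50</item>
    </menu>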

The presentation gave examples of how to do character set transformations using XSLT, Perl 5.8, and Java.
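
He stuck to those three, but the same trick in C is a thin wrapper around POSIX iconv(3). A minimal sketch, assuming a POSIX iconv (error handling pared down; some platforms declare the input buffer const):

    #include <iconv.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        /* transcode a Latin-1 string to UTF-8 */
        char in[] = "caf\xe9";              /* "café" in ISO-8859-1 */
        char out[16];
        char *inp = in, *outp = out;
        size_t inleft = strlen(in), outleft = sizeof(out);
        iconv_t cd = iconv_open("UTF-8", "ISO-8859-1");

        if (cd == (iconv_t)-1) {
            perror("iconv_open");
            return 1;
        }
        if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t)-1) {
            perror("iconv");
            return 1;
        }
        fwrite(out, 1, sizeof(out) - outleft, stdout);
        putchar('\n');
        iconv_close(cd);
        return 0;
    }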

We got the usual pitch to use tags with semantic value such as <important> instead of <b>, and to use stylesheets for presentation instead of cluttering up the markup.

van Zoest failed to mention the perennial question on this subject: why the heck does “i18n” stand for “internationalization”? The answer is that there are 18 letters between the beginning “i” and ending “n” in the word “internationalization”.

ApacheCon: Apache 2.0 Filters

I went to the talk on Apache 2.0 Filters by Greg Ames. I already knew a little bit about the Apache 2.0 Filtered I/O model from a session at the O’Reilly Open Source conference, but since I’m going to sit down and write one Real Soon Now, I oughta learn a little more about ’em.

Ames gave an example of the bucket brigade API by showing snippets of the mod_case_filter code. Looks pretty elegant and simple.
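
Reconstructing from memory, the core of such a filter looks something like this. A simplified sketch in the spirit of mod_case_filter, not the verbatim module (registration boilerplate omitted):

    #include "httpd.h"
    #include "apr_buckets.h"
    #include "apr_lib.h"        /* apr_toupper */
    #include "util_filter.h"

    static apr_status_t case_filter_out(ap_filter_t *f, apr_bucket_brigade *bb_in)
    {
        conn_rec *c = f->r->connection;
        apr_bucket_brigade *bb_out = apr_brigade_create(f->r->pool, c->bucket_alloc);
        apr_bucket *b;

        for (b = APR_BRIGADE_FIRST(bb_in);
             b != APR_BRIGADE_SENTINEL(bb_in);
             b = APR_BUCKET_NEXT(b)) {
            const char *data;
            char *buf;
            apr_size_t len, i;
            apr_bucket *out;

            if (APR_BUCKET_IS_EOS(b)) {
                out = apr_bucket_eos_create(c->bucket_alloc);
                APR_BRIGADE_INSERT_TAIL(bb_out, out);
                continue;
            }

            /* pull the data out of the bucket; a real filter must handle errors */
            if (apr_bucket_read(b, &data, &len, APR_BLOCK_READ) != APR_SUCCESS)
                continue;

            /* upcase it into a new heap bucket on the output brigade */
            buf = apr_bucket_alloc(len, c->bucket_alloc);
            for (i = 0; i < len; i++)
                buf[i] = apr_toupper(data[i]);
            out = apr_bucket_heap_create(buf, len, apr_bucket_free, c->bucket_alloc);
            APR_BRIGADE_INSERT_TAIL(bb_out, out);
        }
        return ap_pass_brigade(f->next, bb_out);
    }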

He then went into details about why the naive approach fails miserably from a performance perspective, and showed some examples of how to do a filter the Right Way. It turns out that this is really complex.

Aside from all sorts of error conditions, there are lots of things to worry about with resource management. Do you allocate too much virtual memory? How often do you flush? Looks like you need to regularly flip between blocking and non-blocking I/O to do this right.
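
The pattern, as I understood it: try a non-blocking read first, and if the data isn’t ready yet, flush what you’ve got downstream before you block. A fragment, reusing the names from the sketch above:

    /* inside the brigade loop of an output filter */
    apr_status_t rv = apr_bucket_read(b, &data, &len, APR_NONBLOCK_READ);
    if (APR_STATUS_IS_EAGAIN(rv)) {
        /* no data ready: push buffered output downstream with a FLUSH bucket ... */
        apr_bucket *flush = apr_bucket_flush_create(c->bucket_alloc);
        APR_BRIGADE_INSERT_TAIL(bb_out, flush);
        rv = ap_pass_brigade(f->next, bb_out);
        if (rv != APR_SUCCESS)
            return rv;
        /* ... and only then do a blocking read, so we never sit on output
           while waiting for input */
        rv = apr_bucket_read(b, &data, &len, APR_BLOCK_READ);
    }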

Filters that need to examine every single byte of the input (such as things that parse HTML or other tags) are even more complicated because you need to allocate private memory when a tag spans more than one bucket. Bleh. My mod_highlight_filter idea is going to be difficult to implement.
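
The standard trick is to hang your parser state off f->ctx so that a half-seen tag survives into the next invocation. A hypothetical skeleton (the struct and names are mine, not from the talk):

    #include "httpd.h"
    #include "apr_buckets.h"
    #include "util_filter.h"

    typedef struct {
        char partial[64];       /* bytes of a tag that spilled past a bucket boundary */
        apr_size_t npartial;
    } highlight_ctx;

    static apr_status_t highlight_filter(ap_filter_t *f, apr_bucket_brigade *bb)
    {
        highlight_ctx *ctx = f->ctx;

        if (ctx == NULL) {
            /* first invocation: allocate private state from the request pool */
            f->ctx = ctx = apr_pcalloc(f->r->pool, sizeof(*ctx));
        }
        /* ... scan each bucket, stashing any incomplete tag in ctx->partial
           and prepending it to the data seen on the next invocation ... */
        return ap_pass_brigade(f->next, bb);
    }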

Ames then talked about the mod_ext_filter module from the perspective of its configuration directives. I would’ve rather seen some slides about the implementation of this rather complex filter, but perhaps that would have been too technical for the audience.

He also discussed some tricks about how to debug Apache more easily with gdb and using 2 Listen statements (as a way to avoid starting with the -X option), and some useful gdb macros for your ~/.gdbinit file which make examining the bucket brigade easier. Cool tips. I guess I misjudged the technical level here; he probably skipped the implementation of mod_ext_filter because it would’ve taken too much time.
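
I didn’t copy his macros down, but the httpd 2.0 source tree ships a .gdbinit with brigade-dumping helpers; a session goes roughly like this (paths and PID made up, breaking on the sketch filter above):

    $ gdb /usr/local/apache2/bin/httpd
    (gdb) source ~/src/httpd-2.0/.gdbinit
    (gdb) attach 12345                  # pid of an httpd child
    (gdb) break case_filter_out
    (gdb) continue
    (gdb) dump_brigade bb_in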

ApacheCon: Watching the Alpha Geeks

Tim O’Reilly gave this morning’s keynote address. (Actually, what’s bizarre is that he’s giving the keynote address right now and I’m blogging via an 802.11b WLAN.)

O’Reilly spoke about early adopters being a good predictor of technology trends. He compared Napster and MP3.com (distributed vs. client-server models) and observed that it often takes someone looking at technology in a completely different way to make progress: cheap local storage and always-on networking are changing the computing landscape. He says the killer apps of today are all network applications: web, mail, chat, music sharing.

The best laugh came when he said that he thinks the phrase “Paradigm Shift” gets overused so much that it is starting to generate groans the way the phrase “The Knights Who Say Ni!” has done for years.

O’Reilly also spoke about applications migrating towards platforms. For example, instant messaging is an application (AIM, Y! Messenger, MSN Messenger) but it is becoming a platform (Jabber, AIM-iChat integration).

Before the talk, I actually had breakfast with O’Reilly (the restaurant was packed and we both grabbed seats at the same table) and we talked about the world of free software. He suggested writing an article for the O’Reilly newsletter about Y! moving away from yapache (our Apache web server variant) towards a more standard Apache server. (I mentioned our weird mod_yahoo_ccgi thing which is like a crippled version of mod_so, but we invented our own because back in 1996 we had a need for DSOs before they were directly supported in the Apache server.) After the PHP news “debacle” last month, we’ll see if I can get permission to write openly about the subject.

Heading to ApacheCon

I’m off to Vegas tonight for the ApacheCon conference. I’m looking forward to learning about what people are doing with Apache 2.0.

I’ve got this idea for a cool Apache 2.0 filter which I’m planning to work on in January. Basically, I’d like to be able to highlight search terms on a web page. You could tell what the user searched for by looking at the HTTP Referer header for patterns like http://www.google.com/search?q=search+terms+here, and then highlight those terms wherever they appear in the page.
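
Extracting the terms is the easy part. A standalone sketch (the helper is hypothetical, and a real filter would parse the query string more carefully, including %-escapes):

    #include <stdio.h>
    #include <string.h>

    /* Hypothetical helper: copy the q= terms out of a Google referer URL,
     * turning '+' into spaces.  %-escapes are left undecoded in this sketch. */
    static size_t extract_terms(const char *referer, char *out, size_t outlen)
    {
        const char *p = strstr(referer, "q=");
        size_t n = 0;

        if (p == NULL)
            return 0;
        for (p += 2; *p != '\0' && *p != '&' && n + 1 < outlen; p++)
            out[n++] = (*p == '+') ? ' ' : *p;
        out[n] = '\0';
        return n;
    }

    int main(void)
    {
        char terms[256];
        const char *ref = "http://www.google.com/search?q=search+terms+here";

        if (extract_terms(ref, terms, sizeof(terms)) > 0)
            printf("highlight: %s\n", terms);   /* "highlight: search terms here" */
        return 0;
    }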

We could certainly use such a feature at work internally, and it would be yet another incentive for folks to make the switch from Apache 1.3 to Apache 2.0.

Israelis love “West Wing”

We have many friends living in Israel this year. Last year we used to have a regular gathering on Wednesday evenings to watch The West Wing on our projection TV.

Now they’ve gotta wait months for someone to send ’em a tape from the States. They recently got together to watch a couple of hours and sent us this digital photo.

We’ll send ’em another tape soon. In the meantime, I can hardly wait for this week’s episode:

“Swiss Diplomacy”

The Iranian leader makes a secret request of Bartlet to allow his son to be flown to the United States for life-saving surgery.

I sure miss my friends.

Opt-out of Telemarketers and Junk Mail

There are already several good tools for combating e-mail spam, but what about telephone calls and U.S. Postal Service junk mail?

Few people are aware of an organization called the Direct Marketing Association, whose services are used by many (but not all) telemarketers and junk mailers. From what I understand, the DMA rents a centralized database of names and phone numbers to reputable marketers.

What’s important about the DMA is that they provide something called the Mail Preference Service and the Telephone Preference Service which are opt-out lists that consumers can join. If you tell the DMA that you don’t want junk mail or telephone calls by registering with these lists, they will refuse to give your name to their customers. Pretty cool.

There are other groups to contact as well, such as the Big 3 credit reporting agencies (Equifax, Experian, and Trans Union). Ask these companies to stop sharing your name and address, and suddenly all of those credit card solicitations disappear!

The easiest way to get rid of all of this offline spam is to go to Operation Opt-Out and print out the 7 or 8 forms and mail them off to the various agencies. The $3.00 investment in stamps will go a long way towards reducing aggravation.

We like cars, the cars that go boom

My parents bought me an external FireWire CD burner for a birthday present (back in June), but the thing didn’t work. When I met them for dinner last night, they gave me a new one which, although a wimpy USB 1.1 model, is at least a name brand (Iomega Zip CD650). Turns out that it actually works. I guess there’s something to be said for buying name brands.

So this morning I finally made the “Car Songs” CD that I’ve been thinking about since this summer! Here’s what’s on the CD:

  • Barenaked Ladies – In the Car
  • Beach Boys – Fun, Fun, Fun
  • Beatles – Drive My Car
  • Billy Ocean – Get Out of My Dreams Get into My Car
  • Don McLean – American Pie
  • Geggy Tah – Whoever You Are
  • L’Trimm – Cars that go boom
  • Prince – Little Red Corvette
  • Roger Miller – King of the Road
  • Rose Royce – Car Wash
  • Sammy Hagar – I Can’t Drive 55
  • Tracy Chapman – Fast Car
  • Trio – Da Da Da
  • Willie Nelson – On The Road Again

Yes, I know that “Da Da Da” isn’t actually a song about cars or driving, but those Volkswagen commercials have infused the song with a whole new set of car imagery. Kinda like I always think of United Airlines when I hear Rhapsody in Blue.

Secret lyrics to the Ashim Theme

I recently got a cell phone that allows you to program custom ring tones.

So I programmed in the first 8 bars of the Ashim Theme that Mike Cafarella and I composed (with apologies to the Norwegian Folk Song) back in 1996.

I can’t reveal the whole set of lyrics to the Ashim Theme due to the blood oath that I swore to Caf back on that overcast day in Providence. But here’s an excerpt:

As I walk through the dark forest,
Swimming through waves of terror.
Oh my goodness, there he is now.
It’s the spirit of Roberto.
I must run, I must flee,
He will kill me (holy cow)!
Have I escaped? Am I free now?
Is this heaven, not hell?
I must try to find Nirvana: Taco Bell.

Now, every time my phone rings I think fondly of Ashim.

How do we fix the blogrot problem?

I’m concerned about linkrot in blogs. Blog entries tend to mention interesting stuff by hyperlinking to news articles, websites, and other blogs. Since it’s so easy to create a link to something (rather than excerpting a relevant paragraph), you’re pretty much guaranteed to get a 404 when you try to visit that link at a later date. This might not be such an obvious problem on the surface because blogs are an ephemeral, fresh and up-to-the-minute medium, and linkrot usually takes a few weeks or months to set in.

But blogs are also supposed to serve as a diary or journal, so you should be able to go back 6 months from now and revisit all of the cool stuff you used to think about. That’s when linkrot is going to burn you the worst, because you’ll want to re-read an article or another person’s blog, and most likely it won’t be there anymore.

I think it would be cool if MovableType or some other popular blogging software could provide a PermaLink feature for external content. I’m thinking of something like the Google cache, which would mirror the content locally and add a header that would say something like:

This is Michael J. Radwin’s blog’s cache of http://www.newsfactor.com/perl/story/19912.html. This cache is the snapshot that I took of the page as I wrote my blog.

The page may have changed since that time. Click here for the current page.

It would work even better if there were some clever integration with your browser that says to visit this HREF first, but if you get a 404, try this alternative HREF (which happens to point to a snapshot of the page in your blog archives). I’m sure XLink has something like this when you go beyond xlink:type=”simple” but I doubt browsers do anything intelligent with it.
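
In theory an XLink extended link could express the fallback; a hypothetical snippet (the /archives/cache/ path is invented):

    <ref xmlns:xlink="http://www.w3.org/1999/xlink" xlink:type="extended">
      <loc xlink:type="locator" xlink:label="live"
           xlink:href="http://www.newsfactor.com/perl/story/19912.html"/>
      <loc xlink:type="locator" xlink:label="cache"
           xlink:href="/archives/cache/19912.html"/>
    </ref>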

Heck, even blogs themselves are prone to linkrot. I recently decided to switch my MovableType settings to use Date-based archives instead of Individual Item archives because I rarely write more than one entry per day. Clicking on that convenient “Rebuild Site” button caused everything to get rebuilt. But what if someone had already linked to one of my old Individual archives that’s no longer there? Apparently that PermaLink feature is not so “perma”.

I’m encouraged that people are working on solving the linkrot problem in a generalized way but not everyone is going to care to do it right.

Email from Nigeria – RE: URGENT BUSINESS PROPOSAL

I’ve gotten a handful of the Nigerian money-laundering scam emails over the past year, although it seems like I’m getting one or two a week now. I’m reminded of a story in Wired Magazine about some Netizen who decided to catch the scammers on film. Apparently this person had a lot of time to waste because they exchanged over 50 emails. Read the Wired synopsis now for some quick entertainment, and read the full story when you’re really bored (and don’t mind a dirty sense of humor).

But spam is a not-so-funny problem. I publish my email address pretty openly on my website, so I end up getting a lot of spam. I have been filtering it for a couple of years with some home-made procmail recipes. I came up with a heuristic in 1999 that worked quite well:

  1. if the subject line happened to mention toner cartridges, it probably was spam.
  2. if my email address (or one of the mailing lists I subscribed to) was on the To: or Cc: line, it probably wasn’t spam.
  3. everything else was probably spam.
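
In procmail terms, the recipes looked roughly like this (reconstructed from memory; the addresses are placeholders):

    # 1. toner-cartridge subjects are almost certainly spam
    :0:
    * ^Subject:.*toner
    possible-spam

    # 2. mail addressed directly to me (or to a list I read) goes to the inbox
    :0:
    * ^(To|Cc):.*(michael@example\.com|some-list@example\.org)
    $DEFAULT

    # 3. everything else is probably Bcc'd bulk mail
    :0:
    possible-spam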

This heuristic worked pretty well, but had a few drawbacks. Most notably, when my friends would throw a big party and Bcc: me, the invitation would end up in my possible-spam folder.

In addition, spammers started getting more intelligent. Beginning around 2001, they started sending individual spam messages that were actually addressed directly to me! So my underlying heuristic was starting to fail me.

To solve the “Bcc” and “unknown sender” problem, I came up with a new plan. I was going to load my addressbook into a DBM hash and add a procmail rule that classified anything that came from this list of approved senders as guaranteed non-spam, and anything from someone unknown as likely spam. Then, I would add people to the DBM hash one-by-one when I confirmed that they were an actual friend of mine.
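
The procmail half of the plan would have looked something like this, with a flat file standing in for the DBM hash (code I never actually wrote, sketched after the fact):

    # extract the sender's address by asking formail to build a reply
    FROM=`formail -rtzxTo:`

    # approved senders go straight to the inbox
    :0:
    * ? fgrep -qix "$FROM" $HOME/.friends
    $DEFAULT

    # everyone else lands in the likely-spam folder
    :0:
    likely-spam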

I kept meaning to get started on this project, but hadn’t found the time. It didn’t even occur to me to look for someone else’s software to solve my problem. (This is the problem with being a software engineer. You know how to solve problems like this, and it’s so easy to do, that you often start working on a solution without checking to see if anyone else has done it yet. We call it re-inventing the wheel.)

Luckily, before I could waste a whole bunch of my time, someone at work mentioned a nifty server-side spam filter called SpamAssassin. I took a look and installed it at my ISP. It’s not perfect, but it does a remarkable job of detecting spam, and it’s about 50 times better than anything I could’ve written.
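
Wiring it into procmail takes all of two recipes (the folder name is my choice):

    # run every incoming message through SpamAssassin ...
    :0fw
    | spamassassin

    # ... then file anything it tagged into a separate folder
    :0:
    * ^X-Spam-Status: Yes
    probably-spam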

SpamAssassin works well because it’s got a group of volunteers who are constantly updating a rules database that says what patterns in an email make it more (or less) likely to be a spam message. For example, if the email message mentions “herbal Viagra” or toner cartridges, it’s likely to be spam. It even has something called a whitelist, which matches my idea of allowing people in your addressbook to send you mail. But I’m not even using that feature, because the 2.4x series of SpamAssassin works well enough out-of-the-box.
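
And if I ever do want my addressbook idea back, the whitelist is a one-liner per sender in ~/.spamassassin/user_prefs (made-up addresses):

    whitelist_from friend@example.com
    whitelist_from *@lists.example.org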

Yahoo! has a completely different system that it uses for Yahoo! Mail. Instead of running pattern detection on the email (an effective but labor-intensive solution), our former Chief Scientist (who recently left Yahoo! for a job at Amazon.com) came up with an automated algorithm. Unfortunately, I can’t say much about Udi’s approach without giving away trade secrets, but you’ll be able to read the patent when it’s finally approved.

ISPs should take a long, serious look at providing SpamAssassin as a service to their users. It won’t catch every piece of spam, but it probably offers the best ratio of spam caught to sysadmin effort required.