Michael J. Radwin

Tales of a software engineer who keeps kosher and hates the web.

Currently Viewing Posts in The Web Sucks

How do we fix the blogrot problem?

I’m concerned about linkrot in blogs. Blog entries tend to mention interesting stuff by hyperlinking to news articles, websites, and other blogs. Since it’s so easy to create a link to something (rather than excerpting a relevant paragraph), you’re pretty much guaranteed to get a 404 when you try to visit that link at a later date. This might not be such an obvious problem on the surface because blogs are an ephemeral, fresh and up-to-the-minute medium, and linkrot usually takes a few weeks or months to set in.

But blogs are also supposed to serve as a diary or journal, so you should be able to go back 6 months from now and revisit all of the cool stuff you used to think about. That’s when linkrot is going to burn you the worst, because you’ll want to re-read an article or another person’s blog, and most likely it won’t be there anymore.

I think it would be cool if MovableType or some other popular blogging software could provide a PermaLink feature for external content. I’m thinking of something like the Google cache, which would mirror the content locally and add a header that would say something like:

This is Michael J. Radwin’s blog’s cache of http://www.newsfactor.com/perl/story/19912.html. This cache is the snapshot that I took of the page as I wrote my blog.

The page may have changed since that time. Click here for the current page.
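Here is a rough sketch of how blogging software might stamp a saved copy with that kind of header. It is not a MovableType feature; the function name, the `cache-banner` class, and the banner wording are all made up for illustration.

```python
from html import escape


def add_cache_banner(page_html: str, source_url: str, owner: str) -> str:
    """Prefix a saved copy of a page with a notice that links to the original.

    Hypothetical helper: the markup and wording are assumptions, not part
    of any real blogging tool.
    """
    url = escape(source_url, quote=True)
    banner = (
        '<div class="cache-banner">'
        f'This is a cached copy of <a href="{url}">{url}</a>, '
        f"saved by {escape(owner)} when the blog entry was written. "
        "The page may have changed since that time."
        "</div>\n"
    )
    # The snapshot itself is left untouched; only the banner is prepended.
    return banner + page_html
```

The blog would store the returned string in its own archives and link to it next to the original URL.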

It would work even better if there were some clever integration with your browser that says to visit this HREF first, but if you get a 404, try this alternative HREF (which happens to point to a snapshot of the page in your blog archives). I’m sure XHTML has something like this when you go beyond xlink:type="simple" but I doubt browsers do anything intelligent with it.
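Browsers don't do this, but the "try the original, fall back to the snapshot on a 404" logic is easy to sketch on the server side. This is only an illustration: `resolve_link` and its `opener` parameter are hypothetical names, and a real version would also have to decide what to do about timeouts and DNS failures, which this one simply lets propagate.

```python
import urllib.error
import urllib.request


def resolve_link(primary_url, snapshot_url, opener=urllib.request.urlopen):
    """Return primary_url if it still loads, or snapshot_url if it gives a 404.

    `opener` defaults to urllib's urlopen; it is a parameter only so the
    function can be exercised without touching the network.
    """
    try:
        with opener(primary_url, timeout=10):
            return primary_url
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return snapshot_url
        raise  # other HTTP errors are not treated as linkrot here
```

A blog could call this when rebuilding its pages and rewrite dead links to point at the local snapshot.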

Heck, even blogs themselves are prone to linkrot. I recently decided to switch my MovableType settings to use Date-based archives instead of Individual Item archives because I rarely write more than one entry per day. Clicking on that convenient “Rebuild Site” button caused everything to get rebuilt. But what if someone had already linked to one of my old Individual archives that’s no longer there? Apparently that PermaLink feature is not so “perma”.

I’m encouraged that people are working on solving the linkrot problem in a generalized way but not everyone is going to care to do it right.

Email from Nigeria – RE: URGENT BUSINESS PROPOSAL

I’ve gotten a handful of the Nigerian money-laundering scam emails over the past year, although it seems like I’m getting one or two a week now. I’m reminded of a story in Wired Magazine about some Netizen who decided to catch the scammers on film. Apparently this person had a lot of time to waste because they exchanged over 50 emails. Read the Wired synopsis now for some quick entertainment, and read the full story when you’re really bored (and don’t mind a dirty sense of humor).

But spam is a not-so-funny problem. I publish my email address pretty openly on my website, so I end up getting a lot of spam. I have been filtering it for a couple of years with some home-made procmail recipes. I came up with a heuristic in 1999 that worked quite well:

  1. if the subject line happened to mention toner cartridges, it probably was spam.
  2. if my email address (or one of the mailing lists I subscribed to) was on the To: or Cc: line, it probably wasn’t spam.
  3. everything else was probably spam.
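The original was a set of procmail recipes, which aren't shown here. As a stand-in, the three rules above can be sketched in Python. The addresses in `MY_ADDRESSES` are placeholders, and the exact subject match is a guess at what the recipe looked for.

```python
from email.message import EmailMessage
from email.utils import getaddresses

# Placeholder addresses: my own address plus the lists I subscribe to.
MY_ADDRESSES = {"me@example.com", "some-list@example.org"}


def looks_like_spam(msg: EmailMessage) -> bool:
    """Apply the three rules in order and return True for probable spam."""
    # Rule 1: a subject line that mentions toner cartridges.
    subject = str(msg.get("Subject") or "").lower()
    if "toner cartridge" in subject:
        return True

    # Rule 2: mail addressed to me (or one of my lists) on To: or Cc:.
    headers = [str(h) for h in msg.get_all("To", []) + msg.get_all("Cc", [])]
    recipients = getaddresses(headers)
    if any(addr.lower() in MY_ADDRESSES for _name, addr in recipients):
        return False

    # Rule 3: everything else.
    return True
```

Note that a message delivered via Bcc: never lists my address on To: or Cc:, so it falls through to rule 3.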

This heuristic worked pretty well, but had a few drawbacks. Most notably, when my friends would throw a big party and Bcc: me, the invitation would end up in my possible-spam folder.

In addition, spammers started getting more intelligent. Beginning around 2001, they started sending individual spam messages that were actually addressed directly to me! So my underlying heuristic was starting to fail me.

To solve the “Bcc” and “unknown sender” problem, I came up with a new plan. I was going to load my addressbook into a DBM hash and add a procmail rule that classified anything that came from this list of approved senders as guaranteed non-spam, and anything from someone unknown as likely spam. Then, I would add people to the DBM hash one-by-one when I confirmed that they were an actual friend of mine.
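This plan never got built, so what follows is only a guess at its shape, using Python's standard `dbm` module in place of a procmail rule. The function names and the idea of storing lowercased addresses as keys are assumptions.

```python
import dbm
from email.utils import parseaddr


def add_sender(db_path: str, address: str) -> None:
    """Add one confirmed friend to the approved-senders database."""
    with dbm.open(db_path, "c") as db:  # "c" creates the file if needed
        db[address.strip().lower()] = "1"


def is_approved(db_path: str, from_header: str) -> bool:
    """Return True if the From: header's address is in the database."""
    _name, addr = parseaddr(from_header)
    with dbm.open(db_path, "c") as db:
        return addr.lower() in db
```

Mail from an approved sender would be filed as non-spam no matter how it was addressed, which covers the Bcc: case; anything from an unknown sender would go to the possible-spam folder until the sender is added.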

I meant to get started on this project but hadn’t found the time to do it. It didn’t even occur to me to look for someone else’s software to solve my problem. (This is the problem with being a software engineer. You know how to solve problems like this, and it’s so easy to do, that you often start working on a solution without checking to see if anyone else has done it yet. We call it re-inventing the wheel.)

Luckily, before I could waste a whole bunch of my time, someone at work mentioned a nifty server-side spam filter called SpamAssassin. I took a look and installed it on my ISP. It’s not perfect, but it does a remarkable job of detecting spam, and it’s about 50 times better than anything I could’ve written.

SpamAssassin works well because it’s got a group of volunteers who are constantly updating a rules database that says what patterns in an email make it more (or less) likely to be a spam message. For example, if the email message mentions “herbal Viagra” or toner cartridges, it’s likely to be spam. It even has something called a whitelist, which matches my idea of allowing people in your addressbook to send you mail. But I’m not even using that feature, because the 2.4x series of SpamAssassin works well enough out-of-the-box.
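To give a feel for the idea, here is a toy version of pattern-based scoring. It is not SpamAssassin's code or its rule format: the patterns, weights, and threshold below are invented for illustration, and the real rules database is far larger.

```python
import re

# (pattern, weight) pairs. Positive weights push toward spam, negative
# weights push away from it. All values here are made up.
RULES = [
    (re.compile(r"herbal viagra", re.IGNORECASE), 3.0),
    (re.compile(r"toner cartridges?", re.IGNORECASE), 3.0),
    (re.compile(r"^>", re.MULTILINE), -1.0),  # quoted reply text
]

THRESHOLD = 5.0  # hypothetical cutoff


def spam_score(text: str) -> float:
    """Add up the weights of every rule whose pattern appears in the text."""
    return sum(weight for pattern, weight in RULES if pattern.search(text))


def is_spam(text: str, sender_whitelisted: bool = False) -> bool:
    """Whitelisted senders always pass; otherwise compare to the threshold."""
    if sender_whitelisted:
        return False
    return spam_score(text) >= THRESHOLD
```

The `sender_whitelisted` flag stands in for the whitelist feature: mail from a known sender skips the scoring entirely.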

Yahoo! has a completely different system that it uses for Yahoo! Mail. Instead of running pattern detection on the email (an effective but labor-intensive solution), our former Chief Scientist (who recently left Yahoo! for a job at Amazon.com) came up with an automated algorithm. Unfortunately, I can’t say much about Udi’s approach without giving away trade secrets, but you’ll be able to read the patent when it’s finally approved.

ISPs should take a long, serious look at providing SpamAssassin as a service to their users. It won’t catch every piece of spam, but it probably catches more spam per hour of sysadmin effort than anything else out there.