I’ve gotten a handful of the Nigerian money-laundering scam emails over the past year, although it seems like I’m getting one or two a week now. I’m reminder of a story in Wired Magazine about some Netizen who decided to catch the scammers on film. Apparently this person had a lot of time to waste because they exchanged over 50 emails. Read the Wired synopsis now for some quick entertainment, and read the full story when you’re really bored (and don’t mind a dirty sense of humor).
But spam is a not-so-funny problem. I publish my email address pretty openly on my website, so I end up getting a lot of spam. I have been filtering email it for a couple of years with some home-made procmail recipes. I came up with a heuristic in 1999 that worked quite well:
- if the subject line happened to mention toner cartridges, it probably was spam.
- if my email address (or one of the mailing lists I subscribed to) was on the To: or Cc: line, it probably wasn’t spam.
- everything else was probably spam.
This heuristic worked pretty well, but had a few drawbacks. Most notably, when my friends would throw a big party and Bcc: me, the invitation would end up in my possible-spam folder.
In addition, spammers started getting more intelligent. Beginning around 2001, they started sending individual spam messages that were actually addressed directly to me! So my underlying heuristic was starting to fail me.
To solve the “Bcc” and “unknown sender” problem, I came up with a new plan. I was going to load my addressbook into a DBM hash and add a procmail rule that classified anything that came from this list of approved senders as guaranteed non-spam, and anything from someone unknown as likely spam. Then, I would add people to the DBM hash one-by-one when I confirmed that they were an actual friend of mine.
I was about to get started on this project but I hadn’t found the time to do it. It didn’t even occur to me to look for someone else’s software to solve my problem. (This is the problem with being a software engineer. You know how to solve problems like this, and it’s so easy to do, that you often start working on a solution without checking to see if anyone else has done it yet. We call it re-inventing the wheel).
Luckily, before I could waste a whole bunch of my time, someone at work mentioned a nifty server-side spam filter called SpamAssassin. I took a look and installed it on my ISP. It’s not perfect, but it does a remarkable job of detecting spam, and it’s about 50 times better than anything I could’ve written.
SpamAssassin works well because it’s got a group of volunteers who are constantly updating a rules database that says what patterns in an email make it more (or less) likely to be a spam message. For example, if the email message mentions “herbal Viagra” or toner cartridges, it’s likely to be spam. It even has something called a whitelist, which matches my idea of allowing people in your addressbook to send you mail. But I’m not even using that feature, because the 2.4x series of SpamAssassin works well enough out-of-the-box.
Yahoo! has a completely different system that it uses for Yahoo! Mail. Instead of running pattern detection on the email (an effective but labor-intensive solution), our former Chief Scientist (who recently left Yahoo! for a job at Amazon.com) came up with an automated algorithm. Unfortunately, I can’t say much about Udi’s approach without giving away trade secrets, but you’ll be able to read the patent when it’s finally approved.
ISPs should take a long, serious look at providing SpamAssassin as a service to their users. It won’t catch every piece of spam, but it’s probably got the best ratio of low sysadmin effort yielding a high quantity of spam detection.