Adventures with DB_File::Lock

I like the Unix DBM file format (a.k.a. BerkeleyDB). I use it for static data (like the zip code-to-latitude/longitude database for the Hebcal Interactive Jewish Calendar) and for dynamic data (such as the subscriber database for the Mountain View High School Alumni Internet Directory).

BerkeleyDB is also great because it has many language interfaces. I can access the same DB files in both Perl and PHP.

My high school alumni directory subscriber database has experienced corruption a few times recently. It’s a good thing I also keep a daily text backup of the database in RCS because it makes it easy to rebuild the DB.

But it’s obvious to me that the underlying cause of the problem is concurrent access that isn’t protected by mutual exclusion. Heck, I wrote the code back in 1995 when I didn’t know better.

So I’ve gotta go add some locking code to the 25 scripts that manage the site.

However, older versions of BerkeleyDB (such as the one installed on my ISP) don’t natively support locking, so I’ve gotta use flock for concurrency. No problem; it’s relatively easy to turn every occurrence of this:


use DB_File;

my(%DB);
tie(%DB, 'DB_File', $file, O_RDWR|O_CREAT, 0644, $DB_HASH);
$DB{'foo'} = 'bar';
untie(%DB);

into something that looks like this:


use DB_File;
use Fcntl qw(:DEFAULT :flock);

my(%DB);
my($db) = tie(%DB, 'DB_File', $file, O_RDWR|O_CREAT, 0644, $DB_HASH);
defined($db) || die "Can't tie $file: $!\n";

my($fd) = $db->fd;
open(DB_FH, "+<&=$fd") || die "dup $!";
unless (flock (DB_FH, LOCK_EX)) { die "flock: $!" }

$DB{'foo'} = 'bar';

flock(DB_FH, LOCK_UN);
undef $db;
undef $fd;
untie(%DB);
close(DB_FH);

Bingo. Problem seems to be fixed. No more DB corruption.

But then, a few weeks later, I get DB corruption again. Ugh. Turns out that I managed to fix 24 of the scripts, but there’s one that I occasionally run by hand (the one that removes someone from the directory) that I forgot to add locking code to. With flock, it only takes one script to screw it up.

So last night I was about to go through the scripts and update them, but while reading the DB_File manpage I saw that it points out a possible problem with the classic “tie the db, dup the fd, then flock” approach. So fixing the 25th script to use the same locking scheme won’t necessarily solve the problem either. Doh!
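As I understand the manpage, the trouble is that tie opens the database and may cache part of it before the flock is acquired, so another process can change the file in between. The usual workaround is to lock a separate file before tying and to release it only after untying. Here is a minimal sketch of that idea using only Fcntl; the helper name with_db_lock and the "$file.lock" path are my own assumptions for illustration, not something from the manpage.

```perl
use strict;
use warnings;
use Fcntl qw(:DEFAULT :flock);

# Hypothetical helper: hold a flock on a separate lock file while $code runs.
# $mode is 'write' (exclusive lock) or anything else (shared lock).
sub with_db_lock {
    my ($lockfile, $mode, $code) = @_;

    # O_CREAT lets the first caller create the lock file; its contents are never used.
    sysopen(my $lock_fh, $lockfile, O_RDWR|O_CREAT, 0644)
        or die "Can't open $lockfile: $!\n";
    flock($lock_fh, $mode eq 'write' ? LOCK_EX : LOCK_SH)
        or die "Can't flock $lockfile: $!\n";

    # The caller ties, uses, and unties the DB inside $code, so nothing is
    # cached before the lock is held and everything is flushed before release.
    my $result;
    my $ok  = eval { $result = $code->(); 1 };
    my $err = $@;

    close($lock_fh);    # closing the handle releases the lock
    die $err unless $ok;
    return $result;
}

# Usage sketch (assumes DB_File is loaded and $file names the DB):
#   with_db_lock("$file.lock", 'write', sub {
#       tie(my %DB, 'DB_File', $file, O_RDWR|O_CREAT, 0644, $DB_HASH)
#           or die "Can't tie $file: $!\n";
#       $DB{'foo'} = 'bar';
#       untie(%DB);
#   });
```

Like any flock scheme, this only helps if every script that touches the DB goes through the same lock file.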

Reading a little further down the manpage, I see a reference to a simple CPAN module called DB_File::Lock that transparently does flocking when you tie and untie the DB. It’s perfect for what I need.

Now I can simply do a search-and-replace throughout the entire codebase and change all DB_File references to DB_File::Lock, and get rid of whatever dup/flock stuff I used to use.


use DB_File::Lock;

my(%DB);
tie(%DB, 'DB_File::Lock', $file, O_RDWR|O_CREAT, 0644, $DB_HASH, 'write');
$DB{'foo'} = 'bar';
untie(%DB);

I’ve also considered moving the code from DBM files into MySQL. My ISP started offering limited MySQL access for an additional buck a month, and relational DBs tend to solve the concurrent access problem in a much more elegant (and consistent) way.

Unfortunately, it would be too much work. I don’t want to rewrite all of my 8-year-old Perl code that serializes an alumni record (just a bunch of key=value pairs) into a delimited string. And the DB access parts of the code aren’t very well abstracted, so switching from a simple hash DB format to a more structured multi-column format is going to be trickier than it seems.
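To give a feel for what that serialization involves, here is a rough sketch of packing a record's key=value pairs into one delimited string and back. It is not the actual directory code: the ';' delimiter, the percent-escaping, the function names, and the sample field names are all assumptions made for illustration.

```perl
use strict;
use warnings;

# Assumed format: key=value pairs joined by ';', with '%', ';' and '='
# percent-escaped so they can appear safely inside keys and values.

sub escape_field {
    my ($s) = @_;
    $s =~ s/([%;=])/sprintf('%%%02X', ord($1))/ge;
    return $s;
}

sub unescape_field {
    my ($s) = @_;
    $s =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    return $s;
}

# Turn a hash reference into a single string suitable for a DBM value.
sub serialize_record {
    my ($rec) = @_;
    my @pairs;
    for my $key (sort keys %$rec) {
        push @pairs, escape_field($key) . '=' . escape_field($rec->{$key});
    }
    return join(';', @pairs);
}

# Reverse of serialize_record: returns a hash reference.
sub parse_record {
    my ($str) = @_;
    my %rec;
    for my $pair (split /;/, $str) {
        my ($key, $value) = split /=/, $pair, 2;
        $rec{ unescape_field($key) } = unescape_field(defined $value ? $value : '');
    }
    return \%rec;
}
```

Every script that reads or writes a record depends on a format like this, which is why moving to real columns means touching all of them.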

Someday when I find the time to do a complete rewrite I’ll use MySQL as the backing store. And I’ll use that opportunity to get rid of all of my perl4-isms and replace them with appropriate perl5 constructs. Heck, if I delay long enough, perhaps I can go straight from perl4 to perl6! :-)

For now, DB_File::Lock is good enough.

Mikel Maron: Reactive Links

A superb idea today from Yahoo! alumnus Mikel Maron:

Reactive Links. Anytime someone click-thrus on these redirect links, the service records that action… more active links could be big and red and quiet links could small and blue, or whatever you like. These links change their character depending on their usage. [Brain Off]

It reminds me a little bit of an internal visualization our data mining group did, where a modified version of the Yahoo! homepage showed a click-percentage count next to each hyperlink on the page. You could pretty easily see that people were always interested in clicking on certain elements on the page (such as the word “Free”) and that you could also induce users to try different Yahoo! services by occasionally highlighting one of them (by displaying them in bold or with a background color).
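The arithmetic behind a click-percentage count is simple. Here is a minimal sketch, assuming you already have raw click counts per link; the function names and the link labels in the usage comment are made up, and this is not how the internal tool was actually built.

```perl
use strict;
use warnings;

# Hypothetical helper: given { link label => click count }, return
# { link label => share of all clicks, as a percentage }.
sub click_percentages {
    my ($clicks) = @_;

    my $total = 0;
    $total += $_ for values %$clicks;

    my %pct;
    return \%pct unless $total > 0;    # no clicks recorded: nothing to report

    for my $label (keys %$clicks) {
        $pct{$label} = 100 * $clicks->{$label} / $total;
    }
    return \%pct;
}

# Format a label the way the annotated page might show it, e.g. "Free (50.0%)".
sub annotate_label {
    my ($label, $pct) = @_;
    return sprintf('%s (%.1f%%)', $label, $pct);
}

# Usage sketch:
#   my $pct = click_percentages({ Free => 50, Mail => 30, News => 20 });
#   print annotate_label($_, $pct->{$_}), "\n" for sort keys %$pct;
```

The same percentages could drive a link's color or size instead of a printed number.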

Changing the size of the links is another interesting visualization technique, but it can throw off the page layout so much that it becomes distracting and less helpful.

Jerry’s Guide to the World Wide Web

At lunch today we were talking about trademarks and whether Yahoo! is a brand name or a generic term. Since it’s used in Chapter 1 of Gulliver’s Travels, it clearly pre-dates the web company. And the first use with an exclamation point probably comes from the Erasure song which was released in 1988 on The Innocents album.

We never quite sorted it out, but the discussion morphed into the history of the company. We wondered how many links there are still pointing at akebono.Stanford.Edu.

Now there’s one more. :-)

Hebrew Computing on Mac OS X

We’re thinking about buying a Mac.

One of the things that has been holding us up is lack of support for Hebrew software. Until Mac OS X 10.2 was released, the operating system didn’t even offer native support for Hebrew. However, we’re still waiting for some important applications (such as NisusWriter) to come out with OS X native releases.

Last week I saw an email to the hebrewcomputing Y! group that listed some good Hebrew software for “real Hebrew computing” on Mac OS X.

  • Mellel for word processing (full Hebrew support)
  • OS X Mail app for Hebrew email
  • Safari and Camino for Hebrew web browsing
  • iChat and icy juice for instant messaging in Hebrew
  • iCal for calendar with Hebrew support
  • OS X address book with its built-in Hebrew support
  • Keynote with the Hebrew template and direction services for Hebrew presentations

Now all we need are OS X editions of the Gemara and Tanach.

Logfile analyzers

I’ve just started using The Webalizer to do logfile analysis for radwin.org and hebcal.com.

Back in 1998 when I first started hosting my own domain name, I wanted to see where people were coming from and what they were viewing, so I set up the wwwstat script as a cron job to generate statistics.

I’ve never really liked the reports it gives, so last month I downloaded Analog, which claims to be “the most popular logfile analyser in the world.” It has every feature you could possibly imagine including graphs and charts, search referrer statistics, and even 31 different language output options. Perhaps because it’s so feature-rich, it is very difficult to compile and configure. I never got around to fixing my cron jobs to use it, in part because I couldn’t figure out how to send it data from stdin.

Last night Dave Jeske asked about log statistics for his blog (since I’m hosting it on my site for the time being) and I told him the URL to the crappy wwwstat page that I generate daily from cron. I warned him that my ISP rotates logfiles daily, so the page never shows more than the past 24 hours of statistics.

He pointed out that webalizer has a slick -p option that lets you preserve state so you can run it multiple times and it incrementally adjusts the statistics. Neat.

So I downloaded the source code, ran ./configure --prefix=/home/mradwin/local --with-etcdir=/home/mradwin/local/etc and make all install and I was up and running in 15 minutes!

It’s not as slick as Analog, but I don’t care. It does exactly what I want, and nothing more.

Thanks for the tip, Dave. Enjoy your new stats URL.

Dave Jeske: Linux on the Desktop

Dave Jeske, one of the many brilliant Yahoo! alumni I know, is new to the blogging world. Here’s his second entry:

Linux on the Desktop is a long way off, here’s why. I’m a UNIX developer and proud of it. I love the stability, scriptability, and remote administration capabilities of UNIX. I’ve built everything from small scale scripts to large web-applications running on hundreds of machines. However, I’ve never run UNIX/X as… [unsolicitedDave]

I’m looking forward to reading more from him.

What Every Software Engineer Should Know About Patents

Ariel Rogson from Marger Johnson & McCollom spoke at last night’s Los Angeles ACM meeting on software patents.

Here’s a brief outline of the main topics he covered:

  • What is a patent?
  • Patents vs. Copyright
  • 4 Requirements for a patent
  • Is software patentable?
  • Should I bother with a patent?
  • “Patent Pending”
  • Audience for the text of patent
  • Components of patent application
  • “Enablement” requirement
  • Deadlines for patenting
  • Prior Art
  • Provisional Patents
  • Financial Costs
  • How to draft a patent specification
  • Include source code in your application?
  • Open Source vs. Patents
  • Infringement
  • Defenses against infringement
  • Advice for managers

I already knew a whole bunch of this stuff since I’ve been through the process before and I’ve taken an Intellectual Property class at UCLA.

Something new I learned about was the “prior use” defense against infringement. Apparently it was created in 1999 but has yet to be tested in a court of law, in large part because there are some highly technical limitations associated with its use.

The way I understand it, the prior use defense may apply if you reduced to practice the invention at least one year before the filing date, AND you were using it commercially before the filing date. But apparently it’s tricky to use.

Rogson said that the more common defenses against infringement were either invalidity or non-infringement.

With invalidity, you argue that the examiner failed to consider some prior art that would have prevented the patent from issuing. The difficulty with this defense is that the defendant has the burden of proof to show that the patent is invalid, since a patent is presumed valid if it has issued.

A non-infringement defense argues that the patent is not infringed upon because the defendant is simply doing something different from what the patent describes.

Using passive voice for moral neutrality

It drives me crazy when I read a headline that says “3 die in bombing” and it turns out that one of the three people was the suicide bomber himself.

Instead, how about “2 killed,” or better yet, “2 murdered?”

The press often writes headlines in a way that implies that the suicide bomber just happened to be there and accidentally got killed like everyone else. But his death is not morally equivalent to the deaths of the victims of the bombing. He is a murderer.

Murderers don’t deserve to get counted. They’re not victims; they’re criminals.

By using the passive voice, newspapers claim that they’re being objective. But really, what’s so subjective about condemning murder? There are moral absolutes in this world.