DBM text export formats

DBM-style flat files are great, but sometimes it’s hard to deal with binary formats. Having the data in a textual format can be very handy for transferring it between platforms, modifying it in your favorite editor, crunching it with standard tools like grep, etc. There are a couple of text-friendly DBM export formats out there, each with advantages and disadvantages.

Suppose you have a DBM hash that contains two key/value pairs like the following:

one=Hello
two=Goodbye

Good ol’ Berkeley DB comes with a utility called db_dump, which lets you export a binary database to a text format; the companion db_load tool imports the data again. It’s easiest to read your data when you use the -p option. Here’s the dump of a simple database with two records:

format=print
type=hash
h_nelem=5
db_pagesize=512
HEADER=END
one
Hello
two
Goodbye
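To give a feel for how simple this format is to consume, here is a minimal Python sketch that reads a -p dump into a dictionary. The function name parse_db_dump is made up for illustration. It assumes what I remember of the real tool's output, which the listing above doesn't show: each key and data line starts with a single space, and the records end with a DATA=END line. It does not decode backslash escapes.

```python
def parse_db_dump(text):
    """Parse `db_dump -p` style output into a dict (illustrative sketch)."""
    lines = text.splitlines()
    # Everything up to and including HEADER=END is metadata.
    start = lines.index("HEADER=END") + 1
    body = []
    for line in lines[start:]:
        if line == "DATA=END":  # assumed end-of-records marker
            break
        # Assumption: data lines carry one leading space; tolerate its absence.
        body.append(line[1:] if line.startswith(" ") else line)
    if len(body) % 2 != 0:
        raise ValueError("dump has a key without a matching data line")
    # Keys and data alternate, one per line.
    return dict(zip(body[0::2], body[1::2]))
```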

db_dump is pretty easy to use as it is, but it becomes a little cumbersome when you’ve got non-printable characters to display (control characters, newlines, and anything that isn’t 7-bit clean). You end up with a dump that looks like this:

Erev Pesach
\d7\a2\d6\b6\d7\a8\d6\b6\d7\91 \d7\a4\d6\bc\d6\b6\d7\a1\d6\b7\d7\97
Tu B'Shvat
\d7\98\d7\95\d6\bc \d7\91\d6\bc\d6\b4\d7\a9\d7\81\d6\b0\d7\91\d6\b8\d7\98
Bamidbar
\d7\91\d6\bc\d6\b0\d7\9e\d6\b4\d7\93\d6\b0\d7\91\d6\bc\d6\b7\d7\a8

You don’t lose any information, but it becomes impossible to work with when you’ve got UTF-8 data and you want to be able to edit it in your favorite Unicode-savvy editor.
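If you just want to read such a dump, the escapes are easy to undo. Here is a rough Python sketch, assuming each escape is a backslash followed by two hex digits, that a literal backslash is written as two backslashes, and that the underlying bytes are UTF-8. The name decode_db_dump_value is hypothetical.

```python
import re

# A backslash followed by either another backslash or two hex digits.
_ESCAPE = re.compile(rb"\\(\\|[0-9a-fA-F]{2})")


def decode_db_dump_value(line):
    """Turn one escaped dump line back into a Unicode string (sketch)."""
    raw = line.encode("latin-1")  # the dump itself is plain ASCII

    def _replace(match):
        token = match.group(1)
        if token == b"\\":
            return b"\\"  # escaped literal backslash
        return bytes([int(token, 16)])  # \d7 -> byte 0xd7

    # Assumption: the stored value is UTF-8 encoded text.
    return _ESCAPE.sub(_replace, raw).decode("utf-8")
```

This only helps with viewing. To load edited text back with db_load you would need the inverse transformation as well.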

Perl hackers are probably familiar with Data::Dumper, which looks like this:

$VAR1 = {
  'one' => 'Hello',
  'two' => 'Goodbye'
};

Data::Dumper is easier than db_dump to use with your favorite text-centric tools, and it has the advantage that it keeps each key/value pair together on the same line (handy for grep). Unfortunately, it’s very Perl-centric; you’re intended to load the data by calling eval(). I suppose you could write a parser in C that understood that format pretty easily and you could use it in non-Perl programs.
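As a rough illustration of how little a non-Perl reader needs, here is a Python sketch (rather than C) that handles only the simplest case: a flat hash whose keys and values are all single-quoted strings. The name parse_dumper_hash is made up, and nested structures, unquoted numbers, and other Data::Dumper output styles are deliberately not covered.

```python
import re

# A Perl single-quoted string: anything except ' or \, or a backslash pair.
_QUOTED = r"'((?:[^'\\]|\\.)*)'"
_PAIR = re.compile(_QUOTED + r"\s*=>\s*" + _QUOTED, re.DOTALL)


def _unescape(s):
    # In single-quoted Perl strings only \\ and \' are escapes.
    return re.sub(r"\\([\\'])", r"\1", s)


def parse_dumper_hash(text):
    """Extract 'key' => 'value' pairs from a flat Data::Dumper hash (sketch)."""
    return {_unescape(k): _unescape(v) for k, v in _PAIR.findall(text)}
```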

On one of the mailing lists at work today someone mentioned the cdb constant database format. I took a look at the page and was amused to see the cdbdump record format. It’s an interesting alternative to db_dump’s format and works nicely with UTF-8.

+3,5:one->Hello
+3,7:two->Goodbye

It’s a pretty concise format, and it’s totally 8-bit friendly. The key and data may contain any characters, including colons, dashes, newlines, and nulls. As a consequence it’s very easy to write generators and parsers for this format, and they’re typically very efficient. Like Data::Dumper, it keeps key/value together on the same line.
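To show why the explicit lengths make parsing easy, here is a small Python sketch of a reader for this record format. The function name parse_cdbdump is hypothetical. It works on bytes, treats the lengths as byte counts, and stops at the first line that doesn't start with a plus sign (the format ends its data with an extra newline).

```python
def parse_cdbdump(buf):
    """Yield (key, data) byte pairs from cdbdump-style records (sketch)."""
    pos = 0
    while buf[pos:pos + 1] == b"+":
        comma = buf.index(b",", pos)
        colon = buf.index(b":", comma)
        klen = int(buf[pos + 1:comma])
        dlen = int(buf[comma + 1:colon])
        # The lengths say exactly where the key and data end, so their
        # contents never need to be scanned or escaped.
        kstart = colon + 1
        key = buf[kstart:kstart + klen]
        if buf[kstart + klen:kstart + klen + 2] != b"->":
            raise ValueError("expected '->' after key")
        dstart = kstart + klen + 2
        data = buf[dstart:dstart + dlen]
        if buf[dstart + dlen:dstart + dlen + 1] != b"\n":
            raise ValueError("expected newline after data")
        yield key, data
        pos = dstart + dlen + 1
```

Because nothing inside the key or data is interpreted, a key such as a->b or a value containing newlines round-trips without special handling.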

One disadvantage of the cdbdump format is that it uses explicit integer lengths, so it’s not very friendly for editing data in a text editor (every change you make requires that you fix up the lengths at the beginning of the line).
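One way around that is to edit the data in a simpler form and generate the lengths mechanically. The Python sketch below is only an illustration: the function name to_cdbdump is made up, and it assumes key=value lines like the example at the top of this post, with keys that contain no equals sign, no embedded newlines anywhere, and UTF-8 as the encoding.

```python
def to_cdbdump(lines):
    """Convert editable 'key=value' lines into cdbdump records (sketch)."""
    out = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines in the editable file
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError("line has no '=': %r" % line)
        k = key.encode("utf-8")
        v = value.encode("utf-8")
        # Lengths are byte counts, not character counts.
        out.append(b"+%d,%d:%s->%s\n" % (len(k), len(v), k, v))
    out.append(b"\n")  # an extra newline marks the end of the data
    return b"".join(out)
```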

Introducing Chomsky

I was stuck in bed this weekend with a cold, so Ariella bought me a copy of Introducing Chomsky, 2nd Edition. Between naps and feeling sorry for myself, I managed to read the entire book.

Since I’ve already read a few of Steven Pinker’s books, the first half of this book (Chomsky’s theories on linguistics) seemed a little redundant to me. However, since I had never before studied Chomsky the political/social theorist, the second half of the book was definitely enlightening.

Compared to my experience with other books in this series (Introducing Semiotics and Introducing Postmodernism in particular), I was a little disappointed with Introducing Chomsky. This one was not nearly as clear or entertaining as the others.

Still, I haven’t given up on Icon Books. I’m ordering a used copy of Introducing Einstein so it can sit on my shelf next to A Brief History of Time and the dozens of other books I’ll get around to reading “someday.”

UC Teaching Assistants Strike

Ariella and 10,999 other University of California teaching assistants will strike on Thursday.

Although Los Angeles has had its share of labor disputes recently (the MTA bus strike was settled a couple of weeks ago and the Ralph’s/Vons/Albertson’s strike/lockout has been going on for 8 weeks now), this thing is gonna be statewide. TAs from all eight UC campuses will stop grading papers and exams. Just in time for finals!

[Update 03 December: the union has reached a tentative agreement with the UC so the strike has been called off.]

Off-Roading at Hungry Valley

I went off-roading today with Rob and Dan at the Hungry Valley State Vehicular Recreation Area. We were out in the 4WD practice area for all of 15 minutes when Rob decided to get us stuck in the mud:

PB300637.jpg

Some other brilliant guy tried to get us out, but he got stuck even worse:

PB300641.jpg

A guy in a Suburban eventually came by and pulled us out with a tow rope. Set free, we proceeded to drive around the mud for a little while longer, being careful to avoid the steep part of the pit. Twenty minutes later the Suburban itself got stuck so we backed in and towed it out. Do you see a pattern here?

We left the mud area and hit the trails. After roaming around the park for another 4 hours, we finally headed home. It was a total blast.

Cancelled again

Ugh. I’ve been cancelled again.

I was planning to give an Apache-related talk to a bunch of Yahoo! engineers in Sunnyvale next Thursday, but someone else has stolen the conference room from me.

Yahoo! Engineering has a great tradition of “Thursday Lunchtime Tech Talks.” Every Thursday we reserve a big conference room upstairs from the cafeteria and someone gives a tutorial or a presentation on a technical subject while a handful of interested engineers listen and learn. It’s a great opportunity to meet people you’ve only corresponded with over email, and very frequently you learn something about how to solve a particular problem that comes in handy.

In my 5 years at the company I’ve probably done 6 or 7 talks, mostly relating to ad-targeting, Apache, PHP, and our proprietary package-management tool.

I’ve actually been planning to give this Apache talk since early September and have had 3 separate dates reserved for it. But each time I’ve been postponed by a few weeks due to a room conflict. Next week there’s some sort of three-day conference that wants to use the room.

So, I’ve been rescheduled for January 8, 2004. I wonder if I’ll get preempted by the Q4 2003 earnings announcement…

Thanksgiving travel preview

I flew back from SJC to LAX tonight and got a taste of what air travel is going to look like tomorrow, the busiest air travel day of the year.

  • It was busy. Very busy. Two years after 9/11 it appears that Americans are no longer afraid of airplanes.
  • Instead of the usual business traveller crowd, my plane was filled with college students. Every single one of them had an iPod. I simultaneously felt very much a part of American pop culture (I’ve had one for almost 6 months now) and also a little bit old. No Nomads or Dell Jukeboxes to be found. The hot peroxide blonde sitting in the middle seat next to me was talking to the guy on her other side (an Amerasian with acne) about what he was majoring in. Both had iPods. And apparently both think the quarter system is better than the semester system.
  • TSA and the airport rent-a-cop security screeners seemed ready for the deluge of travellers. There were many more metal detectors and X-ray machines open than there usually are. Throughput was very good.

If you’re travelling on Wednesday, get there early and don’t forget your iPod.

Emacs and *.tar.bz2 files

I’ve been seeing more and more bzip2-compressed files these days, and I want to be able to open these files in GNU Emacs without the need to decompress them.

About 10 years ago I copied someone’s ~/.emacs file and noticed some mention of a crypt++ module. I asked them what it did and they told me that it allowed them to view *.gz files in an Emacs buffer by doing the decoding on-the-fly. Combined with the built-in support for tar-mode, this is very handy.

I’ve been using it to browse *.tar.gz and *.tgz files since the emacs-19.34 days, but today I needed to view the source code of php-4.3.4.tar.bz2 and it didn’t work.

After a little bit of investigation, it turned out that the ancient version of crypt++.el I’ve been using for the past decade didn’t support bzip2 files. So I went and grabbed the latest version (2.92, released January 2003) and added the following 6 lines to my ~/.emacs file:

(require 'crypt++)
(modify-coding-system-alist 'file "\\.gz\\'" 'no-conversion)
(modify-coding-system-alist 'file "\\.Z\\'" 'no-conversion)
(modify-coding-system-alist 'file "\\.gpg\\'" 'no-conversion)
(modify-coding-system-alist 'file "\\.bz\\'" 'no-conversion)
(modify-coding-system-alist 'file "\\.bz2\\'" 'no-conversion)

Voilà! It works!

It turns out that GNU Emacs 20 and later have native support for handling compressed files, so all you really need is this:

(auto-compression-mode t)

But I’m still kinda attached to using crypt++ because I occasionally use the built-in PGP support.