Michael J. Radwin

Tales of a software engineer who keeps kosher and hates the web.

DBM text export formats

DBM-style flat files are great, but sometimes it’s hard to deal with binary formats. Having the data in a textual format can be very handy for transferring between different platforms, modifying in your favorite editor, crunching with standard tools like grep, etc. There are a few text-friendly DBM export formats out there, each with advantages and disadvantages.

Suppose you have a DBM hash that contains two key/value pairs like the following:

one=Hello

two=Goodbye

Good ol’ Berkeley DB comes with a utility called db_dump, which exports a binary database to a text format; the companion db_load tool imports that text back into a database. It’s easiest to see your data when you use the -p option. Here’s the dump of a simple database with two records:

format=print

type=hash

h_nelem=5

db_pagesize=512

HEADER=END

one

Hello

two

Goodbye
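If you want to get at that dump from a script, here is a minimal Python sketch of a reader. The function name parse_db_dump is made up for illustration, and it assumes a few things the sample above doesn't show: that a real dump may prefix each key and data line with a single space and end with a DATA=END line, and that keys and values contain no escaped characters.

```python
def parse_db_dump(text):
    """Split `db_dump -p` style text into (header, records).

    Assumptions: header lines are name=value up to HEADER=END, then keys
    and data alternate one per line; a leading space and a trailing
    DATA=END line are tolerated if present. Escapes are left untouched.
    """
    lines = text.splitlines()
    header = {}
    i = 0
    while i < len(lines):
        line = lines[i]
        i += 1
        if line == "HEADER=END":
            break
        name, _, value = line.partition("=")
        header[name] = value

    body = []
    for line in lines[i:]:
        if line == "DATA=END":
            break
        # Drop the single leading space, if the dump has one.
        body.append(line[1:] if line.startswith(" ") else line)

    if len(body) % 2:
        raise ValueError("dump has a key without matching data")
    # Even-numbered lines are keys, odd-numbered lines are their data.
    records = dict(zip(body[0::2], body[1::2]))
    return header, records
```

The header comes back as plain strings (so h_nelem is "5", not 5); converting types is left to the caller.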

db_dump is pretty easy to use as it is, but it becomes a little cumbersome when you’ve got non-printable characters to display (control characters, newlines, and anything that isn’t 7-bit clean). You end up with a dump that looks like this:

Erev Pesach

\d7\a2\d6\b6\d7\a8\d6\b6\d7\91 \d7\a4\d6\bc\d6\b6\d7\a1\d6\b7\d7\97

Tu B'Shvat

\d7\98\d7\95\d6\bc \d7\91\d6\bc\d6\b4\d7\a9\d7\81\d6\b0\d7\91\d6\b8\d7\98

Bamidbar

\d7\91\d6\bc\d6\b0\d7\9e\d6\b4\d7\93\d6\b0\d7\91\d6\bc\d6\b7\d7\a8

You don’t lose any information, but it becomes impossible to work with when you’ve got UTF-8 data and you want to be able to edit it in your favorite Unicode-savvy editor.
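To at least read those values, you can undo the escaping yourself. The following Python sketch assumes the -p output escapes each non-printable byte as a backslash followed by two hex digits, and a literal backslash as two backslashes; decode_printable is a hypothetical helper name, not part of Berkeley DB.

```python
import re

# A backslash followed by either another backslash or two hex digits.
_ESCAPE = re.compile(rb"\\(\\|[0-9a-fA-F]{2})")


def decode_printable(line):
    """Turn one escaped line of `db_dump -p` output back into UTF-8 text."""
    raw = line.encode("ascii")  # the escaped form is assumed to be pure ASCII

    def unescape(match):
        token = match.group(1)
        if token == b"\\":
            return b"\\"                  # escaped backslash
        return bytes([int(token, 16)])    # two hex digits -> one byte

    return _ESCAPE.sub(unescape, raw).decode("utf-8")
```

Feeding it the line under Bamidbar above gives back the Hebrew spelling, which you can then edit normally. Going the other direction (re-escaping before db_load) is not shown here.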

Perl hackers are probably familiar with Data::Dumper, whose output looks like this:

$VAR1 = {

'one' => 'Hello',

'two' => 'Goodbye'

};

Data::Dumper is easier than db_dump to use with your favorite text-centric tools, and it has the advantage that it keeps each key/value pair together on the same line (handy for grep). Unfortunately, it’s very Perl-centric; the intended way to load the data is to call eval() on it. I suppose you could write a parser in C that understood that format pretty easily, and then you could use it in non-Perl programs.
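As a rough idea of how small such a parser can be, here is a sketch in Python rather than C. It only handles the flat case shown above: one level of hash whose keys and values are all single-quoted strings, where (as far as I know) Data::Dumper escapes only the backslash and the single quote. Nested structures, numbers, and undef are out of scope, and parse_flat_dumper is a name invented for this example.

```python
import re

# 'key' => 'value', where either side may contain \' or \\ escapes.
_PAIR = re.compile(
    r"'((?:[^'\\]|\\.)*)'\s*=>\s*'((?:[^'\\]|\\.)*)'",
    re.DOTALL,
)
# Inside Perl single quotes only \' and \\ are special.
_UNESCAPE = re.compile(r"\\(['\\])")


def parse_flat_dumper(text):
    """Extract single-quoted key/value pairs from flat Data::Dumper output."""
    def unquote(s):
        return _UNESCAPE.sub(r"\1", s)

    return {unquote(k): unquote(v) for k, v in _PAIR.findall(text)}
```

Anything that isn't a quoted pair (the $VAR1 = { line, the closing brace) is simply skipped, which is convenient but also means malformed input goes unnoticed.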

On one of the mailing lists at work today someone mentioned the cdb constant database format. I took a look at the page and was amused to see the cdbdump record format. It’s an interesting alternative to db_dump’s format and works nicely with UTF-8.

+3,5:one->Hello

+3,7:two->Goodbye

 

It’s a pretty concise format, and it’s totally 8-bit friendly. Because every record starts with the byte lengths of its key and data, the key and data may contain any characters, including colons, dashes, newlines, and nulls. The explicit lengths also make generators and parsers very easy to write, and they’re typically very efficient. Like Data::Dumper, it keeps key/value together on the same line (as long as the data itself contains no newlines).
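To show how little code that takes, here is a Python sketch of a generator and a parser. It follows the record layout shown above (+klen,dlen:key->data, one newline after each record) and assumes the dump ends with one extra blank line; the names dump_records and parse_records are mine, not part of cdb.

```python
def dump_records(records):
    """Serialize (key, data) byte-string pairs in cdbdump style."""
    out = bytearray()
    for key, data in records:
        out += b"+%d,%d:" % (len(key), len(data))  # lengths are in bytes
        out += key + b"->" + data + b"\n"
    out += b"\n"  # assumed end-of-dump marker: one empty line
    return bytes(out)


def parse_records(blob):
    """Parse cdbdump-style bytes back into a list of (key, data) pairs."""
    records = []
    pos = 0
    while blob[pos:pos + 1] == b"+":
        comma = blob.index(b",", pos)
        colon = blob.index(b":", comma)
        klen = int(blob[pos + 1:comma])
        dlen = int(blob[comma + 1:colon])
        kstart = colon + 1
        arrow = kstart + klen
        if blob[arrow:arrow + 2] != b"->":
            raise ValueError("missing '->' after key")
        dstart = arrow + 2
        end = dstart + dlen
        if blob[end:end + 1] != b"\n":
            raise ValueError("missing newline after data")
        # Slice by length, so the content itself is never scanned.
        records.append((blob[kstart:arrow], blob[dstart:end]))
        pos = end + 1
    return records
```

The parser never searches inside a key or value for a delimiter, which is why colons, arrows, newlines, and nulls in the data are harmless.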

One disadvantage of the cdbdump format is that it uses explicit integer lengths, so it’s not very friendly for editing data in a text editor: every change you make requires that you fix up the lengths at the beginning of the line.
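One workaround is to edit freely and then recompute the lengths with a small filter. The Python sketch below does that for a single line, under two assumptions that the format itself does not guarantee: each record sits on exactly one line, and the key does not contain "->". The name fix_lengths is hypothetical.

```python
def fix_lengths(line):
    """Recompute the +klen,dlen prefix of one hand-edited record line.

    `line` is a bytes object without its trailing newline. The old
    lengths are thrown away; the key is taken to end at the first '->'.
    """
    if not line.startswith(b"+"):
        return line  # not a record line (e.g. the final blank line)
    _, _, rest = line.partition(b":")      # drop the stale "+klen,dlen"
    key, arrow, data = rest.partition(b"->")
    if not arrow:
        raise ValueError("record line has no '->' separator")
    return b"+%d,%d:%s->%s" % (len(key), len(data), key, data)
```

For example, after changing Hello to Hello there, the stale +3,5 prefix becomes +3,11. Records whose data spans several lines would need a smarter tool than this.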