Yahoo! Year in Review

Here’s my own personal Yahoo! Year in Review (i.e. the highlights of my job in 2003):

  • In January, I started my new role as an Engineering Manager, leading my own 3-person team. The change of job responsibilities was just the thing to get me charged up about work again.
  • A month later I finally got rid of my home office and started working in Santa Monica at the Yahoo! LAUNCH office.
  • In March I got the good news that my Targeted Advertising Patent application had been published by the U.S. Patent and Trademark Office. We submitted the application back in the Fall of 2001, but it takes the USPTO a long time to review these things. Hopefully the patent will issue in 2004.
  • That same month I got news from my boss that my group (developer tools and core software infrastructure) was going to be growing in size. We started interviewing candidates, and by the beginning of August my group had grown from 3 to 7 engineers.
  • In July, I spoke at OSCON 2003 in Portland. I got to have a beer with my buddy Sam Jackson, whom I hadn’t seen in about 5 years, and I met a few cool folks like David Sklar and Adam Trachtenberg.
  • I spent a good deal of time in July and August training Andrei and Ryan, the two newest members of my team.
  • I gave my “One Year of PHP at Yahoo!” talk at PHPCon West 2003 in Santa Clara. This conference was more schmoozing than sessions; I spent quality time with Ze’ev, Rinat, Zak, George, Sterling, Thies, Shane, James, Luke & Laura, and local Yahoo!s Andrei and Brian. I also met Brian from Microsoft, who seemed like a really nice guy.
  • I spent most of the remaining part of the year working on annual performance reviews. It was amazingly difficult and time-consuming, but I’m hopeful that it was worth the effort. The opportunity to reflect upon my group’s work over the past year made me proud of our accomplishments.
  • As a result of my new people-management job, I didn’t manage to write too much code this past year. Our CVSdb checkin database shows that I added 7,257 LOC to the codebase this year, compared to 21,928 LOC in 2002.

There is only an hour and a half until the New Year, so I think that’s enough for 2003.

DBM text export formats

DBM-style flat files are great, but sometimes it’s hard to deal with binary formats. Having data in a textual format can be very handy for transferring between different platforms, modifying in your favorite editor, crunching with standard tools like grep, etc. There are a couple of text-friendly DBM export formats out there, each with advantages and disadvantages.

Suppose you have a DBM hash that contains two key/value pairs like the following:

one=Hello

two=Goodbye

Good ol’ Berkeley DB comes with a utility called db_dump, which exports a binary database to a text format; the companion db_load tool imports the data again. It’s easiest to see your data when you use the -p option. Here’s the dump of a simple database with those two records:

format=print

type=hash

h_nelem=5

db_pagesize=512

HEADER=END

one

Hello

two

Goodbye

db_dump is pretty easy to use as it is, but it becomes a little cumbersome when you’ve got non-printable characters to display (control characters, newlines, and anything that isn’t 7-bit clean). You end up with a dump that looks like this:

Erev Pesach

\d7\a2\d6\b6\d7\a8\d6\b6\d7\91 \d7\a4\d6\bc\d6\b6\d7\a1\d6\b7\d7\97

Tu B'Shvat

\d7\98\d7\95\d6\bc \d7\91\d6\bc\d6\b4\d7\a9\d7\81\d6\b0\d7\91\d6\b8\d7\98

Bamidbar

\d7\91\d6\bc\d6\b0\d7\9e\d6\b4\d7\93\d6\b0\d7\91\d6\bc\d6\b7\d7\a8

You don’t lose any information, but it becomes impossible to work with when you’ve got UTF-8 data and you want to be able to edit it in your favorite Unicode-savvy editor.
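If you just need the original bytes back, undoing the escaping is not much work. Here’s a rough Python sketch, not part of Berkeley DB itself. It assumes the -p convention of writing a backslash as two backslashes and any other non-printable byte as a backslash followed by two hex digits, and it assumes you pass in a single key or value line with any leading whitespace already stripped. The name decode_printable is made up for this example.

```python
import re

# Matches an escaped backslash, or a backslash followed by two hex digits.
_ESCAPE = re.compile(rb"\\(\\|[0-9a-fA-F]{2})")


def decode_printable(line: bytes) -> bytes:
    """Turn one escaped key or value line back into raw bytes.

    Hypothetical helper: assumes the "\\hh" / "\\\\" escaping described above.
    """
    def replace(match):
        token = match.group(1)
        if token == b"\\":
            return b"\\"                 # "\\" stands for one literal backslash
        return bytes([int(token, 16)])   # "\hh" stands for the byte 0xhh

    return _ESCAPE.sub(replace, line)


# The first few escaped bytes of one of the values shown above.
escaped = rb"\d7\a2\d6\b6\d7\a8"
raw = decode_printable(escaped)
text = raw.decode("utf-8")  # readable again in a Unicode-savvy editor
```

This only helps with reading the data. To get edited text back into db_load you would have to re-escape it the same way.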

Perl hackers are probably familiar with Data::Dumper, which looks like this:

$VAR1 = {

'one' => 'Hello',

'two' => 'Goodbye'

};

Data::Dumper is easier than db_dump to use with your favorite text-centric tools, and it has the advantage that it keeps each key/value pair together on the same line (handy for grep). Unfortunately, it’s very Perl-centric; you’re intended to load the data by calling eval(). I suppose you could write a parser in C that understood that format pretty easily and you could use it in non-Perl programs.
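To give a feel for how little a non-Perl reader needs, here’s a Python sketch (rather than C) that handles only the simple case shown above: one flat hash whose keys and values are all single-quoted strings. It’s an illustration under those assumptions, not a general Data::Dumper parser; nested structures, unquoted numbers, and double-quoted strings are all ignored. The name parse_flat_dump is made up for this example.

```python
import re

# One 'key' => 'value' pair. Both sides are Perl single-quoted strings,
# in which a backslash may escape the following character.
_PAIR = re.compile(
    r"'((?:[^'\\]|\\.)*)'\s*=>\s*'((?:[^'\\]|\\.)*)'",
    re.DOTALL,
)


def _unquote(s: str) -> str:
    # In a Perl single-quoted string only \\ and \' are escapes;
    # any other backslash is kept literally.
    return re.sub(r"\\([\\'])", r"\1", s)


def parse_flat_dump(text: str) -> dict:
    """Parse a flat, single-quoted Data::Dumper hash into a dict (sketch)."""
    return {_unquote(k): _unquote(v) for k, v in _PAIR.findall(text)}


sample = "$VAR1 = {\n  'one' => 'Hello',\n  'two' => 'Goodbye'\n};\n"
records = parse_flat_dump(sample)
```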

On one of the mailing lists at work today someone mentioned the cdb constant database format. I took a look at the page and was amused to see the cdbdump record format. It’s an interesting alternative to db_dump’s format and works nicely with UTF-8.

+3,5:one->Hello

+3,7:two->Goodbye

 

It’s a pretty concise format, and it’s totally 8-bit friendly. Because each record starts with the explicit lengths of its key and data, both may contain any characters, including colons, dashes, newlines, and nulls. As a consequence it’s very easy to write generators and parsers for this format, and they’re typically very efficient. Like Data::Dumper, it keeps each key/value pair together on the same line (as long as the data itself contains no newlines).

One disadvantage of the cdbdump format is that it uses explicit integer lengths, so it’s not very friendly for editing data in a text editor (every change you make requires that you fix up the lengths at the beginning of the line).
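To show how simple a generator and parser can be, here’s a Python sketch of both. It assumes records look exactly like the ones above (+klen,dlen:key->data followed by a newline) and that the dump ends with one extra newline. The function names are made up for this example, and everything works on bytes so that 8-bit data passes through untouched.

```python
import io


def write_records(pairs, out):
    """Write (key, data) byte pairs as cdbdump-style records."""
    for key, data in pairs:
        # The lengths are computed here, so nobody has to count by hand.
        out.write(b"+%d,%d:%s->%s\n" % (len(key), len(data), key, data))
    out.write(b"\n")  # an extra newline marks the end of the dump


def _read_int(stream, terminator):
    """Read decimal digits up to (and consuming) the terminator byte."""
    digits = b""
    while True:
        ch = stream.read(1)
        if ch == terminator:
            return int(digits)
        if not ch.isdigit():
            raise ValueError("malformed length field")
        digits += ch


def read_records(stream):
    """Yield (key, data) byte pairs from a cdbdump-style stream."""
    while True:
        lead = stream.read(1)
        if lead in (b"\n", b""):  # end-of-dump marker, or end of file
            return
        if lead != b"+":
            raise ValueError("record must start with '+'")
        klen = _read_int(stream, b",")
        dlen = _read_int(stream, b":")
        key = stream.read(klen)   # lengths, not delimiters, bound the fields
        if stream.read(2) != b"->":
            raise ValueError("missing '->' separator")
        data = stream.read(dlen)
        if stream.read(1) != b"\n":
            raise ValueError("missing newline after record")
        yield key, data


pairs = [(b"one", b"Hello"), (b"two", b"Goodbye")]
buf = io.BytesIO()
write_records(pairs, buf)
round_trip = list(read_records(io.BytesIO(buf.getvalue())))
```

Because the parser trusts the lengths rather than scanning for delimiters, a key like a:b->c or data containing newlines and nulls round-trips without any escaping. That same trust is why a hand-edited record with stale lengths breaks.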

Introducing Chomsky

I was stuck in bed this weekend with a cold, so Ariella bought me a copy of Introducing Chomsky, 2nd Edition. Between naps and feeling sorry for myself, I managed to read the entire book.

Since I’ve already read a few of Steven Pinker’s books, the first half of this book (Chomsky’s theories on linguistics) seemed a little redundant to me. However, since I had never before studied Chomsky the political/social theorist, the second half of the book was definitely enlightening.

Compared to my experience with other books in this series (Introducing Semiotics and Introducing Postmodernism in particular), I was a little disappointed with Introducing Chomsky. This one was not nearly as clear or entertaining as the others.

Still, I haven’t given up on Icon Books. I’m ordering a used copy of Introducing Einstein so it can sit on my shelf next to A Brief History of Time and the dozens of other books I’ll get around to reading “someday.”

UC Teaching Assistants Strike

Ariella and 10,999 other University of California teaching assistants will strike on Thursday.

Although Los Angeles has had its share of labor disputes recently (the MTA bus strike was settled a couple of weeks ago and the Ralphs/Vons/Albertsons strike/lockout has been going on for 8 weeks now), this thing is gonna be statewide. TAs from all eight UC campuses will stop grading papers and exams. Just in time for finals!

[Update 03 December: the union has reached a tentative agreement with the UC so the strike has been called off.]