Perl Lightning Talks

Wandering around after lunch, I stopped by the Perl Lightning Talks (slides) session. I was delighted to hear Autrijus Tang’s five-minute rap These are 1% of my favourite CPAN… in Chinese, followed by an English translation sung to the tune of These are a few of my favorite things… from The Sound of Music.

It was incredible. Standing ovation.

Allison Randal’s lightning talk was a parody of Arlo Guthrie’s Alice’s Restaurant. “You can get anything you want / in Perl 6 development.” Clever, but Autrijus is a hard act to follow.

Also notable was Dave Rolsky’s talk on DateTime. Dave, like my friends Gabriel and Rachel, is from Minnesota.

OSCON Wednesday morning

I bounced around on Wednesday between a bunch of different sessions. In the morning, I did some last-minute touch-ups on my slides, then caught the tail end of John Coggeshall’s Interfacing Java / COM with PHP. After my talk on One Year of PHP at Yahoo! (slides), I grabbed some lunch in the speakers’ room. Shane asked me to collect some feedback from my co-workers about Komodo since they’re starting to think about what might go into their 3.0 release.

I showed up a little bit late for Adam Trachtenberg’s Introduction to Web Services in PHP: SOAP versus REST talk, and the room was so packed that I couldn’t find a seat. So I stuck my head inside Zak and Monty’s Guided Tour of the MySQL Source Code to catch an updated version of the talk they gave at the users conference in April.

I also checked out Shane’s Introduction to PEAR talk, but the conference room had run out of seats again. Too bad they didn’t pick a bigger room for the PHP talks this year.

Tim O’Reilly: Paradigm Shift

Tim O’Reilly gave this morning’s keynote address, “The Open Source Paradigm Shift”. The talk was reminiscent of last year’s Watching The Alpha Geeks keynote at ApacheCon, although now he is able to say the phrase “paradigm shift” with a straight face.

The talk largely made the case that we shouldn’t think about Open Source software in terms of the traditional commercial software business model. Instead, we should recognize that the software (to some extent) has become a commodity, just like hardware has become a commodity. The true value in Open Source is in the businesses that grow up around it. For example, nobody pays for Sendmail or Apache, yet thousands of ISPs make money from providing web/email hosting services for their customers.

His charge to the audience was to embrace the fact that Open Source software has become a commodity, and to start to think of it (and all of the services that have grown up around it) as a platform. If we can develop services that support collaboration and end-user customization, and the data flows freely enough, we’ll somehow find a way to feed our families.

McWireless

I read today that McDonald’s is doing 802.11 in the San Francisco Bay Area.

“McDonald’s and Wayport Bring High-Speed Wireless Access to 75 Restaurants in the San Francisco Bay Area… McDonald’s is the first quick service restaurant to offer high-speed wireless access in a major market. The new Wi-Fi service will be available at approximately 75 McDonald’s restaurants around the Bay area with the first 55 going ‘live’ today.”

See also the article on News.com.

I don’t care. I won’t eat there because they’re not vegetarian-friendly.

Building Data Warehouses with MySQL

John Ashenfelter spoke about Building Data Warehouses with MySQL. After surveying the audience with some questions about what database technology people use and how much data they store, he described what he felt was the one and only reason to create a data warehouse: to answer business questions.

The first two-thirds of the talk discussed DW in general and made very little reference to MySQL in particular.

One of Ashenfelter’s “if you only learn 3 things from this talk” statements was: architect for a data warehouse, but build a data mart. Data marts answer “vertical” questions; each is focused on one narrow business process. But marts should share a consistent view of the data from the warehouse. You can think of a data warehouse as a collection of standardized data marts.

Getting your definitions consistent is important. What’s an order? The salesperson might think of an order as “I sold 59 baseball cards and I got $100,” but the shipping department might send it out in 3 different shipments from 2 different order fulfillment centers. How many “orders” is that?

It’s also important to standardize how the DW represents business policies and practices. For example, is revenue booked at sale or at collection? How do you define “top customer”: someone who buys more than half a million dollars a year, or someone who buys more than once a week? Get these questions answered by the business people so that when they use the DW, they know what they’re getting.

An interesting sidebar: never use anything “meaningful” for a key. Product numbering schemes and SKUs are guaranteed to change, and a merger or acquisition with another company means you’ll have to reassign customer IDs. Recommendation: use an int (not a varchar), which gives you the flexibility to survive the inevitable change.

Ashenfelter described using a star schema (not a snowflake schema) to represent the data. The DW should be centered around Facts, which have Dimensions, but be sure not to normalize your Dimensions or you’ll end up joining 17 different tables in your queries. It may drive traditional relational database engineers crazy, but denormalized data means fewer joins and faster performance. Some extra redundancy is worth that performance boost.

Next, we worked through an example of a DW for Vmeals, a take-out/catering delivery service for businesses, following 6 steps for designing the DW:

  1. Plan the data warehouse design
  2. Create corporate metadata standards
  3. Pick a business process
  4. Determine the grain of the fact table
  5. Detail the dimensions of the facts
  6. Find the relevant facts

Speaking about MySQL in particular, Ashenfelter mentioned that MySQL 4.0 has greatly improved bulk-insert speed, which is important for the E-T-L (Extract-Transform-Load) part of data warehousing. His basic model is to get data in batch from Microsoft SQL Server or Oracle via some sort of dump, do some transformation (for example, to denormalize the data), then load the result into MySQL.

A couple of interesting notes: a staging environment is a good way to provide efficiency and concurrency (folks can still query yesterday’s data while you’re preparing today’s), and it also gives you a hook for validation tests. For example, you could sum all of the January sales and check whether the total matches the total computed yesterday. If it’s July and the January data changed, something is wrong with your source data, and it’s better to flag it for someone to investigate than to release the data to production and give the business folks an inconsistent view.

As the talk started wrapping up, Ashenfelter mentioned several Open Source tools (mostly written in Java) that work with MySQL for data warehousing. For E-T-L, he suggested CloverETL or Enhydra Octopus. For Reporting, he recommended Jasper Reports, jFreeReport, and DataViz. For OLAP tools, he mentioned Mondrian, JPivot, and BEE. For Delivery Frameworks, you could think about using Jetspeed, Webworks, or PHP-Nuke.

Designing and Creating Great Shared Libraries

Theodore Ts’o spoke about Designing and Creating Great Shared Libraries. It was a truly geeky talk, sprinkled with interesting historical trivia and packed with really useful guidelines and real-world examples.

He started out by describing his personal history with shared libraries, including his involvement with Kerberos V5 and the Linux Standard Base. As a motivating example, Ted pointed out a flaw in the ELF shared object model (used, for example, by Linux and FreeBSD): it has no concept of namespaces for the symbols contained in shared objects. You can end up with a real headache if

  • Shared library “A” uses db2
  • Shared library “B” uses shared libraries “A” and db3
  • Application uses shared libraries “A”, “B”, and db4

Oftentimes this manifests itself in core dumps, because conflicting symbols from the different libraries collide with each other.
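
To make the failure mode concrete, here’s a minimal sketch (the library and symbol names are made up, standing in for the colliding db2/db3/db4 symbols):

/* liba.c -- built as liba.so; bundles its own hash_open() */
int hash_open(const char *path) { return 1; /* db2-style behavior */ }

/* libb.c -- built as libb.so; also defines hash_open() */
int hash_open(const char *path) { return 2; /* db3-style behavior */ }

/* app.c -- links against both liba.so and libb.so */
#include <stdio.h>

int hash_open(const char *path);

int main(void)
{
    /* ELF resolves all global symbols in one flat namespace: every
     * call to hash_open() -- including the calls made internally by
     * liba.so and libb.so -- binds to whichever definition the dynamic
     * linker encountered first, so one library silently runs the
     * other's code. */
    printf("%d\n", hash_open("/tmp/data.db"));
    return 0;
}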

Most people understand API (Application Programming Interface) compatibility, which is a source-level issue, but many people don’t think about ABI (Application Binary Interface) compatibility, which is a link-time issue. In addition to keeping all of your C functions around, you’ve also got to make sure that none of their arguments (or return types) change.
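
As an illustration (a hypothetical libfoo, not an example from the talk), here’s a change that is perfectly source-compatible yet still breaks the ABI:

/* libfoo 1.0 public header */
struct foo_options {
    int flags;
};
int foo_init(struct foo_options *opts);

/* libfoo 1.1 public header: recompiling callers still works (the API
 * is effectively unchanged), but binaries compiled against 1.0 pass a
 * struct with the old size and layout, so the library reads garbage
 * -- the ABI is broken. */
struct foo_options {
    long flags;     /* widened field: the layout changed */
    int  verbose;   /* new field: the size changed */
};
int foo_init(struct foo_options *opts);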

From a portability perspective, Ted recommends that you “avoid global variables in shared libraries at all costs.” But in 2003, why care about portability? “There’s a disease going around where people think that all the world is Linux. It used to be that people thought that all the world is VAX, then all the world was Solaris, now all the world is Linux.”

Tangent: Performance-sensitive PIC (position-independent code) libraries are at a minor disadvantage on the x86 chip because there aren’t many general-purpose registers. Ted has noticed a 5% (or more) performance hit in some cases when using -fPIC, because the compiler essentially needs to reserve one of those registers for relocation and can’t use it for algorithm-specific storage.

Another tangent: Try to remain bug-for-bug compatible. For example, the Linux libc (back in the version 4 days) changed at one point so that calling fclose() twice on the same stream would result in a core dump. This was considered a good thing, since calling fclose() twice is wrong to begin with, and it’s better for the programmer to realize this sooner and fix the bug than to have some other mysterious bug appear that’s harder to track down. But a well-known application (Netscape) incorrectly called fclose() twice, and when users upgraded their libc to the next minor release, it started crashing. Whose fault was it: Netscape’s or the libc author’s?

After seeing a live demo of how to build a shared library and link an application against it, Ts’o spent quite a bit of time on a feature called ELF Symbol Versioning, which lets you provide multiple implementations of a function that are automatically selected depending on which version of the shared library the application was linked against. He spoke about some of the differences between the Solaris and Linux implementations (mapfiles vs. the FSF __asm__(".symver ...") extension).
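
On Linux, the GNU extension looks roughly like this (the function and version names here are invented for illustration; the version nodes also have to be declared in a linker version script):

/* Two implementations of compress() live in the library side by side. */
int compress_old(void *buf, int len) { /* ... original behavior ... */ return 0; }
int compress_new(void *buf, int len) { /* ... fixed behavior ... */ return 0; }

/* Applications linked against LIBDEMO_1.0 keep calling compress_old();
 * newly linked applications get compress_new() (@@ marks the default). */
__asm__(".symver compress_old,compress@LIBDEMO_1.0");
__asm__(".symver compress_new,compress@@LIBDEMO_2.0");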

Ts’o warned the audience that this technique should rarely be used. It might be appropriate when you want to preserve bug-for-bug compatibility, or when a poorly-designed API is so enshrined that you can’t change it (e.g. getopt(), the stdio functions, or strtok()).

During the break we chatted about whether the ELF Symbol Versioning feature would work on FreeBSD (which has been using ELF since the 3.0 release). Ts’o suggested that it would definitely work if we were using the GNU ld (which I don’t think we are) or that it might work if the FreeBSD folks had implemented the same functionality into the linker. Neither of us knew the answer, but a guy sitting nearby tried it out and said that it worked for him.

After the break, Ts’o switched gears to talk about How To Do It Right. In brief, he gave the following high-level guidelines:

  1. Use public and private header files. Only expose the parts of your API that you really need to expose.

  2. Use “namespaces” by prefixing all functions with a common string (such as “ext2fs_”).
  3. Avoid exposing data structures. Use opaque pointers and (non-inline) function accessors.
  4. If you must use public data structures, reserve spare data elements for later additions.

    int  spare_int[8];    /* room to add integer fields later */
    long spare_long[8];   /* room to add long fields later */
    void *spare_ptrs[8];  /* room to add pointers later */

  5. If you must use public data structures, never reorder or delete structure fields. Add new fields to the end or use the reserved space.
  6. Use structure magic numbers. At the beginning of each data structure, store a unique 4-byte magic number. The library can then do run-time checking to make sure that the right data structure was passed to the right routine (see the sketch after this list).
  7. Don’t use static variables.
  8. Be consistent about caller vs. callee memory allocation. There are pros and cons both ways, but Ts’o prefers callee allocation.
  9. Consider doing object-oriented programming in C. Simulate data encapsulation with opaque pointers and virtual functions with function pointers, and don’t bother with class inheritance (or use void * pointers, or unions and type variables, if you really need it).
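
Here’s a minimal sketch pulling together items 3, 6, and 9 (all names are invented for illustration, loosely in the spirit of the ext2fs examples from the talk):

/* public header: an opaque handle -- callers never see the fields */
typedef struct io_channel *io_channel;

io_channel io_open(const char *name);
int io_read(io_channel chan, void *buf, int count);

/* private implementation */
#define IO_CHANNEL_MAGIC 0x10C4A11E    /* arbitrary unique value */

struct io_channel {
    int magic;                               /* always the first field */
    int (*read)(io_channel, void *, int);    /* "virtual function" slot */
    /* ... private state ... */
};

int io_read(io_channel chan, void *buf, int count)
{
    if (chan == NULL || chan->magic != IO_CHANNEL_MAGIC)
        return -1;    /* wrong structure passed in; fail loudly */
    return chan->read(chan, buf, count);     /* dispatch to the implementation */
}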

We also saw some case studies of common APIs that were done wrong, such as gethostbyname() and getopt(), and the types of headaches that they cause.

The last part of the talk focused on two topics: plug-ins and the GNU build tool chain. Ts’o gave a bunch of examples of how to use the dlfcn family of functions (dlopen(), dlsym(), and dlclose()) to develop a plug-in model for your application. We also got a high-level overview of autoconf, automake, and libtool which try to make it easier to write portable libraries and applications. It’s a good thing we didn’t spend too much time on these, as they can be extremely complicated beasts. Ts’o reminded us that these tools are designed with portability in mind; he pointed out that he’s seen projects that use these tools, yet only build on Linux!
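
The core of the dlfcn pattern is short; here’s a sketch (the plug-in file name and entry-point symbol are hypothetical):

#include <dlfcn.h>
#include <stdio.h>

int main(void)
{
    /* Load the plug-in at run time. */
    void *handle = dlopen("./plugin.so", RTLD_NOW);
    if (handle == NULL) {
        fprintf(stderr, "dlopen: %s\n", dlerror());
        return 1;
    }

    /* Look up the plug-in's entry point by name. */
    int (*plugin_init)(void) = (int (*)(void)) dlsym(handle, "plugin_init");
    if (plugin_init == NULL) {
        fprintf(stderr, "dlsym: %s\n", dlerror());
        dlclose(handle);
        return 1;
    }

    plugin_init();      /* call into the plug-in */
    dlclose(handle);    /* unload it when we're done */
    return 0;
}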

“Urgent: MacOS X users, please turn off Rendezvous”

As Jeremy pointed out, the wireless network at OSCON was having problems this morning. During the break in the afternoon session, there were little laser-printed signs all around asking people to please disable Rendezvous as it’s causing interference. There were even instructions on how to turn it off!


sudo mDNSResponder stop

Perhaps the “Networking, simplified” motto should be changed to “Networking, all screwed up.”

Introduction to XSLT

Sitting in a small room with about 20 other folks, I’m hoping to learn something about XSL and XSLT. Our instructor for this half-day tutorial is Mike Fitzgerald of Wy’east Communications (whose website appears to be unavailable right now).

XSLT has been around for 3 or 4 years now, but this is the first time I’ve had an opportunity to look at it in any detail.

We started simple, with a basic transformation:


<!-- msg.xml -->
<msg/>

<!-- msg.xsl -->
<stylesheet version="1.0" xmlns="http://www.w3.org/1999/XSL/Transform">
  <output method="text"/>
  <template match="msg">Found it!</template>
</stylesheet>

On the surface, XSLT looks simple and elegant. But things get complicated very quickly. Over the course of the next 3 hours, Mike built upon the basics, teaching us the syntax and concepts involved.

XSLT uses a language called XPath to access or refer to parts of an XML document. I quickly grew tired of all the magic characters that XPath uses: /, //, @, {}, *, ::, [], |, etc. It seems to me that the designers of XPath had a love affair with braces, brackets, and other operators. Instead of some sort of human-readable query language, you end up with stuff that looks like id("foo")/child::para[position()=5]. Haven’t these folks ever heard of something called whitespace?

Even though I tend to think of things procedurally, I really do like the idea of using a declarative language to describe a way of transforming data into presentation. I guess when you’re coding XPath every day, the idea is to keep things as terse as possible; XPath excels at that.

However, when you start using XSLT Functions and Variables, things start to look more & more like a scripting language such as PHP or Perl. Apparently you can’t do everything with the declarative approach.

XSLT also seems very well integrated with other XML-related concepts. You’ve gotta be namespace-savvy to get things right in XSLT.

Overall, it was a very good session. The pace was a little slow for me, but he did a couple of things really well:

  1. Almost every single slide was accompanied by an example. Mike stepped through the source code line-by-line, and then ran the examples live to show us how it all worked.
  2. He handed out CD-ROMs of all of the examples (and 3 or 4 XSLT processors) at the beginning of the talk so we could try the examples right then & there on our laptops.