Michael J. Radwin

Tales of a software engineer who keeps kosher and hates the web.

Building Data Warehouses with MySQL

John Ashenfelter spoke about Building Data Warehouses with MySQL. After surveying the audience with some questions about what database technology people use and how much data they store, he described what he felt was the one and only reason to create a data warehouse: to answer business questions.

The first two-thirds of the talk discussed DW in general and made very little reference to MySQL in particular.

One of the Ashenfelter’s “if you only learn 3 things from this talk” statements was architect for a data warehouse, but build a data mart. Data marts answer “vertical”-type questions. Each are focused on answering one narrow business process. But marts should share a consistent view of the data from the warehouse. You can think of a data warehouse as a collection of standardized data marts.

Getting your definitions consistent is important. What’s an order? The salesperson might think of an order as “I sold 59 baseball cards and I got $100” but the shipping depratment might send it out in 3 different shipments from 2 different order fulfillment centers. How many “orders” is that?

Also important to standardize on how the DW represents business policies and practices. For example, is revenue booked at sale or collection? How do you define “top customer”? Someone who buys more than half a million dollars a year, or someone who buys more than once a week? Gets these questions answered by the business people so when they use the DW they know what they’re getting.

An interesting sidebar: never use anything “meaningful” for a key. Product numberings/SKUs will be guaranteed to change, merger or acquisition with another company means that you’ll have to do customer id reassignments. Recommendation: use an int (not a varchar) which gives you flexibility for the inevitable change.

Ashenfelter described using a Star schema (not a snowflake schema) for representing the data. The DW should be centered around Facts which have Dimensions, but be sure not to normalize your Dimensions or you’ll end up doing joins of 17 different tables for your queries. It may drive traditional relation database engineers crazy, but denormalized data means fewer joins and faster performance. Some extra redundancy is worth that performance boost.

Next, we went through an example of a DW for Vmeals, a take-out/catering delivery service for businesses. We went through 6 steps for designing the DW:

  1. Plan the data warehouse design
  2. Create corporate metadata standards
  3. Pick a business process
  4. Determine the grain of the fact table
  5. Detail the dimensions of the facts
  6. Find the relevant facts

Speaking about MySQL in particular, Ashenfelter mentioned that MySQL 4.0 has greatly improved the speed of bulk insert, which is important for the E-T-L (Export-Transform-Load) part of data warehousing. His basic model is to get data in batch from Microsoft SQL Server or Oracle via some sort of dump, do some transformation (for example, to denormalize the data), then load the data into MySQL.

A couple of interesting notes: using a staging environment is a good way to provide efficiency and concurrency (so folks can still query yesterday’s data while you’re preparing today’s data). It also gives you a hook to do validation tests. For example, you could sum all of the January sales and compare whether or not the total matched what the computed total was yesterday. If it’s July and the data changed, it indicates that something with your source data is wrong, and it’s better to flag it so someone can investigate instead of releasing the data to production and giving the business folks an inconsistent view.

As the talk started wrapping up, Ashenfelter mentioned several Open Source tools (mostly written in Java) that work with MySQL for data warehousing. For E-T-L, he suggested CloverETL or Enhydra Octopus. For Reporting, he recommended Jasper Reports, jFreeReport, and DataViz. For OLAP tools, he mentioned Mondrian, JPivot, and BEE. For Delivery Frameworks, you could think about using Jetspeed, Webworks, or PHP-Nuke.