The Cathedral, The Bazaar, and Apache

A couple of weeks ago I read Eric Raymond’s The Cathedral and the Bazaar: Musings on Linux and Open Source by an Accidental Revolutionary, a collection of essays about Open Source software. Raymond writes quite well for a techie (either that or he has a superb editor), and the book is coherent. I didn’t agree with most of it, but I think it’s important to keep abreast of what other folks are writing about the space.

Despite my general disappointment in the book, Homesteading the Noosphere was quite good. In an essay describing how “ownership” of Open Source projects works, Raymond accurately states the previously unwritten code of behavior. Projects have owners. Contributions are welcome, especially when they’re written well. Project ownership can be transferred. Forking is strongly discouraged, although sometimes necessary as a last resort when the owner won’t accept changes and refuses to relinquish control of the project.

The Homesteading the Noosphere essay has actually prompted me to think a little bit about what’s going to happen with the Apache HTTP Server. The Apache Software Foundation is currently maintaining two separate versions of this product, 1.3.x and 2.0.x (and is also working on 2.1.x). Although the 2.0 server has been stable and “recommended” for over a year now, lots of organizations are still using the 1.3 platform. The ASF would like folks to move to 2.0, but the fact that they’re still making 1.3.x releases indicates that they recognize that migrating to 2.0 is no small undertaking. When there are security fixes (and sometimes features), the changes are made in 2.0 first and then need to be “backported” to 1.3.

But what if maintaining two separate products became too cumbersome and the ASF decided to stop making 1.3.x releases? I’ve wondered privately if any of the organizations that have a substantial investment in Apache/1.3 would want to take over the codebase (i.e. fork it). What would happen to the Apache community if someone decided to make an Apache/1.4 release? If the development was split across two projects, would both lose momentum (and therefore market share)? Would the vast majority of folks stand by the ASF and swallow the complexity of the 2.x server, while a “rogue” bunch of hackers simply caused social turmoil with 1.4 but never really made it successfully as a project? Or vice-versa?

Regardless of technical or social reasons, something called “Apache/1.4” couldn’t really happen without the ASF’s blessing. Although the code is Open Source so you could re-use it for another project, the Apache License is written in such a way that derivative products aren’t allowed to use the name “Apache”. But maybe there could be a Hopi/1.4 or a Mohican/1.4 HTTP server…

As Raymond writes in Homesteading the Noosphere, the natural motivation is to avoid forking unless absolutely necessary. In the case of Apache HTTP Server, there are decent technical and social alternatives to this last resort. So I’d hazard to guess that we’ll never see Apache/1.4.

Instead, we’ll probably see at most two more Apache/1.3 releases before the code is officially declared deprecated (which will probably happen right around the time that Apache/2.1 is released). Folks who have put off the 1.3-to-2.0 migration effort will take a serious look at a 1.3-to-2.1 jump, and the vast majority of them will make the move over the next two years. Sure, there will always be some laggards who are stuck using Apache/1.3.31, but by the end of 2005 their numbers will be so small that they’re not worth mentioning.

MySQL Scaling Pains

Jeremy Zawodny spoke Friday morning about MySQL Scaling Pains. I’m still just waking up, so here are some abbreviated notes.

  • Security administration: don’t just GRANT ALL PRIVILEGES ON *.* TO someuser; think seriously about delegating privileges to separate users.
  • Size limits: the MyISAM default 4GB limit can be modified, you just need to know the magic incantation.
  • Lock contention: consider using InnoDB instead of MyISAM if you have as many readers as writers. MyISAM tends to work fine when you’ve got 90-95% readers and just a few writers (or vice-versa), but you can run into lock contention when there are lots of both. InnoDB doesn’t fix locking problems; it actually introduces some problems of its own.
  • ALTER TABLE is slow. It requires an exclusive write lock on the entire table, and all queries will back up until it finishes. Plan ahead.
  • Disks often tend to be the bottleneck. You can add all of the CPU power in the world and it won’t matter if it’s waiting on a slow disk. Low seek times are more important than high transfer rates. RAID can help. If you have time, benchmark different disk combinations (he suggested a tool called Bonnie++).
  • Load balancers: if you use one, choose the correct algorithm. Sometimes the “least connections” algorithm can make things worse; often a simple “round-robin” algorithm works just great.
  • Handling many connections: setting wait_timeout to a lower value will force idle connections to disconnect. Sometimes this can improve overall efficiency.
  • Data partitioning by servers, i.e. putting 1/Nth of your data on each of N clusters of servers. Instead of a single “users” table, you have 4 different tables (“users_abcdefg”, “users_hijklmn”, “users_opqrstu”, “users_vwxyz”) and the application needs to look at the first letter of the key to figure out which table to query.
  • Full-text search is neat, but it has its limits. First, be sure to use 4.x, not 3.23. Also, it’s not as flexible as other software.
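The data-partitioning scheme described above is easy to sketch. The routing helper below is my own illustration (the talk didn’t show code); the table names follow the hypothetical “users_*” split.

```ruby
# Route a username to the table holding its slice of the data.
# Partition boundaries mirror the hypothetical four-way "users" split.
PARTITIONS = {
  ('a'..'g') => 'users_abcdefg',
  ('h'..'n') => 'users_hijklmn',
  ('o'..'u') => 'users_opqrstu',
  ('v'..'z') => 'users_vwxyz',
}.freeze

def users_table_for(username)
  first = username.downcase[0, 1]   # first letter of the key
  PARTITIONS.each do |range, table|
    return table if range.include?(first)
  end
  raise ArgumentError, "no partition for #{username.inspect}"
end
```

The application would then build its queries against users_table_for(name) instead of a single “users” table.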

Zawodny also inserted a small Yahoo! advertisement in his slides; Yahoo! is hiring engineers. His incentive is twofold: (1) smart folks tend to go to OSCON, so it’s a targeted audience, and (2) if you send him (or me) your resume, we can get the employee referral bonus if you end up getting hired.

Why XML Hasn’t Cured Our Ills or Saved the World


After lunch and a little bit of work-related email, I went to Randy Ray‘s Why XML Hasn’t Cured Our Ills or Saved the World (slides). The talk centered around five things Ray thinks we do wrong with XML:

  1. People are too quick to use XML. You have to ask yourself if it’s really necessary, or if it’s just for buzzword-compliance.
    • If there is no reason other than the fact that there are XML parsers, then there is probably a simpler solution.
    • If there is only a single consumer of the data, there may be a more economical solution.
  2. People are too slow to use XML.
    • Plan ahead: will there be more than one customer of the data?
    • If another part of the system is already using XML for a more “legitimate” task, why not use XML for other things, too (e.g. configuration data)?
    • It isn’t always an extra cost. If the data format (and therefore the parser) would be sufficiently complex, maybe using an XML parser would be easier.
  3. Lack of cooperation or sharing.
    • This is not often due to malice; perhaps it’s the lack of a central authority. Who moderates DTD repositories? Registries on xml.com and xml.org contain outdated information, and UDDI is too business-centric.
    • Example: it’s difficult to find a schema for recipes. Ray had to wade through 3 pages of Google results to eventually find RecipeML.
    • Intellectual property issues: for example, Microsoft hasn’t opened up the XML formats for Office 2003. Compare to open formats like DocBook.
  4. Misunderstanding the application of XML.
    • XML is the “NetPBM” of generic data. (NetPBM broke new ground in image file format transformations by reducing an N × M problem to N + M.)
    • People think that XML is only for “document” data.
  5. People want to make XML hard.
    • Tough topics make money. How can businesses sell books/tools/software/training/services when customers think that XML is “easy”? There’s a vested interest in making it complicated.

In conclusion, Ray mused that no one technology is (yet) a universal solution, and XML is no different when it comes to data formats. His charge to the audience: just think about XML before using (or not using) it. Self-described experts don’t necessarily have all the answers.

Ruby for Perl Programmers

I stuck around for local software guy Phil Tomson’s Ruby for Perl Programmers talk. This session was more technical, with the first code example showing up on the 4th slide.

Phil’s slides are online, so I won’t attempt to replicate them here.

Something listed as a “gotcha” actually seems like a feature to me. Since all variables hold references to objects, you have to explicitly call .dup to copy an object. It’s more Java-like than Perl-like, but it probably ends up being faster, since you only make copies when you explicitly ask for them.
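Here’s a minimal sketch of the reference semantics in question:

```ruby
a = "oscon"
b = a          # b and a refer to the same String object
b << "!"       # mutating through b is visible through a

c = a.dup      # explicit copy: a new String with the same contents
c << "?"       # mutating c leaves a (and b) alone
```

After this runs, a and b are both "oscon!" (one object), while c is a separate "oscon!?" string.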

The Power and Philosophy of Ruby

Yukihiro Matsumoto spoke about The Power and Philosophy of Ruby on Thursday morning. The talk was all philosophy, no code. Very entertaining.

We started off by discussing natural languages and the Tower of Babel, with a comparison of Japanese and its use of ideograms versus English. Matsumoto said that he was heavily influenced by the science fiction novel Babel-17. In some part, the power of the “super-language” in this book inspired him to create the Ruby programming language.

He spoke about the importance of choosing good names; those that are short and well-chosen usually convey meaning very easily. He also spoke about the importance of the machine making it easier for humans (Moore’s Law, evolution of programming languages to higher-level concepts). He feels it’s important for programming languages to cause the programmer as little stress as possible, and pointed out that one metric of a good programming language is that the programmer still has time to go out and have fun.

However, Matsumoto made it clear that simplicity is not a goal of Ruby. After all, human thoughts are not simple, and programs are essentially complex things. Rather, the design adheres to the principle of least surprise. If some aspect of the language meets your expectation, then it’s achieving its goal. Succinctness is highly valued because Matsumoto believes it leads to productivity and efficiency.

In Ruby, like in Perl, There’s More Than One Way To Do It, but the language can encourage one way. For example, Ruby does allow global variables, but you have to put a $ character before globals. Since too many $ are considered ugly, it discourages use of globals. “Dangerous” methods in Ruby have a ! in their name, for example sort and sort!. The “dangerous” methods might be faster, but they have side-effects, and the ! character reminds you to be careful.
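A quick sketch of both conventions just described:

```ruby
$hit_count = 0          # globals work, but the $ sigil discourages overusing them

list = [3, 1, 2]
sorted = list.sort      # returns a new sorted array; list is untouched
list.sort!              # "dangerous" variant: sorts list in place
```

After this runs, both sorted and list are [1, 2, 3], but only the ! method modified its receiver.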

Perl Lightning Talks

Wandering around after lunch, I stopped by the Perl Lightning Talks (slides) session. I was delighted to hear Autrijus Tang‘s five-minute rap These are 1% of my favourite CPAN… in Chinese, followed by an English translation sung to the tune of These are a few of my favorite things… from The Sound of Music.

It was incredible. Standing ovation.

Allison Randal’s lightning talk was a parody of Arlo Guthrie’s Alice’s Restaurant. “You can get anything you want / in Perl 6 development.” Clever, but Autrijus is a hard act to follow.

Also notable was Dave Rolsky’s talk on DateTime. Dave, like my friends Gabriel and Rachel, is from Minnesota.

OSCON Wednesday morning

I bounced around on Wednesday between a bunch of different sessions. In the morning, I did some last-minute touch-ups on my slides, then caught the tail end of John Coggeshall’s Interfacing Java / COM with PHP. After my talk on One Year of PHP at Yahoo! (slides), I grabbed some lunch in the speaker’s room. Shane asked me to collect some feedback from my co-workers about Komodo since they’re starting to think about what might go into their 3.0 release.

I showed up a little bit late for Adam Trachtenberg’s Introduction to Web Services in PHP: SOAP versus REST talk, but the room was packed so I couldn’t find a seat. So I stuck my head inside Zak and Monty’s Guided Tour of the MySQL Source Code to catch an updated version of what had changed since the users conference in April.

I also checked out Shane’s Introduction to PEAR talk, but the conference room had run out of seats again. Too bad they didn’t pick a bigger room for the PHP talks this year.

Tim O’Reilly: Paradigm Shift

Tim O’Reilly gave this morning’s keynote address, “The Open Source Paradigm Shift”. The talk was reminiscent of last year’s Watching The Alpha Geeks keynote at ApacheCon, although now he is able to say the phrase “paradigm shift” with a straight face.

Largely the talk was trying to make the case that we shouldn’t try to think about Open Source software in the traditional commercial software business model. Instead, we should recognize that the software (to some extent) has become a commodity, just like hardware has become a commodity. The true value in Open Source is the businesses that grow up around it. For example, nobody pays for Sendmail and Apache, yet thousands of ISPs make money from providing web/email hosting services for their customers.

His charge to the audience was to embrace the fact that Open Source software has become a commodity, and to start to think of it (and all of the services that have grown up around it) as a platform. If we can develop services that support collaboration and end-user customization, and the data flows freely enough, we’ll somehow find a way to feed our families.

Building Data Warehouses with MySQL

John Ashenfelter spoke about Building Data Warehouses with MySQL. After surveying the audience with some questions about what database technology people use and how much data they store, he described what he felt was the one and only reason to create a data warehouse: to answer business questions.

The first two-thirds of the talk discussed DW in general and made very little reference to MySQL in particular.

One of Ashenfelter’s “if you only learn 3 things from this talk” statements was: architect for a data warehouse, but build a data mart. Data marts answer “vertical” questions; each is focused on one narrow business process, but marts should share a consistent view of the data from the warehouse. You can think of a data warehouse as a collection of standardized data marts.

Getting your definitions consistent is important. What’s an order? The salesperson might think of an order as “I sold 59 baseball cards and I got $100”, but the shipping department might send it out in 3 different shipments from 2 different order fulfillment centers. How many “orders” is that?

It’s also important to standardize on how the DW represents business policies and practices. For example, is revenue booked at sale or at collection? How do you define “top customer”: someone who buys more than half a million dollars a year, or someone who buys more than once a week? Get these questions answered by the business people so that when they use the DW they know what they’re getting.

An interesting sidebar: never use anything “meaningful” for a key. Product numbering/SKU schemes are guaranteed to change, and a merger or acquisition with another company means you’ll have to do customer id reassignments. Recommendation: use an int (not a varchar), which gives you flexibility for the inevitable change.

Ashenfelter described using a star schema (not a snowflake schema) for representing the data. The DW should be centered around Facts which have Dimensions, but be sure not to normalize your Dimensions or you’ll end up doing joins of 17 different tables for your queries. It may drive traditional relational database engineers crazy, but denormalized data means fewer joins and faster performance. Some extra redundancy is worth that performance boost.

Next, we went through an example of a DW for Vmeals, a take-out/catering delivery service for businesses. We went through 6 steps for designing the DW:

  1. Plan the data warehouse design
  2. Create corporate metadata standards
  3. Pick a business process
  4. Determine the grain of the fact table
  5. Detail the dimensions of the facts
  6. Find the relevant facts

Speaking about MySQL in particular, Ashenfelter mentioned that MySQL 4.0 has greatly improved the speed of bulk inserts, which is important for the ETL (Extract-Transform-Load) part of data warehousing. His basic model is to get data in batches from Microsoft SQL Server or Oracle via some sort of dump, do some transformation (for example, to denormalize the data), then load the data into MySQL.

A couple of interesting notes: using a staging environment is a good way to provide efficiency and concurrency (folks can still query yesterday’s data while you’re preparing today’s). It also gives you a hook for validation tests. For example, you could sum all of the January sales and compare the total against what was computed yesterday. If it’s July and January’s data changed, something in your source data is wrong, and it’s better to flag it so someone can investigate instead of releasing the data to production and giving the business folks an inconsistent view.
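That sanity check might look something like this in outline; the method name and data shapes here are my own invention, purely to illustrate the idea:

```ruby
# Compare a month's total in the staging data against the figure
# computed in a previous load before promoting the batch to production.
def check_month_total(staged_rows, month, expected_total)
  total = staged_rows
            .select { |row| row[:month] == month }
            .sum    { |row| row[:amount] }
  # Historical data shouldn't drift; flag it for investigation
  # rather than silently publishing an inconsistent view.
  raise "#{month} total drifted: got #{total}, expected #{expected_total}" unless total == expected_total
  total
end
```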

As the talk started wrapping up, Ashenfelter mentioned several Open Source tools (mostly written in Java) that work with MySQL for data warehousing. For E-T-L, he suggested CloverETL or Enhydra Octopus. For Reporting, he recommended Jasper Reports, jFreeReport, and DataViz. For OLAP tools, he mentioned Mondrian, JPivot, and BEE. For Delivery Frameworks, you could think about using Jetspeed, Webworks, or PHP-Nuke.