Category Archives: Computer Science

Commercial Graph: A Map of Financial Relationships

I’m speaking today about Intuit’s Commercial Graph at the Strata + Hadoop World Conference. Slides: Commercial Graph: A Map of Financial Relationships (pptx format).

Abstract

Imagine the social graph where personal relationships are replaced by commercial relationships based on real financial data. Imagine the possibilities for small businesses to grow, connect, transact and prosper.

Intuit is uniquely qualified to achieve just this. We are entrusted with the collective data of 50 million consumers and small businesses. It is a unique pool of data that covers the financial spectrum – ranging from individual purchase history to business inventories.

At Intuit, we are building the Commercial Graph with the consumer and small business data from products like Mint.com, Quicken, and QuickBooks.

We take millions of user-entered, and hence unstructured, business descriptions and billions of transactions, then apply Hadoop-based deduplication algorithms for normalization and machine learning for categorization. To better understand the graph, we compute metrics such as connected components, centrality, and commercial PageRank.

We will examine several applications of the commercial graph, including finding more customers like your best customers, optimizing your vendors, and surfacing relevant offers and recommendations to help our customers make and save money.

A deep dive into the technical architecture will cover the use of Giraph as a Hadoop-based, large-scale graph processing platform and Neo4j as a real-time graph datastore.
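To make the graph metrics in the abstract a little more concrete, here is a minimal single-machine sketch of two of them, connected components and plain PageRank, on a toy payment graph. It is not Intuit's implementation: the talk describes running these at scale on Giraph, and the abstract does not define how "commercial PageRank" differs from the standard algorithm. The edge list, business names, and function names below are all made up for illustration.

```python
from collections import defaultdict

# Hypothetical payment edges (payer, payee); names are invented for illustration.
EDGES = [
    ("cafe", "coffee_roaster"),
    ("cafe", "bakery"),
    ("bakery", "flour_mill"),
    ("coffee_roaster", "flour_mill"),
    ("bike_shop", "parts_supplier"),
]


def connected_components(edges):
    """Group nodes into weakly connected components using union-find."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for src, dst in edges:
        parent[find(src)] = find(dst)

    groups = defaultdict(set)
    for node in list(parent):
        groups[find(node)].add(node)
    return sorted(groups.values(), key=lambda g: sorted(g))


def pagerank(edges, damping=0.85, iterations=50):
    """Standard PageRank by power iteration (not the talk's commercial variant)."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = defaultdict(list)
    for src, dst in edges:
        out_links[src].append(dst)

    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Rank held by nodes with no outgoing edges is spread evenly over all nodes.
        dangling = sum(rank[n] for n in nodes if not out_links[n])
        base = (1.0 - damping + damping * dangling) / len(nodes)
        new_rank = {n: base for n in nodes}
        for src in nodes:
            for dst in out_links[src]:
                new_rank[dst] += damping * rank[src] / len(out_links[src])
        rank = new_rank
    return rank


if __name__ == "__main__":
    for component in connected_components(EDGES):
        print(sorted(component))
    for name, score in sorted(pagerank(EDGES).items(), key=lambda kv: -kv[1]):
        print(f"{name}: {score:.3f}")
```

In this toy graph the two supply chains never transact with each other, so they fall into separate components, and the business that receives payments from the most well-connected payers ends up with the highest rank.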

Software Engineer, Java – Click Fraud Prevention

Want to build something that hunts down the bad guys and puts ‘em out of business? Got experience building complex systems in Java? Fraudwall Technologies has the job for you.

We’re looking for engineers at all experience levels who want to help build a massive data processing and modeling pipeline, using cutting-edge machine learning and network forensics. You’ll be writing code that will make real-time decisions to prevent click fraud, and there’s going to be a fire hose of data coming at you.

This particular job comes with as much responsibility as you can handle. You won’t just be writing code; you’ll be doing design, architecture, implementation, testing, support, and more. Passion, talent, and raw brains are more important than tons of industry experience.

Required experience:

* 3-5 years of software development in Java (top-notch C++ and C# engineers can apply, too)

* Superb understanding of data structures and algorithms

* Effective communication skills: you’ll have to be able to fluently communicate with modelers/analysts, business people, and other coders

* Experience with Unix/Linux, and relational databases such as MySQL or Oracle

* BS or MS in Computer Science or equivalent

Desirable experience:

* Machine learning, information retrieval, TCP/IP internals

* Java frameworks: Hibernate, Servlets, Jakarta Commons

* Proficiency with scripting languages such as Python or Perl

About the company:

Fraudwall Technologies provides advertising networks and advertisers with a pioneering solution for identifying click fraud. Fraudwall combines cutting edge science with the aggregation of data and characteristics from networks, search engines, and advertisers into one complete scalable solution.

Fraudwall values honesty and integrity in dealing with each other and with our partners and customers. We offer competitive salaries, 401K, stock options, and health, dental, and vision plans. And of course, we provide an opportunity to work with world-class fraudfighters, systems builders, and serial entrepreneurs.

All positions are for our office in Palo Alto, California.

Send your resume to michael.radwin@fraudwall.net

Threads considered harmful

In the past month I’ve seen at least 3 messages on the development email lists at work asking questions about developing multi-threaded applications. From a software engineering standpoint, this troubles me.

I’ve always thought that multi-threaded apps in C/C++ are simply too difficult for most engineers to understand. There’s too much non-determinism, too many race conditions, and too few language-level constructs to keep yourself from screwing up.

This isn’t to say that some engineers can’t figure it out; it’s just that most engineers can’t. I’ll borrow a diagram from Ousterhout to illustrate this point:

What's Wrong With Threads?

John Ousterhout, Why Threads Are a Bad Idea (for most purposes), 1996. PDF slides from USENIX 1996 talk (local mirror).

I’ve been reading The Art of UNIX Programming by Eric Raymond over the past few weeks and it appears that he agrees with me. He avoids the Dijkstra-esque pun on threads being harmful and instead prefers the equally provocative title Threads — Threat or Menace?

My attitude about threads in Java is different because the language has supported the concept of threads since day one. It’s still tricky to do threads correctly in Java, but not as painful as it is in C++.
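For readers who haven’t been bitten yet, here is a minimal sketch of the kind of discipline shared state demands. It is written in Python rather than C++ or Java purely for brevity, and the Counter class and run_workers function are hypothetical names for illustration. The increment is a read-modify-write; the lock is what keeps two threads from interleaving in the middle of it and losing an update.

```python
import threading


class Counter:
    """A shared counter whose increment is guarded by a lock."""

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def increment(self):
        # Without the lock, the read-modify-write below could interleave
        # between threads and silently lose updates.
        with self._lock:
            self._value += 1

    @property
    def value(self):
        with self._lock:
            return self._value


def run_workers(num_threads=4, increments_per_thread=10_000):
    """Hammer one Counter from several threads and return the final value."""
    counter = Counter()

    def work():
        for _ in range(increments_per_thread):
            counter.increment()

    threads = [threading.Thread(target=work) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value


if __name__ == "__main__":
    print(run_workers())
```

The annoying part is that removing the lock may still appear to work on most runs, which is exactly the non-determinism that makes these bugs so hard to track down.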

XML for Makefiles?

XML hasn’t cured our ills or saved the world, but people keep using it for absurd purposes anyway.

I finally took a quick look at Apache Ant today to see what all the fuss is about. Apparently with some additional components you can actually get Ant to build C/C++ code.

However, compare this build.xml for Ant:


<?xml version="1.0"?>
<project name="Hello" default="hello" basedir=".">
  <taskdef resource="cpptasks.tasks"/>
  <taskdef resource="cpptasks.types"/>
  <target name="hello">
    <cc name="gcc" outfile="hello">
      <fileset dir="." includes="hello.c"/>
      <compilerarg value="-O2"/>
    </cc>
  </target>
</project>

with this Makefile for gmake:


hello: hello.c
	gcc -O2 $< -o $@

I think I’ll stick with gmake for now.

How to Be a Programmer

I stumbled across How to Be a Programmer, a 40-page paper by Robert L. Read, a principal engineer at Hire.com.

It’s a relatively good paper so I’d recommend it to anyone who’s new to the field or is a college student considering a career in Software Engineering. The distinction between Computer Science and Software Engineering, while subtle, is an important one. This paper focuses more on the Software Engineering side of things, spending a good 50% of the time discussing interpersonal skills and how to be effective working with your team.

The paper does need some polishing, however. A simple grammar checker would catch a bunch of the mistakes that interrupt the flow.

This reminds me a little bit of a great lecture I heard by Leslie Pack Kaelbling back in 1996 about why she loves programming. Like Read, Kaelbling believes that debugging is the most important part of programming, but she spins it slightly differently.

In short, debugging is like detective work. You’ve got a problem that you need to solve, but it’s not obvious what the solution is. There are little hints here and there, and you begin to investigate each one. Each clue brings you closer and closer to the solution, but sometimes you realize that you just spent the last 6 hours going down a path that led nowhere, and you need to start over again. But at each moment, you always feel like you’re making forward progress.

As a consequence, debugging becomes an all-engrossing activity. It’s impossible to walk away from your desk when you’re just 5 minutes away from solving the mystery and fixing the bug! Of course, 20 minutes later, you still feel like you’ll get it nailed in another five.

MySQL Users Conference 2003

The MySQL Users Conference 2003 is running from April 10 – 12 in San Jose, CA. I was nearby in Sunnyvale for work on Tuesday and Wednesday this week, so I stuck around a day longer than my usual LAX-SJC travel schedule to catch the beginning of the conference.

Thanks to Zak for all of his hard work organizing the show. The first day was great; I’m sorry I’ll be missing the rest of it.

The State of the Dolphin Address

David Axmark and Monty Widenius, creators of MySQL (and co-founders of MySQL AB), kicked off the event with “The State of the Dolphin Address.”

The first 15 minutes of the presentation was all bragging — they listed off some big customers (such as Yahoo! and Slashdot), awards they had won, and some notable events in the lifetime of the product and company. Axmark takes great pride in the fact that Oracle introduced a MySQL migration kit in 2001.

Speaking a little bit about MySQL AB, Axmark indicated that they now have 12 full time engineers working on the server, and dozens of customer support folks. They’ve been making money via commercial licenses (for companies that don’t want to GPL their code), and also from selling support, training, certification and consulting. The recently-introduced MySQL Certification program costs $200 (with a $50 discount until this fall).

As a product, MySQL has a variety of features. Aside from supporting “an extended subset” of the ANSI SQL89 standard, they support ACID transactions, User Defined Functions (unfortunately not the same thing as Stored Procedures), and a handful of SQL extensions (such as SELECT … LIMIT). Client interfaces are available in over a dozen programming languages and operating systems.

It also provides about 5 different storage engines (MyISAM, InnoDB, Hash/InMemory, BerkeleyDB, etc.) which allow different tradeoffs depending on the application’s needs. For example, if you need fast row-level locking, you should pick InnoDB and recognize that there will be some extra overhead on inserts.
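For anyone who hasn’t used the SELECT … LIMIT extension mentioned above, here is a rough sketch of what it does. To stay self-contained it runs against Python’s built-in sqlite3 module, which happens to accept the same clause, rather than against a MySQL server; the customers table, the third row, and the first_customers function are hypothetical.

```python
import sqlite3


def first_customers(limit):
    """Return the first `limit` customer names, ordered by insertion id."""
    conn = sqlite3.connect(":memory:")  # throwaway in-memory database
    conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany(
        "INSERT INTO customers (name) VALUES (?)",
        [("Yahoo!",), ("Slashdot",), ("Example Corp",)],
    )
    # LIMIT caps the number of rows returned, which is handy for paging
    # through results on a web page.
    rows = conn.execute(
        "SELECT name FROM customers ORDER BY id LIMIT ?", (limit,)
    ).fetchall()
    conn.close()
    return [name for (name,) in rows]


if __name__ == "__main__":
    print(first_customers(2))
```

The ORDER BY matters: without it, which rows survive the LIMIT is up to the database.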

Axmark also bragged a bit about the eWeek benchmarking tests which compared MySQL, Oracle9i, and a handful of other relational databases using JDBC drivers in a web server environment on Microsoft Windows. The MySQL performance curve (in terms of web pages per second and latency) matched Oracle’s and outperformed all others.

Lastly, the two co-founders gave a high-level overview of the various server versions (3.23, 4.0, 4.1, 5.0) and some new interesting features coming soon.

P4100100.JPG

Schmoozing

After the keynote, I grabbed coffee and a pastry and chatted a bit in the hallway with Rasmus and Zak. Zak introduced me to Sascha (the one from Utah) and Monty. No business cards, just a few handshakes.

Someone (not the person in the picture) asked Rasmus a question about using the PHP mail() function to send hundreds of thousands of messages.

I was tickled to see Brad from Zend; I saw him in Israel just a couple of weeks earlier.

P4100106.JPG

Using MySQL Replication in Large Scales

I stepped into Jeremy’s standing-room-only talk on “Using MySQL Replication in Large Scales.”

Being a MySQL novice, I didn’t understand much of the talk. It’s always a neat experience to immerse yourself in a technical environment where everyone around you knows more than you do. It’s a good way to pick up a bunch of ideas. There were a ton of questions posed by the audience during the talk; it’s rare to see this high a level of interaction with an audience this large.

Aesthetic note: Jeremy finally switched his slide colors from white-on-blue to the more boring (but easy to read) black-on-white.

Lunch

Lunch was pretty good. Lots of vegetarian options. I sat at a table full of Yahoos and Brian Aker. It started to rain, so we all scrambled inside. We went to hear the talk about Lufthansa Systems porting MySQL to NetWare. Novell is desperate to remain relevant, and it looks like they’re trying to embrace Open Source as a way to stay alive.

A Guided Tour of the MySQL Source Code

Monty and Zak’s talk on A Guided Tour of the MySQL Source Code was a great introduction to a codebase I’ve never read before. The 5.0 source code became available via BitKeeper just a few days ago.

Unfortunately, the talk was plagued by technical difficulties. The LCD projector just wouldn’t cooperate with the laptop. Zak had a copy of the presentation on a floppy disk, but nobody else in the room had a laptop that could read it. Bummer.

OSCON 2003 registration

It looks like the O’Reilly folks have finally posted the abstract for my One Year of PHP at Yahoo! talk I’ll be giving this summer in Portland, Oregon.

I filled out the speaker registration page today and picked some tutorials to attend. Here’s what I’ll be going to:


- Tutorial
  Session ID: 3959
  Title: Introduction to XSLT
  Date: 07/07/2003
  Time: 8:45am to 12:15pm
  Location: Columbia

- Tutorial
  Session ID: 4149
  Title: Designing and Creating Great Shared Libraries
  Date: 07/07/2003
  Time: 1:45pm to 5:15pm
  Location: Willamette

- Tutorial
  Session ID: 3982
  Title: Building Data Warehouses with MySQL
  Date: 07/08/2003
  Time: 8:45am to 12:15pm
  Location: Salon H

On Monday afternoon I’ll probably bounce back and forth between Theodore Ts’o’s “Designing and Creating Great Shared Libraries” and Bradley M. Kuhn’s “The GNU General Public License for Developers and Businesspeople.”

Instead of registering for something on Tuesday afternoon, I think I’ll explore Portland. I’ve never been there before.

Early Bird registration is now open (through May 23rd) at http://conferences.oreillynet.com/os2003/

Upgrade my servers? Yeah, right.

In software engineering, laziness is a positive attribute. If one can accomplish the same task in 3 lines of code instead of 30, a good engineer opts for the 3-line version. That’s why libraries of code are so popular.

Engineers are also risk-averse. Every change you make to the system can possibly de-stabilize it, so engineers like to leave a running system alone. Fred Brooks writes in The Mythical Man-Month that fixing a defect has a substantial (20 to 50 percent) chance of introducing another. Two steps forward, one step backwards.

But laziness and risk-aversion can be really negative attributes. How can you ever make any progress if you never touch the system? What if WordPerfect 5.1 was still the state of the art in 2003? We’d be missing out on a decade of improvements like WYSIWYG.

Consider the hypothetical case of the guy who’s trying to get the other 599 engineers at the company to upgrade their web servers to version N, when the vast majority of folks are still running version M.

If I’m happily running version M, what’s my incentive to upgrade? Sure, the guy who maintains the web server says it’s got some great new features, is faster, gives you some better management tools, and fixes a couple of bugs. But I don’t have time to skim the README to see if any of those features would be useful to me. Version M seems just fine to me, and something could go wrong if I go to version N.

Most importantly, senior management does not require that I pay any attention to the guy who maintains the web server. Even if I procmail all of the web server guy’s messages into /dev/null, I can still get a good review at the end of the year just for keeping my crappy property up and running.

The bummer for the guy who works on the web server is that he also happens to be one of the folks who spent the past 2 years trying to improve development process at the company. He helped build a software package-management tool that can tell you in near-realtime what versions of what software are installed on what servers. And when he checks the stats, he finds out that a lot of folks are running really old versions of the web server: versions J, K, and L. Getting people to upgrade to version N is going to be even more difficult.

Maybe this explains why most of his co-workers are still running Netscape 4.08.