Technical Interview: movie SQL schema design

When interviewing candidates, sometimes we need to know if they grok SQL. Here’s a simple interview question that gives me a sense of whether someone has actually created a schema from scratch, vs. always using a schema that someone else created.

Setup

We’re going to design a movies database. Each movie has a title and year, one (and only one) director, and some number of actors. Actors can star in multiple movies. Directors can direct multiple movies. Some movies have the same title, such as Ocean’s Eleven (the 2001 version, directed by Steven Soderbergh, starred George Clooney, Brad Pitt, Matt Damon, Julia Roberts, and many others, while the 1960 version was directed by Lewis Milestone and starred Frank Sinatra, Dean Martin, and Sammy Davis Jr).

The schema should be normalized enough to avoid duplicating strings too much, and also to be able to efficiently answer questions like these two:

  • Who acted in Fight Club (1999)?
  • What are the 10 most recent movies that George Clooney starred in?

Questions

Use your judgment about what makes a good answer in terms of what to name the tables and what keys they should have. To be consistent, we’ll ask every candidate these 3 questions:

  1. How many tables?
  2. What are the names of the tables?
  3. What are the names of each column in those tables? What are the types of each column?

When designing the schema, remind the candidate that we’re not looking to over-design the perfect, most flexible schema. We want the simplest possible schema that answers questions like the two aforementioned ones. We can mention that, if time permits, we’ll ask the candidate to actually write the query for one of them.

Schema design

One possible answer (and there are many others that are acceptable):

movie
--------
id (int PK)
title (varchar)
year (int)
director_id (int)
 
person
----------
id (int PK)
name (varchar)
 
movie_actor
-----------
actor_id (int)
movie_id (int)
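Spelled out as DDL, that answer might look something like the following. We wouldn’t require the candidate to produce exact DDL; this is just one plausible rendering, with roughly MySQL-flavored types (the VARCHAR lengths and constraint syntax are my own choices, not part of the question):

```sql
CREATE TABLE person (
    id   INT PRIMARY KEY,
    name VARCHAR(255)
);

CREATE TABLE movie (
    id          INT PRIMARY KEY,
    title       VARCHAR(255),
    year        INT,
    director_id INT REFERENCES person(id)
);

-- join table for the many-to-many actor/movie relationship
CREATE TABLE movie_actor (
    actor_id INT REFERENCES person(id),
    movie_id INT REFERENCES movie(id),
    PRIMARY KEY (actor_id, movie_id)
);
```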

Sometimes candidates will get excited by the different roles (actors vs directors) and want to create a role table. Sometimes they’ll go to town on the person attributes (date of birth, splitting first and last names into separate columns). Some movie buff candidates will also point out that in real life, movies can have multiple directors (for example, The Matrix was directed by the Wachowski Brothers).

These are outside of the artificially simplified constraints we specified earlier, but can be fine if the candidate is really good at relational database design and fast at articulating their thoughts. But the candidate can also get distracted, which is a good simulation of what often happens with real-world business problems. Can the candidate focus on only the essentials?

If the candidate gets really distracted, you can invoke the YAGNI principle. This could be an opportunity to ask the candidate to share an anecdote about when they had to apply YAGNI.

Query

If the candidate is doing well, we ask them to write the SQL for either or both of the aforementioned queries.

What are the 10 most recent movies that George Clooney starred in?

SELECT m.title, m.year
FROM movie m, movie_actor ma, person p
WHERE p.name = 'George Clooney'
AND ma.actor_id = p.id
AND ma.movie_id = m.id
ORDER BY m.year DESC
LIMIT 10;

Again, there are many correct solutions (such as using an explicit JOIN instead of the implicit comma-style join we’ve used above). And yes, we know that Oracle doesn’t support the LIMIT clause, so you can use a nested query and take the first ten ROWNUMs. The goal here isn’t to get into a vendor-specific SQL syntax discussion, but to see if they can write a query that uses more than one table.
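For reference, an explicit-JOIN rendering of the same query (equivalent in result to the comma-style version above) might look like this:

```sql
SELECT m.title, m.year
FROM person p
JOIN movie_actor ma ON ma.actor_id = p.id
JOIN movie m ON m.id = ma.movie_id
WHERE p.name = 'George Clooney'
ORDER BY m.year DESC
LIMIT 10;
```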

“Unique” Random Numbers technical interview question

Dan Karp used to ask this question at Zimbra. I’ve always loved it.

1. warmup

Given this API from the standard java.lang.Math class in the JDK:

/**
* Returns a double value with a positive sign,
* greater than or equal to 0.0 and less than 1.0.
* Returned values are chosen randomly with
* uniform distribution from that range.
*/
public static double random(); // in java.lang.Math

Your task is to implement a random integer in the range [0,999]. Don’t overthink it; this is just a warmup.

Solution:

public class Foo {
    public static int randomInt1000() {
        return (int) (Math.random() * 1000);
    }
}

The idea is to truncate a number like 123.99748 to 123 by casting the double to an int. There’s no need to use Math.floor(); that just wastes a function call and does some extra floating-point arithmetic.

Some candidates will want to avoid hard-coding the 1000 constant, and will create a function signature like this:

public static int randomInt(int N);

Great. Either answer is totally fine.

2. introduce the idea of unique-ness

Things are about to get harder.

If I call your Foo.randomInt1000() function a bunch of times, I’ll get values like this:

316, 675, 22, 671, 724, 394, 31, 177, 997, …, 82, 671, 810, 433, …

If you call the function enough times, of course you’ll see the same number more than once.

Create a new function called uniqueRandomInt1000() that returns a random integer in the range [0,999] inclusive but doesn’t return duplicates. Your function will need to keep some state, and it’s up to you how to manage that state.

We’re looking for (1) a correct solution to the problem, and (2) an efficient solution. Correctness is more important than efficiency. But efficiency matters, too.

Edge case: if the caller calls uniqueRandomInt1000() more than 1000 times, the implementation needs to signal in some way that there are no more unique random numbers that can be returned because every number in the range [0,999] has been returned to the caller. Either return -1 or throw an Exception.

3.0. a naive, correct, but O(unbounded) solution

Keep an array of booleans, a java.util.BitSet, or a Set&lt;Integer&gt; of every number you’ve returned as an instance variable in your class. Every time the method gets called, generate a random integer in the range [0,999] and check your instance variable to see whether that number has already been “seen” by the caller. If so, generate another random integer in [0,999] and repeat until you find an “unseen” number. Mark that number as seen and return it to the caller.
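A minimal sketch of this naive approach (the class and method names here are my own, not part of the question):

```java
import java.util.BitSet;

// Naive approach: keep retrying random numbers until we hit one
// we haven't returned yet. Correct, but the expected number of
// retries grows as the pool of unseen numbers shrinks.
public class NaiveUniqueRandom {
    private final BitSet seen;
    private final int n;
    private int count = 0;

    public NaiveUniqueRandom(int n) {
        this.n = n;
        this.seen = new BitSet(n);
    }

    public int nextInt() {
        if (count == n) {
            return -1; // every number in [0, n-1] has been returned
        }
        int candidate;
        do {
            candidate = (int) (Math.random() * n);
        } while (seen.get(candidate));
        seen.set(candidate);
        count++;
        return candidate;
    }
}
```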

3.1. a correct, O(N^2) solution using a List

The key idea here, which avoids looping, is to generate a random index into an ever-smaller range of numbers, and use that index to select a number from a LinkedList or an ArrayList.

In the ctor you populate a List with 1000 elements – every number from 0 to 999. That’s O(N) time.

Then, in the function, you generate a random index in the range [0, size()-1]. size() is initially 1000, but shrinks over time as you remove() elements from the List.

get() followed by remove() is always O(N). With an ArrayList, you get constant time indexing into the position you want to retrieve (get()), but then you spend linear time copying elements over to compact the list (remove()). With a LinkedList you get linear scanning to move to the node at a particular index (get()), and then you get constant time to tidy up the list to remove that element (remove()).

If you ask for N random numbers from the class you get N * O(N) == O(N^2) cost.
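A sketch of this List-based version (again, the class name is mine; I’ve used an ArrayList, so remove() does the linear-time compaction described above):

```java
import java.util.ArrayList;
import java.util.List;

// List-based approach: pick a random index into the shrinking list
// of not-yet-returned numbers. remove() makes each call O(N) with
// an ArrayList, so N calls cost O(N^2) overall.
public class ListUniqueRandom {
    private final List<Integer> remaining = new ArrayList<>();

    public ListUniqueRandom(int n) {
        for (int i = 0; i < n; i++) {
            remaining.add(i);
        }
    }

    public int nextInt() {
        if (remaining.isEmpty()) {
            return -1;
        }
        int idx = (int) (Math.random() * remaining.size());
        return remaining.remove(idx); // O(N): shifts the tail elements
    }
}
```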

3.2. solution using Java collections O(N)

This solution does O(2N) work in the ctor and then O(1) for each call. It’s “cheating” a little bit, because all of the hard algorithmic work is done by a single call to Collections.shuffle().

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class UniqueRandom {
    private List<Integer> numbers;
    private int pos = 0;

    public UniqueRandom(int n) {
        numbers = new ArrayList<Integer>(n);
        for (int i = 0; i < n; i++) {
            numbers.add(i);
        }
        Collections.shuffle(numbers);
    }

    public int nextInt() {
        if (pos == numbers.size()) {
            return -1;
        }
        return numbers.get(pos++);
    }

    public static void main(String[] args) {
        UniqueRandom ur = new UniqueRandom(35);
        for (int i = 0; i < 40; i++) {
            System.out.println(ur.nextInt());
        }
    }
}

3.3. solution using an int[]

This solution does O(1N) work in the ctor and then O(1) for each call.

public class UniqueRandom2 {
    private int[] numbers;
    private int avail;

    public UniqueRandom2(int n) {
        avail = n;
        numbers = new int[avail];
        for (int i = 0; i < avail; i++) {
            numbers[i] = i;
        }
    }

    public int nextInt() {
        if (avail == 0) {
            return -1;
        }
        int idx = (int) (Math.random() * avail);
        int retval = numbers[idx];
        // swap the last available number into the slot we just used
        numbers[idx] = numbers[--avail];
        return retval;
    }

    public static void main(String[] args) {
        UniqueRandom2 ur = new UniqueRandom2(1000);
        for (int i = 0; i < 40; i++) {
            System.out.println(ur.nextInt());
        }
    }
}

Thanks to Ryan Hamilton for reminding me about this so I had an excuse to dig up my sample code and post it online.

Commercial Graph: A Map of Financial Relationships

I’m speaking today about Intuit’s Commercial Graph at the Strata + Hadoop World Conference. Slides: Commercial Graph: A Map of Financial Relationships (pptx format).

Abstract

Imagine the social graph where personal relationships are replaced by commercial relationships based on real financial data. Imagine the possibilities for small businesses to grow, connect, transact and prosper.

Intuit is uniquely qualified to achieve just this. We are entrusted with the collective data of 50 million consumers and small businesses. It is a unique pool of data that covers the financial spectrum – ranging from individual purchase history to business inventories.

At Intuit, we are building the Commercial Graph with the consumer and small business data from products like Mint.com, Quicken, and QuickBooks.

We take millions of user-entered, and hence unstructured, business descriptions and billions of transactions, apply Hadoop-based deduplication algorithms for normalization, and use machine learning for categorization. To better understand the graph, we compute metrics such as connected components, centrality, and commercial PageRank.

We will examine several applications of the commercial graph, including finding more customers like your best customers, optimizing your vendors, and making relevant offers and recommendations to help our customers make and save money.

A deep dive on the technical architecture will discuss the use of Giraph as a Hadoop-based large-scale graph processing platform and neo4j as a real-time graph datastore.

Software Engineer, Java – Click Fraud Prevention

Want to build something that hunts down the bad guys and puts ’em out of business? Got experience building complex systems in Java? Fraudwall Technologies has the job for you.

We’re looking for engineers at all experience levels who want to help build a massive data processing and modeling pipeline, using cutting-edge machine learning and network forensics. You’ll be writing code that will make real-time decisions to prevent click fraud, and there’s going to be a fire hose of data coming at you.

This particular job comes with as much responsibility as you can handle. You won’t just be writing code; you’ll be doing design, architecture, implementation, testing, support, and more. Passion, talent, and raw brains are more important than tons of industry experience.

Required experience:

* 3-5 years of software development in Java (top-notch C++ and C# engineers can apply, too)

* Superb understanding of data structures and algorithms

* Effective communication skills: you’ll have to be able to fluently communicate with modelers/analysts, business people, and other coders

* Experience with Unix/Linux, and relational databases such as MySQL or Oracle

* BS or MS in Computer Science or equivalent

Desirable experience:

* Machine learning, information retrieval, TCP/IP internals

* Java frameworks: Hibernate, Servlets, Jakarta Commons

* Proficiency with scripting languages such as Python or Perl

About the company:

Fraudwall Technologies provides advertising networks and advertisers with a pioneering solution for identifying click fraud. Fraudwall combines cutting edge science with the aggregation of data and characteristics from networks, search engines, and advertisers into one complete scalable solution.

Fraudwall values honesty and integrity in dealing with each other and with our partners and customers. We offer competitive salaries, 401K, stock options, and health, dental, and vision plans. And of course, we provide an opportunity to work with world-class fraudfighters, systems builders, and serial entrepreneurs.

All positions are for our office in Palo Alto, California.

Send your resume to michael.radwin@fraudwall.net

Threads considered harmful

In the past month I’ve seen at least 3 messages on the development email lists at work asking questions about developing multi-threaded applications. From a software engineering standpoint, this troubles me.

I’ve always thought that multi-threaded apps in C/C++ are simply too difficult for most engineers to understand. There’s too much non-determinism, too many race conditions, and too few language-level constructs to keep yourself from screwing up.

This isn’t to say that some engineers can’t figure it out, it’s just that most engineers can’t. I’ll borrow a diagram from Ousterhout to illustrate this point:

What's Wrong With Threads?

John Ousterhout, Why Threads Are a Bad Idea (for most purposes), 1996. PDF slides from USENIX 1996 talk (local mirror).

I’ve been reading The Art of UNIX Programming by Eric Raymond over the past few weeks, and it appears that he agrees with me. He avoids the Dijkstra-esque pun on threads being harmful and instead prefers the equally provocative title Threads — Threat or Menace?

My attitude about threads in Java is different, because the language has supported the concept of threads since day one. It’s still tricky to do threads correctly in Java, but not as painful as it is in C++.

XML for Makefiles?

XML hasn’t cured our ills or saved the world, but people keep using it for absurd purposes anyway.

I finally took a quick look at Apache Ant today to see what all the fuss is about. Apparently with some additional components you can actually get Ant to build C/C++ code.

However, compare this build.xml for Ant:


<?xml version="1.0"?>
<project name="Hello" default="hello" basedir=".">
  <taskdef resource="cpptasks.tasks"/>
  <taskdef resource="cpptasks.types"/>
  <target name="hello">
    <cc name="gcc" outfile="hello">
      <fileset dir="." includes="hello.c"/>
      <compilerarg value="-O2"/>
    </cc>
  </target>
</project>

with this Makefile for gmake:


hello: hello.c
	gcc -O2 $< -o $@

I think I’ll stick with gmake for now.

How to Be a Programmer

I stumbled across How to Be a Programmer, a 40-page paper by Robert L. Read, a principal engineer at Hire.com.

It’s a relatively good paper so I’d recommend it to anyone who’s new to the field or is a college student considering a career in Software Engineering. The distinction between Computer Science and Software Engineering, while subtle, is an important one. This paper focuses more on the Software Engineering side of things, spending a good 50% of the time discussing interpersonal skills and how to be effective working with your team.

The paper does need some polishing, however. A simple grammar checker would catch a bunch of the mistakes that interrupt the flow.

This reminds me a little bit of a great lecture I heard by Leslie Pack Kaelbling back in 1996 about why she loves programming. Like Read, Kaelbling believes that debugging is the most important part of programming, but she spins it slightly differently.

In short, debugging is like detective work. You’ve got a problem that you need to solve, but it’s not obvious what the solution is. There are little hints here and there, and you begin to investigate each one. Each clue brings you closer and closer to the solution, but sometimes you realize that you just spent the last 6 hours going down a path that led nowhere, and you need to start over again. But at each moment, you always feel like you’re making forward progress.

As a consequence, debugging becomes an all-engrossing activity. It’s impossible to walk away from your desk when you’re just 5 minutes away from solving the mystery and fixing the bug! Of course, 20 minutes later, you still feel like you’ll get it nailed in another five.

MySQL Users Conference 2003

The MySQL Users Conference 2003 is running from April 10 – 12 in San Jose, CA. I was nearby in Sunnyvale for work on Tuesday & Wednesday this week, so I stuck around a day longer than my usual LAX-SJC travel schedule to catch the beginning of the conference.

Thanks to Zak for all of his hard work organizing the show. The first day was great; I’m sorry I’ll be missing the rest of it.

The State of the Dolphin Address

David Axmark and Monty Widenius, creators of MySQL (and co-founders of MySQL AB), kicked off the event with “The State of the Dolphin Address.”

The first 15 minutes of the presentation was all bragging — they listed off some big customers (such as Yahoo! and Slashdot), awards they had won, and some notable events in the lifetime of the product and company. Axmark takes great pride in the fact that Oracle introduced a MySQL migration kit in 2001.

Speaking a little bit about MySQL AB, Axmark indicated that they now have 12 full time engineers working on the server, and dozens of customer support folks. They’ve been making money via commercial licenses (for companies that don’t want to GPL their code), and also from selling support, training, certification and consulting. The recently-introduced MySQL Certification program costs $200 (with a $50 discount until this fall).

As a product, MySQL has a variety of features. Aside from supporting “an extended subset” of the ANSI SQL89 standard, they support ACID transactions, User Defined Functions (unfortunately not the same thing as Stored Procedures), and a handful of SQL extensions (such as SELECT … LIMIT). Client interfaces are available in over a dozen programming languages and operating systems.

It also provides about 5 different storage engines (MyISAM, InnoDB, Hash/InMemory, BerkeleyDB, etc.), which allow different tradeoffs depending on application needs. For example, if you need fast row-level locking, you should pick InnoDB and recognize that there will be some extra overhead on inserts.

Axmark also bragged a bit about the eWeek benchmarking tests which compared MySQL, Oracle9i, and a handful of other relational databases using JDBC drivers in a web server environment on Microsoft Windows. The MySQL performance curve (in terms of web pages per second and latency) matched Oracle’s and outperformed all others.

Lastly, the two co-founders gave a high-level overview of the various server versions (3.23, 4.0, 4.1, 5.0) and some new interesting features coming soon.

Schmoozing

After the keynote, I grabbed coffee and a pastry and chatted a bit in the hallway with Rasmus and Zak. Zak introduced me to Sascha (the one from Utah) and Monty. No business cards, just a few handshakes.

Someone (not the person in the picture) asked Rasmus a question about using the PHP mail() function to send hundreds of thousands of messages.

I was tickled to see Brad from Zend; I saw him in Israel just a couple of weeks earlier.

Using MySQL Replication in Large Scales

I stepped into Jeremy‘s standing-room only talk on “Using MySQL Replication in Large Scales.”

Being a MySQL novice, I didn’t understand much of the talk. It’s always a neat experience to surround yourself with a technical environment where everyone around you knows more than you do – a good way to pick up a bunch of ideas. There were a ton of questions posed by the audience during the talk; it’s rare to see that level of interaction with an audience this large.

Aesthetic note: Jeremy finally switched his slide colors from white-on-blue to the more boring (but easy to read) black-on-white.

Lunch

Lunch was pretty good. Lots of vegetarian options. I sat at a table full of Yahoos and Brian Aker. It started to rain, so we all scrambled inside. We went to hear the talk about Lufthansa Systems porting MySQL to NetWare. Novell is desperate to remain relevant, and it looks like they’re trying to embrace Open Source as a way to stay alive.

A Guided Tour of the MySQL Source Code

Monty and Zak’s talk on A Guided Tour of the MySQL Source Code was a great introduction to a codebase I’ve never read before. The 5.0 source code became available via BitKeeper just a few days ago.

Unfortunately, the talk was plagued by technical difficulties. The LCD projector just wouldn’t cooperate with the laptop. Zak had a copy of the presentation on a floppy disk, but nobody else in the room had a laptop that could read it. Bummer.