Technical Interview: movie SQL schema design

When interviewing candidates, sometimes we need to know if they grok SQL. Here’s a simple interview question that gives me an idea that someone has had to create a schema from scratch, vs. always using a schema that someone else created.

Setup

We’re going to design a movies database. Each movie has a title and year, one (and only one) director, and some number of actors. Actors can star in multiple movies. Directors can direct multiple movies. Some movies have the same title such as Ocean’s Eleven (the 2001 version directed by Steven Sodenbergh had George Clooney, Brad Pitt, Matt Damon, Julia Roberts, and many others, but the 1960 version was directed by Lewis Milestone and starred Frank Sinatra, Dean Martin and Sammy Davis Jr).

The schema should be normalized enough to avoid duplicating strings too much, and also to be able to efficiently answer questions like these two:

  • Who acted in Fight Club (1999)?
  • What are the 10 most recent movies that George Clooney starred in?

Questions

Use your judgement about what a good answer is in terms of what to name the tables, what keys they should have. To be consistent, we’ll ask every candidate these 3 questions:

  1. How many tables?
  2. What are the names of the tables?
  3. What are the names of each column in those tables? What are the types of each column?

When designing the schema, remind the candidate that we’re not looking to over-design the perfect, most flexible schema. We want the simplest possible schema that answer questions like the two aforementioned ones. We can mention that if time permits, we’ll ask the candidate to actually write the query for one of them.

Schema design

One possible answer (and there are many others that are acceptable):

movie
--------
id (int PK)
title (varchar)
year (int)
director_id (int)
 
person
----------
id (int pk)
name (varchar)
 
movie_actor
-----------
actor_id (int)
movie_id (int)

Sometimes candidates will get excited by the different roles (actors vs directors) and want to create a role table. Sometimes they’ll go to town on the person attributes (date of birth, splitting first and last names into separate columns). Some movie buff candidates will also point out that in real life, movies can have multiple directors (for example, The Matrix was directed by the Wachowski Brothers).

These are outside of the artificially simplified constraints we specified earlier, but can be fine if the candidate is really good at relational database design and fast at articulating their thoughts. But the candidate can also get distracted, which is a good simulation of what often happens with real-world business problems. Can the candidate focus on only the essentials?

If the candidate gets really distracted, you can invoke the YAGNI principle. This could be an opportunity to ask the candidate to share an anecdote about when they had to apply YAGNI.

Query

If the candidate is doing well, we ask them to write the SQL for either or both of the aforementioned queries.

What are the 10 most recent movies that George Clooney starred in?

SELECT m.title,m.year
FROM movie m, movie_actor ma, person p
WHERE p.name = 'George Clooney'
AND ma.actor_id = p.id
AND ma.movie_id = m.id
ORDER BY m.year DESC
LIMIT 10;

Again, there are many correct solutions (such as using an explicit JOIN instead of a natural join as we’ve done above). And yes, we know that Oracle doesn’t support LIMIT clauses, so you can do a nested query and then use the first ten rownums. The goal here isn’t to get into vendor-specific SQL syntax discussion, but more to see if they can write a query that uses more than one table.

“Unique” Random Numbers technical interview question

Dan Karp used to ask this question at Zimbra. I’ve always loved it.

1. warmup

Given this API from the java.lang standard Math class in the JDK:

/**
* Returns a double value with a positive sign,
* greater than or equal to 0.0 and less than 1.0.
* Returned values are chosen randomly with
* uniform distribution from that range.
*/
public static double Math.random();

Your task is to implement a random integer in the range [0,999]. Don’t overthink it; this is just a warmup.

Solution:

public class Foo {
public static int randomInt1000() {
return (int) (Math.random() * 1000);
}
}

Idea is to truncate a number like 123.99748 to 123 by casting a double to an int. No need to use Math.floor(), that just wastes a function call and does some floating-point arithmetic.

Some candidates will want to avoid hard-coding the 1000 constant, and will create a function signature like this:

public static int randomInt(int N);

Great. Either answer is totally fine.

2. introduce the idea of unique-ness

Things are about to get harder.

If I call your Foo.randomInt1000() function a bunch of times, I’ll get values like this:

316, 675, 22, 671, 724, 394, 31, 177, 997, …, 82, 671, 810, 433, …

If you call the function enough times, of course you’ll see the same number more than once.

Create a new function called uniqueRandomInt1000() that returns a random integer in the range [0,999] inclusive but doesn’t return duplicates. Your function will need to keep some state, and it’s up to you how to manage that state.

We’re looking for (1) a correct solution to the problem, and (2) an efficient solution. Correctness is more important than efficiency. But efficiency matters, too.

Edge case: if the caller calls uniqueRandomInt1000() more than 1000 times, the implementation needs to signal in some way that there are no more unique random numbers that can be returned because every number in the range [0,999] has been returned to the caller. Either return -1 or throw an Exception.

3.0. a naive, correct, but O(unbounded) solution

Keep an array of booleans or a java.util.BitSet, or a Set<Integer> of every number you’ve returned as an instance variable for your class. Every time the method gets called, generate a random integer in the range 0,999 and then check your instance variable to see if that number has been “seen” by the caller. If so, randomly generate another integer in the range 0,999 and repeat until you find an “unseen” number. Mark that number as seen and return to the caller.

3.1. a correct, O(N^2) solution using a List

Key idea here to avoid looping is to generate a random index in an ever-smaller range of numbers, and use that index to select a number from a LinkedList or an ArrayList.

In the ctor you populate a List with 1000 elements – every number between 0,999. That’s O(N) time.

Then, in the function, you generate a random index in the range 0,size()-1. size() is initially 1000, but shrinks over time as you remove() elements from the List.

get() followed by remove() is always O(N). With an ArrayList, you get constant time indexing into the position you want to retrieve (get()), but then you spend linear time copying elements over to compact the list (remove()). With a LinkedList you get linear scanning to move to the node at a particular index (get()), and then you get constant time to tidy up the list to remove that element (remove()).

If you ask for N random numbers from the class you get N * O(N) == O(N^2) cost.

3.2. solution using Java collections O(N)

This solution does O(2N) in the ctor and then O(1) for each call. It’s kinda “cheating” a little bit because all of the hard algorithmic work is done by a single function call Collections.shuffle().

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class UniqueRandom {
private List numbers;
private int pos = 0;

public UniqueRandom(int n) {
numbers = new ArrayList(n);
for (int i = 0; i < n; i++) { numbers.add(i); } Collections.shuffle(numbers); } public int nextInt() { if (pos == numbers.size()) { return -1; } return numbers.get(pos++); } public static void main(String[] args) { UniqueRandom ur = new UniqueRandom(35); for (int i = 0; i < 40; i++) { System.out.println(ur.nextInt()); } } }[/code]

3.3. solution using an int[]

This solution does O(1N) in the ctor and then O(1) for each call.

public class UniqueRandom2 {
private int[] numbers;
private int avail;

public UniqueRandom2(int n) {
avail = n;
numbers = new int[avail];
for (int i = 0; i < avail; i++) { numbers[i] = i; } } public int nextInt() { if (avail == 0) { return -1; } int idx = (int) (Math.random() * avail); int retval = numbers[idx]; numbers[idx] = numbers[--avail]; return retval; } public static void main(String[] args) { UniqueRandom2 ur = new UniqueRandom2(1000); for (int i = 0; i < 40; i++) { System.out.println(ur.nextInt()); } } }[/code] Thanks to Ryan Hamilton for reminding me about this so I had an excuse to dig up my sample code and post it online.