Technical Interview: movie SQL schema design

When interviewing candidates, sometimes we need to know if they grok SQL. Here’s a simple interview question that gives me an idea that someone has had to create a schema from scratch, vs. always using a schema that someone else created.

Setup

We’re going to design a movies database. Each movie has a title and year, one (and only one) director, and some number of actors. Actors can star in multiple movies. Directors can direct multiple movies. Some movies have the same title such as Ocean’s Eleven (the 2001 version directed by Steven Sodenbergh had George Clooney, Brad Pitt, Matt Damon, Julia Roberts, and many others, but the 1960 version was directed by Lewis Milestone and starred Frank Sinatra, Dean Martin and Sammy Davis Jr).

The schema should be normalized enough to avoid duplicating strings too much, and also to be able to efficiently answer questions like these two:

  • Who acted in Fight Club (1999)?
  • What are the 10 most recent movies that George Clooney starred in?

Questions

Use your judgement about what a good answer is in terms of what to name the tables, what keys they should have. To be consistent, we’ll ask every candidate these 3 questions:

  1. How many tables?
  2. What are the names of the tables?
  3. What are the names of each column in those tables? What are the types of each column?

When designing the schema, remind the candidate that we’re not looking to over-design the perfect, most flexible schema. We want the simplest possible schema that answer questions like the two aforementioned ones. We can mention that if time permits, we’ll ask the candidate to actually write the query for one of them.

Schema design

One possible answer (and there are many others that are acceptable):

movie
--------
id (int PK)
title (varchar)
year (int)
director_id (int)
 
person
----------
id (int pk)
name (varchar)
 
movie_actor
-----------
actor_id (int)
movie_id (int)

Sometimes candidates will get excited by the different roles (actors vs directors) and want to create a role table. Sometimes they’ll go to town on the person attributes (date of birth, splitting first and last names into separate columns). Some movie buff candidates will also point out that in real life, movies can have multiple directors (for example, The Matrix was directed by the Wachowski Brothers).

These are outside of the artificially simplified constraints we specified earlier, but can be fine if the candidate is really good at relational database design and fast at articulating their thoughts. But the candidate can also get distracted, which is a good simulation of what often happens with real-world business problems. Can the candidate focus on only the essentials?

If the candidate gets really distracted, you can invoke the YAGNI principle. This could be an opportunity to ask the candidate to share an anecdote about when they had to apply YAGNI.

Query

If the candidate is doing well, we ask them to write the SQL for either or both of the aforementioned queries.

What are the 10 most recent movies that George Clooney starred in?

SELECT m.title,m.year
FROM movie m, movie_actor ma, person p
WHERE p.name = 'George Clooney'
AND ma.actor_id = p.id
AND ma.movie_id = m.id
ORDER BY m.year DESC
LIMIT 10;

Again, there are many correct solutions (such as using an explicit JOIN instead of a natural join as we’ve done above). And yes, we know that Oracle doesn’t support LIMIT clauses, so you can do a nested query and then use the first ten rownums. The goal here isn’t to get into vendor-specific SQL syntax discussion, but more to see if they can write a query that uses more than one table.