Movie Recommendations (Part 2)

Posted by Christopher Mertin on February 19, 2017 in Project • 13 min read

The user ratings for this project were downloaded from the MovieLens dataset, using the ml-latest-small.zip archive, which consists of 100,000 ratings from 700 users on 9,000 movies.

In this implementation I opt to use a neural network built on user and movie embeddings, rather than the approach of Part 1. A benefit of this is that the model can learn a bias term for each movie and each user, making the predicted ratings more accurate. However, this also means that it won’t be able to give the kind of “general results” seen in Part 1, because the predict function only works for a user already in the dataset, predicting how that user will rate a movie.

There are ways around this. For example, after training the model we could have a new user rate a certain number of movies, then use K-Nearest Neighbors to find the most similar existing user and just use that user’s values; a sketch of this follows below. One benefit the NN does bring is that it lets us use a validation set to measure our model’s accuracy.
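As a rough sketch of that workaround (purely illustrative: closest_user and new_ratings are hypothetical names, and this assumes the ratings DataFrame loaded in the next section), a simple nearest-neighbor match over co-rated movies might look like

import numpy as np

def closest_user(new_ratings, ratings, min_overlap=5):
    # new_ratings: dict mapping movieId -> rating for the new user
    best_user, best_dist = None, np.inf
    for uid, grp in ratings.groupby("userId"):
        # Only compare on movies both users have rated
        shared = grp[grp.movieId.isin(set(new_ratings))]
        if len(shared) < min_overlap:
            continue
        # RMSE between the two users' ratings on the shared movies
        diffs = shared.rating.values - np.array([new_ratings[m] for m in shared.movieId])
        dist = np.sqrt(np.mean(diffs ** 2))
        if dist < best_dist:
            best_user, best_dist = uid, dist
    return best_user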

It’s important to note that this is done with the small version of the dataset, so some of these results, such as the top movies of all time, may be off. However, this code will run just the same on the larger MovieLens dataset, though it would take much longer to run.

Latent Factors

The data is read in the following format

ratings = pd.read_csv("ratings.csv")
ratings.head()
userId movieId rating timestamp
0 1 31 2.5 1260759144
1 1 1029 3.0 1260759179
2 1 1061 3.0 1260759182
3 1 1129 2.0 1260759185
4 1 1172 4.0 1260759205

which shows that the ratings file consists of a userId, a movieId, the rating the user gave that movie, and a timestamp. This is similar to the data in Part 1, and the timestamp is not required here either.

We can read in the movie names too to make the output more readable, giving

# Create a dictionary for the movie id's to the names
movie_names = pd.read_csv("movies.csv").set_index("movieId")["title"].to_dict()
# Create a list of unique user id's
users = ratings.userId.unique()
# Create a list of unique movie id's
movies = ratings.movieId.unique()
# Create a dictionary of each to make the transition between the two easier
userid2idx = {o:i for i,o in enumerate(users)}
movieid2idx = {o:i for i,o in enumerate(movies)}

One reason for the above is to make the translation between the raw data and a readable format easier. This is especially the case for movieid2idx. However, another reason is to allow the use of latent factors.
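One practical step worth making explicit, since the embedding layers used later expect contiguous 0-based indices rather than the raw, sparse MovieLens ids: we can apply these dictionaries to remap the ids in place. The later snippets assume something like this has been run, and n_users/n_movies are also defined here.

# Replace the raw, sparse ids with contiguous 0-based indices, so they can
# be used directly as embedding row indices
ratings.userId = ratings.userId.apply(lambda x: userid2idx[x])
ratings.movieId = ratings.movieId.apply(lambda x: movieid2idx[x])
# Sizes of the embedding matrices built later
n_users = ratings.userId.nunique()
n_movies = ratings.movieId.nunique()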

Latent factors are factors that are unknown when we build the model; the model learns them during training. For example, when building this collaborative filtering model from users’ ratings, one of the factors may be how much a movie falls into the “sci-fi” category. The interesting part is that we don’t need any metadata for our model to learn this: it learns it based purely on users’ ratings of all of the movies.

Each movie gets a value for each “factor” that is learned, and these can be used to find movies that are similar to each other in an \(N\)-dimensional space. This is also one of the best ways to visualize these methods on a plot, so we can see how spatially close movies are; this is done in the section after the implementation. The factorization also allows for the inclusion of a “bias” to account for users who naturally rate movies higher than others, and for movies that tend to be rated higher than average. This makes for a much more accurate model than the one built in Part 1.
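Concretely, if \(u\) and \(m\) are a user’s and a movie’s latent factor vectors and \(b_u\) and \(b_m\) are their bias terms, the model built below predicts the rating as

\[ \hat{r}_{um} = u \cdot m + b_u + b_m \]

and training adjusts all four quantities to minimize the squared error against the known ratings.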

Building the recommendation system

In order to understand the data more, we can create a subset of the data which contains the most popular movies and the users that have rated the most movies. This can be done with the following

# Grab the 15 users and movies with the most ratings
g = ratings.groupby("userId")["rating"].count()
topUsers = g.sort_values(ascending=False)[:15]
g = ratings.groupby("movieId")["rating"].count()
topMovies = g.sort_values(ascending=False)[:15]
# Inner-join so only rows from the top users and movies remain
top_r = ratings.join(topUsers, rsuffix="_r", how="inner", on="userId")
top_r = top_r.join(topMovies, rsuffix="_r", how="inner", on="movieId")
# Create a cross-tab of the data
pd.crosstab(top_r.userId, top_r.movieId, top_r.rating, aggfunc=np.sum)

In looking at the output, we get the following table, where NaN is where a user didn’t rate that specific movie.

movieId  27   49   57   72   79   89   92   99  143  179  180  197  402  417  505
userId
14 3.0 5.0 1.0 3.0 4.0 4.0 5.0 2.0 5.0 5.0 4.0 5.0 5.0 2.0 5.0
29 5.0 5.0 5.0 4.0 5.0 4.0 4.0 5.0 4.0 4.0 5.0 5.0 3.0 4.0 5.0
72 4.0 5.0 5.0 4.0 5.0 3.0 4.5 5.0 4.5 5.0 5.0 5.0 4.5 5.0 4.0
211 5.0 4.0 4.0 3.0 5.0 3.0 4.0 4.5 4.0 NaN 3.0 3.0 5.0 3.0 NaN
212 2.5 NaN 2.0 5.0 NaN 4.0 2.5 NaN 5.0 5.0 3.0 3.0 4.0 3.0 2.0
293 3.0 NaN 4.0 4.0 4.0 3.0 NaN 3.0 4.0 4.0 4.5 4.0 4.5 4.0 NaN
310 3.0 3.0 5.0 4.5 5.0 4.5 2.0 4.5 4.0 3.0 4.5 4.5 4.0 3.0 4.0
379 5.0 5.0 5.0 4.0 NaN 4.0 5.0 4.0 4.0 4.0 NaN 3.0 5.0 4.0 4.0
451 4.0 5.0 4.0 5.0 4.0 4.0 5.0 5.0 4.0 4.0 4.0 4.0 2.0 3.5 5.0
467 3.0 3.5 3.0 2.5 NaN NaN 3.0 3.5 3.5 3.0 3.5 3.0 3.0 4.0 4.0
508 5.0 5.0 4.0 3.0 5.0 2.0 4.0 4.0 5.0 5.0 5.0 3.0 4.5 3.0 4.5
546 NaN 5.0 2.0 3.0 5.0 NaN 5.0 5.0 NaN 2.5 2.0 3.5 3.5 3.5 5.0
563 1.0 5.0 3.0 5.0 4.0 5.0 5.0 NaN 2.0 5.0 5.0 3.0 3.0 4.0 5.0
579 4.5 4.5 3.5 3.0 4.0 4.5 4.0 4.0 4.0 4.0 3.5 3.0 4.5 4.0 4.5

The way latent factors work is that we define some number \(x\) of factors per user and per movie. A dot product is then performed between the user’s latent factors and the movie’s latent factors to produce the values in the table above. Backpropagation is then used to adjust the factors so that the model fits each user’s ratings.

One thing we need to change, though, is that we need an “embedding.” Normally, when building a neural network, one-hot encoding is used to select the category/column we want to pull out. This would mean that if we had 1,000 users, we would need a 1,000-dimensional vector for each user, which makes both learning and prediction very slow.

Thankfully, we can use Keras’ functional API to create an embedding layer. All an embedding layer does is take an integer as input and grab the corresponding row of a weight matrix as output. This allows us to use something along the lines of [0] to grab the latent factors of the first movie, rather than having to perform a dot product between the one-hot vector [1,0,0,...,0] and the full factor matrix [[X1,X2,...,Xn],[...],...].
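To make the equivalence concrete, here is a small standalone numpy sketch (illustrative only; the names and sizes are made up) showing that an embedding lookup returns exactly the same factors as the one-hot dot product

import numpy as np

n_movies_demo, n_factors_demo = 5, 3
# One row of latent factors per movie
factors = np.random.rand(n_movies_demo, n_factors_demo)

# One-hot approach: a vector of zeros with a single 1, then a dot product
one_hot = np.zeros(n_movies_demo)
one_hot[0] = 1
via_dot = np.dot(one_hot, factors)

# Embedding approach: just index the row directly
via_lookup = factors[0]

assert np.allclose(via_dot, via_lookup)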

First implementation

This implementation is a very basic model just following the outline of the ideas proposed above. In the following section, we will implement a typical neural network, which will wind up producing better results. The code in this post assumes the usual scientific-Python imports plus the Keras 1.x functional API, which was current at the time of writing; a minimal set of imports would be something like
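import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from operator import itemgetter

from keras.layers import Input, Embedding, Flatten, Dense, Dropout, merge
from keras.models import Model
from keras.optimizers import Adam
from keras.regularizers import l2

With those in place, we can build our embedding layer as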

def embedding_input(name, n_in, n_out, reg):
    # n_in: number of distinct ids, n_out: number of latent factors per id
    inp = Input(shape=(1,), dtype="int64", name=name)
    return inp, Embedding(n_in, n_out, input_length=1, W_regularizer=l2(reg))(inp)

Following this, we need to transform our inputs so that they utilize the embedding, where n_factors is our number of latent factors. This was defined as n_factors = 50.

user_in, u = embedding_input("user_in", n_users, n_factors, 1e-4)
movie_in, m = embedding_input("movie_in", n_movies, n_factors, 1e-4)

We also need to separate our training and validation sets, which can be done in a few lines by creating a random mask

# Randomly hold out ~20% of the ratings for validation
msk = np.random.rand(len(ratings)) < 0.8
trn = ratings[msk]
val = ratings[~msk]

We also need to create a function for building the bias terms, again using Keras’ Embedding layer

def create_bias(inp, n_in):
    # One learned scalar per user/movie, looked up the same way as the factors
    x = Embedding(n_in, 1, input_length=1)(inp)
    return Flatten()(x)

We can use this to create our final inputs with the bias terms added by

ub = create_bias(user_in, n_users)
mb = create_bias(movie_in, n_movies)

We can now create our model, using the Keras merge function (part of the functional API) to combine the inputs. [Note: the lowercase merge function is different from the Merge layer class.] In doing so, we can build our model as the following

x = merge([u, m], mode='dot')   # dot product of user and movie factors
x = Flatten()(x)
x = merge([x, ub], mode='sum')  # add the user bias
x = merge([x, mb], mode='sum')  # add the movie bias
model = Model([user_in, movie_in], x)
model.compile(Adam(0.001), loss='mse')

And we can train our model by using

model.fit([trn.userId, trn.movieId], trn.rating, batch_size=64, nb_epoch=1,
      validation_data=([val.userId, val.movieId], val.rating))

The model can be trained over multiple epochs, varying the step-size (learning rate) between runs by setting model.optimizer.lr = 0.001. This is useful, as a decrease in step-size helps with convergence.
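As a sketch of such a schedule (the rates and epoch counts here are illustrative, not the exact values used for the results below):

# Start with a larger learning rate, then lower it to help convergence
model.optimizer.lr = 0.01
model.fit([trn.userId, trn.movieId], trn.rating, batch_size=64, nb_epoch=3,
          validation_data=([val.userId, val.movieId], val.rating))
model.optimizer.lr = 0.001
model.fit([trn.userId, trn.movieId], trn.rating, batch_size=64, nb_epoch=2,
          validation_data=([val.userId, val.movieId], val.rating))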

In training our model, we get a final loss of

Epoch 5/5
80032/80032 [==============================] - 5s - loss: 0.7689 - val_loss: 1.1462

This is worse than the state of the art, where losses hover around 0.9. However, we can do better by building a neural network.

Before we do that, though, we can predict how a user will rate a movie. This is done with the following function call

model.predict([np.array([3]), np.array([6])])

which predicts how user 3 would rate movie 6. The result is array([[ 5.1047]], dtype=float32), meaning there’s a very good chance that user 3 would like movie 6!
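Extending this into actual recommendations is a matter of scoring every movie for a user and keeping the best. A hypothetical sketch, using the variables defined earlier:

# Predict a rating for every movie for user 3, then list the ten best
user_idx = 3
movie_idxs = np.arange(n_movies, dtype="int64")
user_idxs = np.full(n_movies, user_idx, dtype="int64")
preds = model.predict([user_idxs, movie_idxs]).flatten()
top10 = preds.argsort()[::-1][:10]
print([movie_names[movies[i]] for i in top10])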

Analysis of Model

Before moving on to the neural network, we can visualize how our model is performing to understand how it works. This can be done by looking at the latent factors and the values each movie has for them.

In order to make the results more interesting, we can restrict our analysis to only the top 2,000 most popular movies. This is done by

g = ratings.groupby("movieId")["rating"].count()
topMovies = g.sort_values(ascending=False)[:2000]
topMovies = np.array(topMovies.index)

First, we can look at the movie bias term. We can do this by creating a “model” whose input is a movie index and whose output is the corresponding movie bias.

get_movie_bias = Model(movie_in, mb)
movie_bias = get_movie_bias.predict(topMovies)
movie_ratings = [(b[0], movie_names[movies[i]]) for i,b in zip(topMovies, movie_bias)]

We can now use the above to look at the top- and bottom-rated movies in our dataset. The 15 worst-rated (smallest bias) are given by

sorted(movie_ratings, key=itemgetter(0))[:15]

Resulting in the output of

  1. -0.36287931, 'Battlefield Earth (2000)'
  2. -0.14330482, 'Speed 2: Cruise Control (1997)'
  3. -0.11990164, 'Blade: Trinity (2004)'
  4. -0.072836787, '2 Fast 2 Furious (Fast and the Furious 2, The) (2003)'
  5. -0.055303905, 'House on Haunted Hill (1999)'
  6. -0.044537157, 'Super Mario Bros. (1993)'
  7. -0.040916786, 'Jaws 3-D (1983)'
  8. -0.02507983, 'Little Nicky (2000)'
  9. 0.0028442228, 'Police Academy 6: City Under Siege (1989)'
  10. 0.029959977, 'Spice World (1997)'
  11. 0.0312222, 'Two Weeks Notice (2002)'
  12. 0.043953076, 'Honey, I Blew Up the Kid (1992)'
  13. 0.057045761, "You Don't Mess with the Zohan (2008)"
  14. 0.059393514, 'Police Academy 5: Assignment: Miami Beach (1988)'
  15. 0.065637641, 'Hollow Man (2000)'

Just glancing at the list, it seems to make sense. For example, 2 Fast 2 Furious is among the top 5 worst movies in the dataset, and the series has been falling in ratings with each successive movie. Another indication is Jaws 3-D, which is notorious for its bad ratings.

On the flip side, we have the best-rated movies. These were obtained with the following code, with the formatted output found below

sorted(movie_ratings, key=itemgetter(0), reverse=True)[:15]
  1. 1.4145077, 'Band of Brothers (2001)'
  2. 1.3615623, 'Rush (2013)'
  3. 1.3285905, 'Tom Jones (1963)'
  4. 1.3229195, 'Shawshank Redemption, The (1994)'
  5. 1.3128592, "Howl's Moving Castle (Hauru no ugoku shiro) (2004)"
  6. 1.2950047, 'General, The (1926)'
  7. 1.2875359, 'Harry Potter and the Deathly Hallows: Part 2 (2011)'
  8. 1.2791849, 'Cyrano de Bergerac (1990)'
  9. 1.2767351, "Amores Perros (Love's a Bitch) (2000)"
  10. 1.2723553, 'My Neighbor Totoro (Tonari no Totoro) (1988)'
  11. 1.266189, 'Gold Rush, The (1925)'
  12. 1.2657483, 'Argo (2012)'
  13. 1.2558265, 'You Can Count on Me (2000)'
  14. 1.2550379, 'Fog of War: Eleven Lessons from the Life of Robert S. McNamara, The (2003)'
  15. 1.2471849, 'Exotica (1994)'

While I haven’t seen many of the movies on this list (it gives me something to watch in the future!), I do notice quite a few movies that are known for being good. For example, Band of Brothers and Shawshank Redemption top the list and are among most people’s all-time favorites.

From here, we can also analyze the embeddings to see what they look like. If you recall, we used 50 embeddings/factors; we can use PCA to reduce them to just 3 dimensions so they’re easier to interpret. This is done by

from sklearn.decomposition import PCA
# First grab the learned movie embeddings, analogous to the bias "model" above
get_movie_emb = Model(movie_in, m)
movie_emb = np.squeeze(get_movie_emb.predict(topMovies))
# Project the 50 factors down to 3 principal components
pca = PCA(n_components=3)
movie_pca = pca.fit(movie_emb.T).components_
# 1st factor
fac0 = movie_pca[0]
movie_comp = [(f, movie_names[movies[i]]) for f,i in zip(fac0, topMovies)]

In sorting the first component and looking at the top and bottom 10, we get

# Top 10
sorted(movie_comp, key=itemgetter(0), reverse=True)[:10]
  1. 0.051661585, 'Star Wars: Episode IV - A New Hope (1977)'
  2. 0.051533699, 'American Beauty (1999)'
  3. 0.050743323, "Schindler's List (1993)"
  4. 0.050438453, 'Star Wars: Episode V - The Empire Strikes Back (1980)'
  5. 0.050106976, 'Usual Suspects, The (1995)'
  6. 0.049335167, 'Shawshank Redemption, The (1994)'
  7. 0.048646957, 'Monty Python and the Holy Grail (1975)'
  8. 0.048265114, 'Fight Club (1999)'
  9. 0.048234072, 'Lord of the Rings: The Fellowship of the Ring, The (2001)'
  10. 0.048104681, 'American History X (1998)'

And for the bottom 10

# Bottom 10
sorted(movie_comp, key=itemgetter(0))[:10]
  1. -0.013544421, 'Anaconda (1997)'
  2. -0.01036181, 'RoboCop 3 (1993)'
  3. -0.01001894, 'Police Academy 3: Back in Training (1986)'
  4. -0.0095419278, 'Battlefield Earth (2000)'
  5. -0.0086108595, 'Blade: Trinity (2004)'
  6. -0.0084395828, 'Godzilla (1998)'
  7. -0.0081564896, 'Police Academy 5: Assignment: Miami Beach (1988)'
  8. -0.0080419034, 'Super Mario Bros. (1993)'
  9. -0.0078185201, 'Police Academy 6: City Under Siege (1989)'
  10. -0.0073015266, 'Howard the Duck (1986)'

From the above, it seems as if the first component measures how much the movie falls into the category of being a “classic.” We can do the same for the other 2 components

fac1 = movie_pca[1]
movie_comp = [(f, movie_names[movies[i]]) for f,i in zip(fac1, topMovies)]
# Top 10
sorted(movie_comp, key=itemgetter(0), reverse=True)[:10]
  1. 0.099159904, 'Armageddon (1998)'
  2. 0.096524552, 'Independence Day (a.k.a. ID4) (1996)'
  3. 0.076062717, 'Stargate (1994)'
  4. 0.073885009, 'Titanic (1997)'
  5. 0.072068073, 'Speed (1994)'
  6. 0.07041952, 'Waterworld (1995)'
  7. 0.06948819, 'Happy Gilmore (1996)'
  8. 0.069047861, 'Jurassic Park (1993)'
  9. 0.06774541, 'Rock, The (1996)'
  10. 0.067659229, 'Braveheart (1995)'

And for the bottom 10

# Bottom 10
sorted(movie_comp, key=itemgetter(0))[:10]
  1. -0.065151595, 'Brokeback Mountain (2005)'
  2. -0.063093834, 'Manhattan (1979)'
  3. -0.060578294, 'Clockwork Orange, A (1971)'
  4. -0.060415618, 'Apocalypse Now (1979)'
  5. -0.059806734, 'Annie Hall (1977)'
  6. -0.053494852, 'Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb (1964)'
  7. -0.053420749, 'Chinatown (1974)'
  8. -0.052724, 'City Lights (1931)'
  9. -0.051793721, 'Lost in Translation (2003)'
  10. -0.05177296, 'Taxi Driver (1976)'

These seem to indicate that the 2nd factor measures how much a movie is a “Hollywood Blockbuster.”

Finally, the third factor gives us

fac2 = movie_pca[2]
movie_comp = [(f, movie_names[movies[i]]) for f,i in zip(fac2, topMovies)]
# Top 10
sorted(movie_comp, key=itemgetter(0), reverse=True)[:10]
  1. 0.07112249, 'American Psycho (2000)'
  2. 0.066470303, 'Serenity (2005)'
  3. 0.062538557, 'Italian Job, The (2003)'
  4. 0.060326062, 'Harry Potter and the Order of the Phoenix (2007)'
  5. 0.05812693, 'Boondock Saints, The (2000)'
  6. 0.057720546, 'Taken (2008)'
  7. 0.055617765, 'Blind Side, The (2009)'
  8. 0.055248067, 'Tangled (2010)'
  9. 0.054995965, 'Harry Potter and the Chamber of Secrets (2002)'
  10. 0.054077402, 'Roman Holiday (1953)'

And for the bottom 10

# Bottom 10
sorted(movie_comp, key=itemgetter(0))[:10]
  1. -0.098564163, 'Silence of the Lambs, The (1991)'
  2. -0.092131585, 'Fargo (1996)'
  3. -0.084806047, 'Babe (1995)'
  4. -0.083482467, 'Jurassic Park (1993)'
  5. -0.082762003, 'Rob Roy (1995)'
  6. -0.082035542, 'Fugitive, The (1993)'
  7. -0.081005029, 'Lion King, The (1994)'
  8. -0.079656199, "Schindler's List (1993)"
  9. -0.079320841, 'Mrs. Doubtfire (1993)'
  10. -0.07347735, 'Wallace & Gromit: A Close Shave (1995)'

The third factor seems to be some linear combination of violence, sci-fi, and feel-good. We can visualize these factors on a plot to see how the movies cluster in space. While it would be interesting to perform KMeans on this dataset, we’re simply going to look at the spatial relations for some insight. We do this by plotting the top 50 movies against the 1st and 3rd PCA factors. It’s important to note that quite a bit of information is lost when projecting from 50 dimensions down to 3.

start, end = 0, 50
X = fac0[start:end]
Y = fac2[start:end]
plt.figure(figsize=(15,15))
plt.scatter(X, Y)
for i, x, y in zip(topMovies[start:end], X, Y):
    plt.text(x, y, movie_names[movies[i]], color=np.random.rand(3)*0.7, fontsize=14)
plt.show()

[Figure: scatter plot of the top 50 movies positioned by the 1st and 3rd PCA factors]

This shows some interesting characteristics of the movies. For example, Ace Ventura: Pet Detective and The Mask sit close together, which makes sense as Jim Carrey stars in both and they’re both comedies.

We can also see that Mission: Impossible, The Terminator, and Independence Day are spatially close as well; all are action-type movies.

Finally, we can also see that Lord of the Rings: The Two Towers, The Return of the King, and The Fellowship of the Ring, along with Star Wars: Episode V and Episode VI, are all close together as well, which makes sense as they’re all a similar type of action-adventure fiction.

Most likely there are other structural similarities among these movies as well, though they’re very hard to decipher in just 2 of the original 50 dimensions. However, as we can see, it is still possible to find some structure!
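As a quick check on how much the projection keeps, we can look at the explained variance of the pca object fit above

# Fraction of variance captured by each of the 3 principal components
print(pca.explained_variance_ratio_)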

Neural Network

Instead of explicitly specifying the architecture as we did in the previous section, we can simply create a neural network and let it learn how users and movies interact. First, we need to concatenate the user and movie embeddings into a single vector, which we can then feed into the neural network.

user_in, u = embedding_input("user_in", n_users, n_factors, 1e-4)
movie_in, m = embedding_input("movie_in", n_movies, n_factors, 1e-4)

Then, in building our model

x = merge([u, m], mode="concat")  # concatenate the two embeddings
x = Flatten()(x)
x = Dropout(0.3)(x)
x = Dense(70, activation="relu")(x)
x = Dropout(0.75)(x)
x = Dense(1)(x)                   # single output: the predicted rating
nn = Model([user_in, movie_in], x)
nn.compile(Adam(0.001), loss="mse")

which we can fit to the data with

nn.fit([trn.userId, trn.movieId], trn.rating, batch_size=64, nb_epoch=8,
      validation_data=([val.userId, val.movieId], val.rating))

Which, after 8 epochs, gives

80032/80032 [==============================] - 7s - loss: 0.8035 - val_loss: 0.8339

which is a loss of 0.8339, better than the roughly 0.9 state-of-the-art figure mentioned above. With another 8 epochs we can get this down to 0.7883.

Running a prediction on this model is very similar to the previous section, where we can simply use

nn.predict([np.array([3]), np.array([6])])

which gives a value of 4.5344, very similar to our previous model, though most likely more accurate, as this model’s validation mean squared error is substantially lower.