Recommender systems are one of the most prominent examples of machine learning

in the wild today. They determine what shows up in your Facebook news feed, in

what order products appear on Amazon, what videos are suggested in your Netflix

queue, as well as countless other examples. But what are recommender systems,

and how do they work? This post is the first in a series exploring some

common techniques for building recommender systems as well as their implementation.

## What is a recommender system?

A recommender system is an information filtering model that ranks or scores items

for users. There are generally two types of ranking methods:

- **Content-based filtering**, in which recommended items are based on item-to-item similarity and the user's explicit preferences; and
- **Collaborative filtering**, in which items are recommended to users based on the preferences of other users with similar transaction histories and characteristics.

The information used in collaborative filtering can be explicit, where users

provide ratings for each item, or implicit, where user preferences have to be

extracted from user behavior (purchases, views, etc.). The most successful recommender systems use hybrid approaches that combine both filtering methods.

## The MovieLens Datasets

To make this discussion more concrete, let’s focus on building recommender

systems using a specific example. GroupLens, a research group at the University

of Minnesota, has generously made available the MovieLens dataset.

This dataset consists of approximately 20 million user ratings applied to

27,000 movies by 138,000 users. In addition, the movies include genre and

date information. We’ll use this dataset to build our model.

## Simple Content-based Filtering

Let’s build a simple recommender system that uses content-based filtering (i.e.

item similarity) to recommend movies for us to watch. First, load in the movie

dataset from MovieLens and multi-hot encode the genre fields:

```
import pandas as pd
import numpy as np
movies = pd.read_csv("movies.csv")
# multi-hot encode the pipe-separated genres into one column per genre
movies = movies.join(movies.genres.str.get_dummies("|"))
```

The `genres` feature consists of one or more pipe ("|") separated genre tags. The last line above adds a column for each possible genre and puts a 1 in that entry if the genre tag is present, or a 0 otherwise.
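To see what this encoding does, here is a minimal sketch on a made-up two-row frame (the titles and tags are purely illustrative):

```
toy = pd.DataFrame({"title": ["A", "B"],
                    "genres": ["Comedy|Drama", "Drama"]})
toy.join(toy.genres.str.get_dummies("|"))
#   title        genres  Comedy  Drama
# 0     A  Comedy|Drama       1      1
# 1     B         Drama       0      1
```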

Let’s generate some recommendations based on item similarity using these tags.

A very common similarity measure for categorical data (such as tags) is *cosine
similarity*. For any two items $i$ and $j$, the cosine similarity of $i$ and $j$

is simply the cosine of the angle between $i$ and $j$ where $i$ and $j$ are

interpreted as vectors in feature space. Recall that the cosine is obtained from

the inner product of these vectors:

$$
\cos \theta = \frac{i \cdot j}{\|i\| \, \|j\|}
$$

As a concrete example, consider the films $i :=$ Toy Story (genre tags “Adventure”, “Animation”, “Children”, “Comedy”, and “Fantasy”) and $j :=$ Jumanji (genre tags “Adventure”, “Children”, and “Fantasy”). The dot product $i \cdot j$ is 3 (the two films have three tags in common), $\|i\| = \sqrt{5}$, and $\|j\| = \sqrt{3}$, so the cosine similarity between these two films is

$$
\cos \theta = \frac{3}{\sqrt{15}} \approx 0.775
$$
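As a quick sanity check, we can reproduce this number with numpy using hand-built binary genre vectors (ordered Adventure, Animation, Children, Comedy, Fantasy):

```
import numpy as np

# binary genre vectors for Toy Story and Jumanji
toy_story = np.array([1, 1, 1, 1, 1])
jumanji   = np.array([1, 0, 1, 0, 1])

cos = toy_story @ jumanji / (np.linalg.norm(toy_story) * np.linalg.norm(jumanji))
print(cos)  # 0.7745966692414834
```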

We can compute the cosine similarity for all of the items in our dataset:

```
from sklearn.metrics.pairwise import cosine_similarity
# compute pairwise cosine similarity over the genre columns
# (columns 0-2 hold movieId, title, and the raw genres string)
cos_sim = cosine_similarity(movies.iloc[:, 3:])
```

The very first film in the dataset is Toy Story. Let's find the films most similar to it:

```
# get the top 5 most similar films (Toy Story itself will rank first)
toystory_top5 = np.argsort(cos_sim[0])[-5:][::-1]
# array([   0, 8219, 3568, 9430, 3000])
movies.iloc[toystory_top5]
```

Title | Genres
---|---
Toy Story (1995) | Adventure,Animation,Children,Comedy,Fantasy
Turbo (2013) | Adventure,Animation,Children,Comedy,Fantasy
Monsters, Inc. (2001) | Adventure,Animation,Children,Comedy,Fantasy
Moana (2016) | Adventure,Animation,Children,Comedy,Fantasy
Emperor's New Groove, The (2000) | Adventure,Animation,Children,Comedy,Fantasy

The first five films all have exactly the same genre tags as Toy Story, and

hence a cosine similarity of 1. In fact, for the sample data used here, there

are thirteen films with similarity 1; the most similar film without identical

tags is 2006’s “The Ant Bully”, which has the additional genre tag “IMAX”.
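If you'd like to verify the tie count on your own copy of the data, a quick check against the `cos_sim` matrix from above does the trick:

```
# count films whose genre tags match Toy Story's exactly
# (the count includes Toy Story itself)
int(np.isclose(cos_sim[0], 1.0).sum())
```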

## Simple Collaborative Filtering

Collaborative filtering recommends items based on what similar users liked.

Fortunately, in the MovieLens dataset, we have a wealth of user preference

information in the form of movie ratings: each user assigns one or more films

numeric ratings between 0.5 and 5 stars indicating how much they enjoyed the film.

We can view the problem of recommending items to the user as a *prediction*

task: given the user’s ratings of other films, what is their likely rating of

the film in question?

One simple way to do this is to assign a similarity-weighted rating to each item

using other users’ ratings:

$$
\hat{r}_{u,i} = \frac{\sum_{v \neq u} s(u,v)\, r_{v,i}}{\sum_{v \neq u} s(u,v)}
$$

where $\hat{r}_{u,i}$ is the predicted rating of item $i$ by user $u$, $s(u,v)$ is a measure of similarity between users $u$ and $v$, and $r_{v,i}$ is the known rating of item $i$ by user $v$.
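To make the formula concrete, here is a tiny sketch with made-up numbers: two neighbors of user $u$ with similarities 0.9 and 0.3 who rated item $i$ at 4.0 and 2.0 stars, respectively:

```
import numpy as np

# hypothetical similarities s(u,v) and neighbor ratings r_{v,i}
sims    = np.array([0.9, 0.3])
ratings = np.array([4.0, 2.0])

r_hat = (sims * ratings).sum() / sims.sum()
print(r_hat)  # 3.5 -- pulled toward the more similar user's rating
```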

For our user similarity measurement, we’ll look at users’ ratings of movies. Users

with similar ratings will be considered similar. To work with this rating data,

an important first step is to normalize our ratings. We’ll do this in three steps:

first, we’ll subtract the overall mean rating (across all films and users) so

that our adjusted ratings are centered at 0. Next, we’ll do the same thing for

each film, to account for differences between films' mean ratings. Finally, we'll subtract off the mean rating for each user; this accounts for individual variation (e.g. one user giving consistently higher ratings than another).

Mathematically, our adjusted rating $\tilde{r}_{u,i}$ is

$$
\tilde{r}_{u,i} := r_{u,i} - \bar{r} - \bar{r}_i - \bar{r}_u
$$

where $r_{u,i}$ is the base rating, $\bar{r}$ is the overall mean rating, $\bar{r}_i$ is the mean rating for item $i$ (after subtracting the overall mean), and $\bar{r}_u$ is the mean rating for user $u$ (after adjusting for the overall mean rating and the item mean ratings). For convenience, I'll refer to the adjusted rating $\tilde{r}_{u,i}$ as the *preference* of user $u$ for item $i$.

Let’s load in the ratings data and compute the adjusted ratings:

```
ratings = pd.read_csv("ratings.csv")
mean_rating = ratings['rating'].mean()  # overall mean rating
pref_matrix = ratings[['userId', 'movieId', 'rating']].pivot(
    index='userId', columns='movieId', values='rating')
pref_matrix = pref_matrix - mean_rating  # adjust by overall mean
item_mean_rating = pref_matrix.mean(axis=0)
pref_matrix = pref_matrix - item_mean_rating  # adjust by item means (aligns on columns)
user_mean_rating = pref_matrix.mean(axis=1)
# subtract user means row-wise; a bare `-` would align on columns, not on users
pref_matrix = pref_matrix.sub(user_mean_rating, axis=0)
```

At this point we can easily establish a reasonable baseline estimate for a

given user’s rating of films they haven’t seen:

```
baseline = (pref_matrix.fillna(0)
            .add(user_mean_rating, axis=0)  # add user means back row-wise
            + item_mean_rating + mean_rating)
```
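For instance, to read off the baseline estimate for a single user/movie pair (the IDs here are just illustrative):

```
baseline.loc[1, 2]  # baseline estimate of userId 1's rating of movieId 2
```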

We can compute each user's distance to a particular user (in this case, the first user, row 0) as follows:

```
mat = pref_matrix.values
k = 0  # compare against the first user (row 0)
# squared Euclidean distance from every user to user k, ignoring missing ratings
np.nansum((mat - mat[k,:])**2, axis=1).reshape(-1,1)
```

It turns out that the nearest user is user 12 (with distance 0):

```
# drop user 0 itself, then find the nearest other user
np.nansum((mat - mat[0,:])**2, axis=1)[1:].argmin()  # returns 11, i.e. row 12
# check it: the squared distance between users 12 and 0 is indeed 0
np.nansum((mat[12] - mat[0])**2)  # returns 0.0
```

We find two films that user 12 has seen that user 0 has not:

```
# films user 12 has rated that user 0 has not
np.where(~np.isnan(mat[12]) & np.isnan(mat[0]))
# returns (array([304, 596]),)
mat[12][[304, 596]]
# returns array([-2.13265214, -0.89476547])
```

Unfortunately, user 12's preferences for both of the films that user 0 hasn't seen are negative, so neither is a promising recommendation! To generate useful recommendations, we should continue the computation and take all of the nearby users into account.
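Here's a minimal sketch of what that continuation might look like (one design among many), reusing the `mat` matrix from above: take the $k$ nearest users and score every film user 0 hasn't seen with the similarity-weighted average from earlier, using inverse distance as a stand-in for $s(u,v)$:

```
k = 10  # number of neighbors to use (an arbitrary choice for illustration)

# squared distances from user 0 to every other user
dists = np.nansum((mat - mat[0,:])**2, axis=1)
dists[0] = np.inf                    # exclude user 0 itself
neighbors = np.argsort(dists)[:k]    # indices of the k nearest users

# turn distances into similarity weights; the epsilon guards against
# division by zero when a neighbor is at distance 0
weights = 1.0 / (dists[neighbors] + 1e-9)

# similarity-weighted average preference for each film, ignoring missing ratings
prefs = mat[neighbors]
rated = ~np.isnan(prefs)
weighted_sum = (np.where(rated, prefs, 0) * weights[:, None]).sum(axis=0)
weight_sum = (rated * weights[:, None]).sum(axis=0)
preds = weighted_sum / np.where(weight_sum > 0, weight_sum, np.nan)

# keep only films user 0 hasn't rated, then take the top 5
scores = np.where(np.isnan(mat[0]), preds, -np.inf)
scores[np.isnan(scores)] = -np.inf   # drop films none of the neighbors rated
top5 = np.argsort(scores)[-5:][::-1]
```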

## Concluding Remarks

The methods used in this post are *neighborhood-based*, and we’ve just seen above

a potential pitfall when generating recommendations based on neighbors: neighbors

may not actually recommend any items the user in question hasn’t already seen.

Because of the need to compute pairwise distances, neighborhood-based methods

also tend to scale poorly as the number of users increases.

In part 2 of this series, we’ll take a look at another approach for building

recommender systems, this time using *latent factor* methods. Latent factor

models avoid some of the pitfalls of the neighborhood-based methods described

here—but as we’ll see, they come with some challenges of their own!

*This post originally appeared on KDNuggets.*