# Understanding matrix factorization for recommendation (part 1) - preliminary insights on PCA

- 11 mins**Foreword**: this is the first part of a 4 parts series. Here are parts 2, 3 and 4.
This series is an extended version of a talk I
gave at PyParis 17.

About ten years ago, Netflix launched the Netflix Prize: an open contest where the goal was to design state-of-the-art algorithms for predicting movie ratings. During 3 years, research teams developed many different prediction algorithms, among which matrix factorization techniques stood out by their efficiency.

**The goal of this series of posts is twofold**:

- Give some insights on how matrix factorization
**models**the ratings. To this end, we will illustrate how PCA and SVD work, using concrete examples. You may have already read explanations like*users and items are represented as vectors in a latent factors vector space, and ratings are defined as a dot product between two vectors*. If it makes no sense to you, I hope that you will understand what it means by the end of this article. - Derive and implement an algorithm for predicting ratings, based on matrix factorization. In its simplest form, this algorithm fits in 10 lines of Python. We will use this algorithm and evaluate its performances on real datasets.

I tried to keep the math level of the article as accessible as possible, but without trying to over-simplify things, and avoiding dull statements. My hope is that this article is accessible to ML beginners, while still being insightful to the more experienced.

**This article is divided into 4 parts**: in this first part, we will
clearly define the problem we plan to address, and provide some insights about
PCA. In the second part, we will
review SVD and see how it models the ratings. In the third part, we will
see how to apply our knowledge of SVD to the rating prediction task, and derive
an implementation of a matrix-factorization-based algorithm. In the last
part, we will implement a matrix
factorization algorithm in Python using the Surprise
library.

# The problem

The problem we propose to address here is that of **rating prediction**. The
data we have is a rating history: ratings of users for items in the interval
$[1, 5]$. We can put all this data into a sparse matrix called $R$:

Each row of the matrix corresponds to a given user, and each column corresponds
to a given item. For instance here, Alice has rated the first item with a
rating of $1$, and Charlie has rated the third item with a rating of $4$. In
our case, we will consider that the items are **movies** and we will use the
terms *item* and *movie* interchangeably.

The matrix $R$ is sparse (more than 99% of the entries are missing), and **our
goal is to predict the missing entries**, i.e. predict the
$\color{#e74c3c}{?}$.

There are many different techniques for rating prediction, and in our case we
will **factorize** the matrix $R$. This matrix factorization is fundamentally
linked to **SVD**, which stands for Singular Value Decomposition. SVD is one
of the highlights of linear algebra. It’s a beautiful result. When people tell
you that math sucks, show them what SVD can do.

The aim of this article is to explain how SVD can be used for rating prediction purposes. But before we can dive into SVD in the second part, we need to review what PCA is. PCA is only slightly less awesome than SVD, but it is still really cool.

# A little bit of PCA

Let’s forget the recommendation problem for 2 minutes. We’ll play around with the Olivetti dataset. It’s a set of greyscale images of faces from 40 people, making up a total of 400 images. Here are the first 10 people:

Friendly, right? Well, just wait a little bit…

Each image size is 64 x 64 pixels. We will flatten each of these images (we thus get 400 vectors, each with 64 x 64 = 4096 elements). We can represent our dataset in a 400 x 4096 matrix:

**PCA, which stands for Principal Component Analysis, is an algorithm that will
reveal 400 of these guys**:

Creepy, right ;)?

We call these guys the **principal components** (hence the name of the
technique), and when they represent faces such as here we call them the
**eigenfaces**. Some really cool stuff can be done with eigenfaces such as
face recognition, or optimizing
your tinder
matches!
The reason why they’re called *eigenfaces* is because they are in fact the
eigenvectors of the covariance matrix of $X$ (but we will not detail this, see
the references if you want to
dive further into it).

We obtain here 400 principal components because the original matrix $X$ has 400 rows (or more precisely, because the rank of $X$ is 400). As you may have guessed, each of the principal component is in fact a vector that has the same dimension as the original faces, i.e. it has 64 x 64 = 4096 pixels.

As far as we’re concerned, we will call these guys the **creepy guys**. Now,
one amazing thing about them is that **they can build back all of the original
faces.** Take a look at this (these are animated gifs, about 10s long):

Here is what’s going on. Each of the 400 original faces (i.e. each of the 400 original rows of the matrix) can be expressed as a (linear) combination of the creepy guys. That is, we can express the first original face (i.e. its pixel values) as a little bit of the first creepy guy, plus a little bit of the second creepy guy, plus a little bit of third, etc. until the last creepy guy. The same goes for all of the other original faces: they can all be expressed as a little bit of each creepy guy. Mathematically, we write it this way:

The gifs you saw above are the very translation of these math equations: the first frame of a gif is the contribution of the first creepy guy, the second frame is the contribution of the first two creepy guys, etc. until the last creepy guy.

If you want to play around a bit with the creepy guys and reproduce the cool gifs, I made a small notebook for you.

Note: there is nothing really special about the fact that the creepy guys can express all the original faces. We could have randomly chosen 400 independent vectors of 64 x 64 = 4096 pixels and still be able to reconstruct any face. What makes the creepy guys from PCA special is a very important result about dimensionality reduction, which we will discuss in part 3.

#### Latent factors

We’ve actually been kind of harsh towards the creepy guys. They’re not creepy,
they’re **typical**. The goal of PCA is to reveal typical vectors: **each of
the creepy/typical guy represents one specific aspect underlying the data**. In
an ideal world, the first typical guy would represent (e.g.) a *typical elder
person*, the second typical guy would represent a *typical glasses wearer*, and
some other typical guys would represent concepts such as *smiley*, *sad
looking*, *big nose*, stuff like that. And with these concepts, we could define
a face as more or less *elder*, more or less *glassy*, more or less *smiling*,
etc. In practice, the concepts that PCA reveals are really not that clear:
there is no clear semantic that we could associate with any of the
creepy/typical guys that we obtained here. But the important fact remains:
**each of the typical guys captures a specific aspect of the data**. We call
these aspects the **latent factors** (latent, because they were there all the
time, we just needed PCA to reveal them). Using barbaric terms, we say that
each principal component (the creepy/typical guys) captures a specific latent
factor.

Now, this is all good and fun, but we’re interested in matrix factorization for
recommendation purposes, right? So where is our matrix factorization, and what
does it have to do with recommendation? PCA is actually a plug-and-play method:
it works for any matrix. If your matrix contains images, it will reveal some
typical images that can build back all of your initial images, such as here.
**If your matrix contains potatoes, PCA will reveal some typical potatoes that
can build back all of your original potatoes**. If your matrix contains
ratings, well… Here we come.

# PCA on a (dense) rating matrix

Until stated otherwise, we will consider for now that our rating matrix $R$ is completely dense, i.e. there are no missing entries. All the ratings are known. This is of course not the case in real recommendation problems, but bare with me.

#### PCA on the users

Here is our rating matrix, where rows are users and columns are movies:

Looks familiar? Instead of having faces in the rows represented by pixel
values, we now have users represented by their ratings. Just like PCA gave us
some typical guys before, it will now give us some **typical users**, or rather
some **typical raters**.

Here again, in an ideal world, the concepts associated with the typical users
would have a clear semantic meaning: we would obtain a typical *action movie
fan*, a typical *romance movie fan*, a typical *comedy fan*, etc. In practice,
the semantic behind the typical users is not clearly defined, but for the sake
of simplicity we will assume that they are (it doesn’t change anything, this is
just for intuition/explanation purposes).

So here we are: each of our initial users (Alice, Bob…) can be expressed as a combination of the typical users. For instance, Alice could be defined as a little bit of an action fan, a little bit of a comedy fan, a lot of a romance fan, etc. As for Bob, he could be more keen on action movies:

And the same goes for all of the users, you get the idea. (In practice the coefficients are not necessarily percentages, but it’s convenient for us to think of it this way).

#### PCA on the movies

What would happen if we transposed our rating matrix? Instead of having users in the rows, we would now have movies, defined as their ratings:

In this case, PCA will not reveal typical faces nor typical users, but of
course **typical movies**. And here again, we will associate a semantic meaning
behind each of the typical movies, and these typical movies can build back all
of our original movies:

And the same goes for all the other movies.

We are now ready to dive into SVD in the next part.