The Yhat Blog

machine learning, data science, engineering

What is Singular Value Decomposition?

by Tyler Manning-Dahan

Recommendation engines are all the rage. From Netflix to Amazon, all of the big guys have been pushing the envelope with research initiatives focused on making better recommendations for users. For years, most research appeared in academic papers or in books that neatly organized these papers by technique (e.g. collaborative filtering, content filtering, etc.) to make them easier to digest. There have actually been very few pure textbooks on the subject, given that it is a fairly new research area.

In 2016, Charu Aggarwal published Recommender Systems: The Textbook, a massively detailed walkthrough of recommendation systems from the basics all the way to where research is today. I highly recommend it to anyone interested in recommendation systems. Whether you are doing research or just want to gain some intuition, his explanations are fantastic!

In chapter 3 of his book, Aggarwal discusses model-based collaborative filtering, which includes several methods of modelling the classic user-item matrix to make recommendations. One focus of the chapter is on matrix factorization techniques that have become so popular in recent years. While introducing unconstrained matrix factorization, he remarks the following:

Much of the recommendation literature refers to unconstrained matrix factorization as singular value decomposition (SVD). Strictly speaking, this is technically incorrect; in SVD, the columns of $U$ and $V$ must be orthogonal. However, the use of the term “SVD” to refer to unconstrained matrix factorization is rather widespread in the recommendation literature, which causes some confusion to practitioners from outside the field.

Aggarwal - Section 3.6.4 of Recommender Systems (2016)

Before getting into more details about the inconsistency remarked by Aggarwal, let's go over what singular value decomposition (SVD) is and what plain old matrix factorization is.

Matrix factorizations all perform the same task but in different ways. They all take a matrix and break it down into some product of smaller matrices (its factors). It's very similar to how we did factoring in elementary school. We took a big number like $12$ and broke it down into its factors $\{(1,12),(2,6),(3,4)\}$, where each pair yields the number $12$ when multiplied together. Factorizing matrices follows the same idea, but since a matrix is inherently more complex than a number, there are many, many ways to perform this breakdown. Check out Wikipedia for all the different examples.

Visualization of the SVD of a two-dimensional, real shearing matrix M.

How you factor a matrix basically comes down to what constraints you put on the factors (the matrices that when multiplied together form the original). Do you want just $2$ factors? Do you want more? Do you want them to have particular characteristics like orthogonality? Do you want their eigenvectors to have specific properties?

One of the most basic ways to factor a matrix that comes up often in recommendation research is the so-called low-rank approximation, and it goes like this: if $A$ is an $m$ by $n$ matrix of rank $r$, then it can be expressed as: \begin{equation} A = P Q^T \end{equation} where matrix $P$ is of size $m$ by $r$ and $Q$ is of size $n$ by $r$. (In practice $r$ is often chosen smaller than the true rank, in which case $PQ^T$ only approximates $A$.) Practically, this is useful when $A$ contains huge amounts of data. When $r \ll \min\{m,n\}$, the matrices $P$ and $Q$ are much smaller than $A$: we only need to store the $m \cdot r$ entries of $P$ and the $n \cdot r$ entries of $Q$, which is far fewer than the $m \cdot n$ entries of $A$.
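The storage argument can be checked with a short NumPy sketch. The sizes and the random factors below are assumptions chosen purely for illustration. The last few lines also show that such a factorization is not unique: with no constraints on the factors, any invertible $r \times r$ matrix `M` yields another valid pair.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 200, 150, 5  # hypothetical sizes, chosen only for illustration

# Build a rank-r matrix from two thin factors.
P = rng.standard_normal((m, r))  # m x r
Q = rng.standard_normal((n, r))  # n x r
A = P @ Q.T                      # m x n

rank_A = np.linalg.matrix_rank(A)

# Number of entries to store: the full matrix versus its two factors.
entries_full = m * n              # 200 * 150 = 30000
entries_factored = m * r + n * r  # 1000 + 750 = 1750

# The factors are not unique: for any invertible r x r matrix M,
# (P M) and (Q M^{-T}) multiply back to the same A.
M = np.eye(r) + 0.1 * rng.standard_normal((r, r))
P2 = P @ M
Q2 = Q @ np.linalg.inv(M).T
same_product = np.allclose(P2 @ Q2.T, A)

print(rank_A, entries_full, entries_factored, same_product)
```

Because nothing pins down `P` and `Q` beyond their product, this kind of factorization is "unconstrained" in Aggarwal's sense.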

Now, SVD is just another flavour of matrix factorization and it has been around in math for a long time. The theorem states: if $A$ is any matrix of size $m$ by $n$ (square or rectangular) with rank $r$, then $A$ can be factored as: \begin{equation} A = U\Sigma V^T \end{equation} where matrix $U$ is of size $m \times m$, $V$ is of size $n \times n$, and their columns are orthonormal. $\Sigma$ is a rectangular diagonal matrix of size $m \times n$ whose diagonal entries $\sigma_i$ are the singular values of $A$; exactly $r$ of them are nonzero.
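As a small worked example (the matrix is an illustrative choice, not taken from Aggarwal's book), take $A = \begin{pmatrix} 3 & 0 \\ 4 & 5 \end{pmatrix}$. Here $A^T A = \begin{pmatrix} 25 & 20 \\ 20 & 25 \end{pmatrix}$ has eigenvalues $45$ and $5$, so the singular values are $\sigma_1 = \sqrt{45} = 3\sqrt{5}$ and $\sigma_2 = \sqrt{5}$, and one valid SVD is: \begin{equation} \begin{pmatrix} 3 & 0 \\ 4 & 5 \end{pmatrix} = \underbrace{\frac{1}{\sqrt{10}} \begin{pmatrix} 1 & -3 \\ 3 & 1 \end{pmatrix}}_{U} \underbrace{\begin{pmatrix} 3\sqrt{5} & 0 \\ 0 & \sqrt{5} \end{pmatrix}}_{\Sigma} \underbrace{\frac{1}{\sqrt{2}} \begin{pmatrix} 1 & 1 \\ -1 & 1 \end{pmatrix}}_{V^T} \end{equation} The two columns of $U$ have unit length and their dot product is $(1)(-3) + (3)(1) = 0$, and the same holds for the columns of $V$. These orthonormality constraints are exactly what the basic $PQ^T$ factorization does not impose.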

That definition is a mouthful and there is a lot we could talk about, including how to diagonalize $A$ and what all the properties really mean spatially, but I will save that for another day. For now, I just want to highlight that this is THE definition of SVD (check any linear algebra book or Wikipedia) and if someone says they factored a matrix by SVD, you should mentally envision that formula and those constraints.

What has happened in recommendation systems research is that most papers now refer to the more basic factorization as SVD.

The confusion may have started during the Netflix Prize competition, which pushed out a ton of research in a short span of time. One researcher who was pushing the SVD algorithm for recommendations was Simon Funk, who blogged about it as he went.

Popular papers by Arkadiusz Paterek referenced Funk's work and carried the mistake in identity by defining it as:

In the regularized SVD predictions for user $i$ and movie $j$ are made in the following way: \begin{equation} \hat y_{ij} = u^T_i v_j \end{equation} where $u_i$ and $v_j$ are $K$-dimensional vectors of parameters.

Paterek - Section 3.2 of Improving regularized singular value decomposition for collaborative filtering (2007)
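To make the contrast with the classic definition concrete, here is a minimal sketch of a Funk-style regularized factorization trained by stochastic gradient descent on the observed ratings only. It is not Funk's or Paterek's actual code: the function names `factorize_sgd` and `rmse`, the hyperparameters, and the toy ratings matrix are all assumptions for illustration.

```python
import numpy as np

def factorize_sgd(R, mask, K=2, lr=0.01, reg=0.02, epochs=500, seed=0):
    """Fit y_hat[i, j] = u_i . v_j on observed entries (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    U = 0.1 * rng.standard_normal((n_users, K))  # one K-vector per user
    V = 0.1 * rng.standard_normal((n_items, K))  # one K-vector per item
    observed = np.argwhere(mask)                 # (i, j) pairs with a rating
    for _ in range(epochs):
        rng.shuffle(observed)                    # shuffle the rating order
        for i, j in observed:
            err = R[i, j] - U[i] @ V[j]
            u_old = U[i].copy()
            # Gradient step on squared error plus L2 regularization.
            U[i] += lr * (err * V[j] - reg * U[i])
            V[j] += lr * (err * u_old - reg * V[j])
    return U, V

def rmse(R, mask, U, V):
    """Root mean squared error over observed entries only."""
    return np.sqrt(np.mean((R - U @ V.T)[mask] ** 2))

# Toy ratings (made up for this sketch); 0 means "not rated".
R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [1, 0, 0, 4],
              [0, 1, 5, 4]], dtype=float)
mask = R > 0

U0, V0 = factorize_sgd(R, mask, epochs=0)  # untrained starting point
U, V = factorize_sgd(R, mask)
rmse_before = rmse(R, mask, U0, V0)
rmse_after = rmse(R, mask, U, V)

print(rmse_before, rmse_after)
print(U.T @ U)  # Gram matrix of the user factors
```

Nothing in these updates forces the columns of `U` or `V` to be orthogonal, and there is no diagonal matrix of singular values anywhere, so the Gram matrix `U.T @ U` has no reason to come out diagonal. That is the gap between this model and the classic SVD.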

In the ground-breaking paper by Koren, Bell, and Volinsky that was published in 2009, the authors specify the basic factorization model identically to Paterek (albeit with a slightly different syntax):

[...] $q^T_i p_u$, captures the interaction between user $u$ and item $i$, the user's overall interest in the item's characteristics. This approximates user $u$'s rating of item $i$, which is denoted by $r_{ui}$, leading to the estimate \begin{equation} \hat{r}_{ui} = q^T_i p_u \end{equation}

And follows it up with this explanation:

Such a model is closely related to singular value decomposition (SVD), a well-established technique for identifying latent semantic factors in information retrieval. Applying SVD in the collaborative filtering domain requires factoring the user-item rating matrix.

Already you can see how SVD's identity was changing. Now, this isn't to blame anyone or point fingers. Researchers often use the same terms to define different things, which isn't usually a problem because the context is right in front of you to clear it up.

The problem arises when software developers make libraries out of these great papers and don't necessarily read the fine print. We get something like this SVD algorithm from the Surprise library (note: the creator of Surprise, Nicolas Hug, is a great guy and helped me with some of my work!). Like many others, this one is based on Funk's original take on SVD for recommendations.

Even the Java library for recommendations, librec, implements a regularized SVD according to Paterek's paper.

Compare this now to how SVD is implemented in SciPy, one of Python's math libraries for doing linear algebra. It is defined much like the classic math definition outlined above:

scipy.linalg.svd(a, full_matrices=True, compute_uv=True, overwrite_a=False, check_finite=True, lapack_driver='gesdd')

Singular Value Decomposition.

Factorizes the matrix a into two unitary matrices U and Vh, and a 1-D array s of singular values (real, non-negative) such that a == U*S*Vh, where S is a suitably shaped matrix of zeros with main diagonal s.
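A minimal sketch of calling that routine and checking the constraints from the classic definition follows. The input matrix is an illustrative choice; `scipy.linalg.diagsvd` is used to rebuild the rectangular $\Sigma$ from the 1-D array `s`.

```python
import numpy as np
from scipy import linalg

# A small rectangular (3 x 2) matrix, chosen only for illustration.
A = np.array([[3.0, 0.0],
              [4.0, 5.0],
              [0.0, 0.0]])

# With the default full_matrices=True, U is 3 x 3 and Vh is 2 x 2.
U, s, Vh = linalg.svd(A)

# s is a 1-D array; diagsvd places it on the diagonal of a 3 x 2 matrix.
S = linalg.diagsvd(s, *A.shape)

print(s)                                  # singular values, largest first
print(np.allclose(U @ S @ Vh, A))         # the factors reproduce A
print(np.allclose(U.T @ U, np.eye(3)))    # columns of U are orthonormal
print(np.allclose(Vh @ Vh.T, np.eye(2)))  # columns of V are orthonormal
```

All three checks should hold for any real input matrix, which is a quick way to tell whether a library's "SVD" is the classic decomposition or an unconstrained factorization.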

In the end, this comes back to what Aggarwal pointed out. In recommendation systems research, SVD has been defined differently compared to the classic mathematical definition that people may have learned in their Linear Algebra courses.

If you are implementing some kind of recommendations or factorizing matrices, be sure to double-check what SVD you are using so your results are consistent!
