The Yhat Blog

machine learning, data science, engineering

Recommendation System in R

by yhat

Recommender systems are used to predict the best products to offer to customers. These babies have become extremely popular in virtually every industry, helping customers find products they'll like. Most people are familiar with the idea, and nearly everyone is exposed to some form of personalized offer or recommendation each day (Google search ads being among the biggest sources).

Building recommendation systems is part science, part art, and many have become extremely sophisticated. Such a system might seem daunting to the uninitiated, but it's actually fairly straightforward to get started if you're using the right tools.

This is a post about building recommender systems in R.

UPDATE: We used the beer / product recommender for a talk at PyData Boston in July.
IPython notebook here:

Beer Dataset

"Respect Beer." -

For this example, we'll use data from Beer Advocate, a community of beer enthusiasts and industry professionals dedicated to supporting and promoting beer. The data is made available to us via Stanford's web data library. It consists of ~1.5 million reviews posted on BeerAdvocate from 1999 to 2011.

Each record is composed of a beer's name, brewery, and metadata like style and ABV, along with ratings provided by reviewers. Beers are graded on appearance, aroma, palate, and taste, and users also provide an "overall" grade. All ratings are on a scale from 1 to 5, with 5 being the best.

In addition to these numerical ratings, users are required to write a short paragraph of 250 to 5,000 characters describing their overall impressions. While the text does provide some excellent opportunities for analysis, we're going to focus only on the ratings for this post. You can read more about their rating system here.

Formatting the Data

This part always takes longer than you'd like, but luckily the beer dataset is pretty clean.

Not that many nulls, and the text fields are free of strange byte characters (those always throw me off). One thing that's a little different is that the data is laid out row-wise instead of column-wise.

Records are delimited by newlines and have one key/value pair per line. I wrote a short Python script to handle parsing which leaves us with a nicely formatted .csv file.
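The post's parsing script was written in Python, but the same idea is easy to sketch in R. This is a hedged, minimal version that assumes blank lines separate records and each line is a `key: value` pair (the field names in the test data are illustrative, not the dataset's exact schema):

```r
# Minimal sketch: turn newline-delimited key/value records into a data frame.
# Assumes records are separated by blank lines and each line looks like
# "beer/name: Fat Tire" (field names here are illustrative).
parse_records <- function(lines) {
  # group lines into records, splitting at blank lines
  breaks <- cumsum(lines == "")
  records <- split(lines[lines != ""], breaks[lines != ""])
  rows <- lapply(records, function(rec) {
    keys <- sub(":.*$", "", rec)                # text before the first colon
    vals <- sub("^[^:]+:\\s*", "", rec)         # text after the first colon
    as.list(setNames(vals, keys))
  })
  # one row per record; R converts "beer/name" to the column name "beer.name"
  do.call(rbind, lapply(rows, as.data.frame, stringsAsFactors = FALSE))
}
```

From there, `write.csv` gets you the same nicely formatted .csv file.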

Since I'm working with a lot of data, I decided to throw it into a database.

I'm working with Postgres, but any relational database will do the trick.

Getting the Breweries

One unfortunate part about the dataset is that it only includes a brewerid and no lookup table for the ids. These ids correspond to brewery profile pages on the site. For example, Sierra Nevada Brewing Co. has a brewerid of 140, and its page on BeerAdvocate is /profile/140.

In any case, what we really need is the brewery name associated with each id, which means doing a little web scraping. I really didn't want to get into installing any Postgres programming clients (psycopg2 for Python), so I wrote a short bash script to grab the brewery ids.

It's far from ideal, but it's short, simple, and it works. You can skip this part if you prefer and just download the data here.
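If you'd rather stay in R than bash, the same lookup can be sketched there. The URL pattern and the page `<title>` convention below are assumptions based on the /profile/140 example above; the site's actual markup may differ:

```r
# Hedged sketch: fetch a brewery profile page and pull the name out of its
# <title> tag. The URL pattern is an assumption from the /profile/140 example.
fetch_brewery_page <- function(brewer_id) {
  url <- paste0("http://beeradvocate.com/beer/profile/", brewer_id, "/")
  paste(readLines(url, warn = FALSE), collapse = "\n")
}

# Pure extraction step, kept separate so it's easy to test: grab the <title>
# text and strip any trailing "| site name" suffix.
extract_brewery_name <- function(html) {
  title <- regmatches(html, regexpr("<title>[^<]*</title>", html))
  sub("\\s*\\|.*$", "", gsub("</?title>", "", title))
}
```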

Loading it into R

We've got everything in a database. Nice!

We're going to use the excellent RPostgreSQL driver which makes it super easy to query Postgres from R. You'll notice we've got a little sub-query action going on here. All our sub-query is doing is grabbing all beers with 500+ reviews.
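The query step looks roughly like this. The connection parameters and the table/column names (`beer_reviews`, `beer_id`) are assumptions about my local setup, not part of the dataset itself:

```r
# Hedged sketch of the query; dbname, user, and table/column names are
# assumptions about the local database setup.
library(RPostgreSQL)

drv <- dbDriver("PostgreSQL")
con <- dbConnect(drv, dbname = "beers", user = "postgres")

# the sub-query keeps only beers with 500 or more reviews
beer.reviews <- dbGetQuery(con, "
  SELECT *
  FROM beer_reviews
  WHERE beer_id IN (
    SELECT beer_id
    FROM beer_reviews
    GROUP BY beer_id
    HAVING COUNT(*) >= 500
  );
")
dbDisconnect(con)
```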

Let's take a peek at the data using the head command. As you can see, we've got a few Colorado Kool-Aids at the top of the heap (not surprising, it's a pretty popular beer).

You can see we've got the id, name, and brewery of each beer and the associated review data provided by a given user as denoted by the review_profilename column.

Finding Similarities

The goal for our system will be for a user to provide us with a beer that they know and love, and for us to recommend a new beer which they might like. To accomplish this, we're going to use collaborative filtering: we'll compare two beers using the ratings submitted by their common reviewers. When users rate two beers similarly, we'll consider those two beers to be more similar to one another.

We'll need a function which takes two beers and returns their mutual reviewers (or sameset). To do this, we'll use the intersect function in R which finds common elements between two lists or vectors.

I wrote two functions: common_reviewers_by_id to extract the sameset given two beer_ids, and common_reviewers_by_name to extract the samesets given two beer_names. For programming purposes it's easier to use common_reviewers_by_id, but for testing and spot checking, common_reviewers_by_name is handy.
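A minimal sketch of the two helpers, assuming the query results live in a data frame with columns `beer_id`, `beer_name`, and `review_profilename` (the column names are assumptions about the schema):

```r
# Mutual reviewers ("sameset") for two beers, looked up by id.
common_reviewers_by_id <- function(df, beer1, beer2) {
  reviewers1 <- df$review_profilename[df$beer_id == beer1]
  reviewers2 <- df$review_profilename[df$beer_id == beer2]
  intersect(reviewers1, reviewers2)
}

# Same thing by name -- handy for spot checking.
common_reviewers_by_name <- function(df, name1, name2) {
  id1 <- unique(df$beer_id[df$beer_name == name1])
  id2 <- unique(df$beer_id[df$beer_name == name2])
  common_reviewers_by_id(df, id1, id2)
}
```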

Next we need a function to extract features for a given beer. Features, in this case, are the 1 to 5 numerical ratings provided by users as part of each beer's review.

Two things probably stick out in this function. (1) We're sorting the data by the reviewer's username. This is so that when we extract features for, say, Coors Light and Founders Double Trouble, the reviews at indices 0, 1, 2, ..., N correspond with reviews made by the same users.

(2) We're de-duplicating the reviews based on profile name. There are a few instances of users reviewing the same beer twice. Since we want the review data across beers to be aligned, we're just going to throw out any instances of multiple reviews by a user for the same beer.
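Both points show up in a short sketch of the extractor (again, the column names are assumptions about the schema):

```r
# Hedged sketch: pull the rating columns for one beer, de-duplicated and
# sorted so rows align across beers for the same reviewers.
get_review_features <- function(df, beer_id) {
  features <- c("review_profilename", "review_overall", "review_aroma",
                "review_palate", "review_taste")
  beer <- df[df$beer_id == beer_id, features]
  # (2) drop any repeat reviews by the same user for this beer
  beer <- beer[!duplicated(beer$review_profilename), ]
  # (1) sort by username so row i is the same reviewer for any beer
  beer[order(beer$review_profilename), ]
}
```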

Given two beers, we look at how similarly their common reviewers rated each one on the 1 to 5 scale.

To give you a visual, take a look at the charts below.

Users who like Fat Tire tended to not like Michelob Ultra as much.

The x-y coordinates correspond with how users rated each of the two beers. For example, a person who rated Fat Tire a 4.5 overall and Michelob Ultra a 2.5 overall appears as a point at (4.5, 2.5) in the top-left quadrant of the first graphic above. The size of each dot corresponds to the number of reviewers that wound up in a given bucket.

Users tend to rate Fat Tire higher than Michelob Ultra, as illustrated by the majority of points found below the center line.

However, when we compare Fat Tire to Dale's Pale Ale, we get a different story. We see that reviewers tended to rate both more or less consistently. Points are closer to the center line than those found in the Fat Tire-Michelob comparison. Intuitively, this suggests that it would be better to recommend Dale's Pale Ale to someone who likes Fat Tire than to someone who likes Michelob Ultra.

Quantifying Our Beliefs

I don't need a statistical model to tell me that someone who likes Fat Tire is probably going to like Dale's Pale Ale more than Michelob Ultra. But what about picking between Dale's Pale Ale and Sierra Nevada Pale Ale? Things get a little more complicated. For this reason (and because we don't want to manually select between each beer pair), we're going to write a distance function that will quantify similarity.

For our similarity metric we're going to use a weighted average of the correlation of each rating. In other words, for each two-beer pair we calculate the correlation of review_overall, review_aroma, review_palate, and review_taste separately. Then we take a weighted average of the results to consolidate them into one number.

We're going to give review_overall a weight of 2 and the remaining ratings a weight of 1 each. This gives review_overall 40% of the score (NOTE: this is totally arbitrary; you can use whatever weighting function you want. In my experience, the simplest stuff often works best).
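Put together, the distance function is just a few lines. The sketch below assumes `f1` and `f2` are the aligned feature frames produced by the extractor above, one row per mutual reviewer:

```r
# Weighted average of per-metric correlations; review_overall counts double.
calc_similarity <- function(f1, f2) {
  metrics <- c("review_overall", "review_aroma", "review_palate", "review_taste")
  weights <- c(2, 1, 1, 1)  # review_overall gets 2 / (2+1+1+1) = 40%
  cors <- sapply(metrics, function(m) cor(f1[[m]], f2[[m]]))
  sum(weights * cors) / sum(weights)
}
```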

Computing Similarity Across All 2-Beer-Pairs

To keep things simple, we're only going to compare the 20 most commonly reviewed beers in the example code. This will give us enough data to make sure everything is working as expected, but it's still a small enough sample size that it won't take too long to compute.

The first thing we do is define the 20 beers we want to use. Then we use expand.grid to create all of the combinations between the beers. Finally we remove any self-to-self comparisons (if you like Dale's Pale Ale, it won't help you very much if we recommend Dale's Pale Ale). We're then going to use ddply to do a map/reduce style calculation on the data. Note that it's possible to parallelize ddply. Although we're not doing it here, in an upcoming post I'll show you how to run ddply in parallel using EC2.
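The pairing step can be sketched in base R. The beer names below are stand-ins for the 20 most-reviewed beers, and the ddply pass over the pairs is where the similarity function gets applied:

```r
# Stand-ins for the 20 most commonly reviewed beers.
top.beers <- c("Fat Tire Amber Ale", "Dale's Pale Ale", "Michelob Ultra")

# Every ordered combination of two beers...
beer.pairs <- expand.grid(beer1 = top.beers, beer2 = top.beers,
                          stringsAsFactors = FALSE)
# ...minus the self-to-self comparisons.
beer.pairs <- beer.pairs[beer.pairs$beer1 != beer.pairs$beer2, ]
```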

I wrote a short helper function find_similar_beers that accepts a beer you like and optionally a number of suggested beers and a desired style, and returns the most similar beers in a nice format.
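A hedged sketch of that helper, assuming the pairwise results have been collected into a data frame with columns `beer1`, `beer2`, `style`, and `similarity` (those names are assumptions):

```r
# Return the n most similar beers to my_beer, optionally filtered by style.
find_similar_beers <- function(similarities, my_beer, n = 5, style = NULL) {
  results <- similarities[similarities$beer1 == my_beer, ]
  if (!is.null(style)) {
    results <- results[results$style == style, ]
  }
  results <- results[order(-results$similarity), ]  # most similar first
  head(results[, c("beer2", "style", "similarity")], n)
}
```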

Deploying to Yhat

Deploying this particular model was really easy. I just wrapped my find_similar_beers function in the yhat.predict function, added my apikey, and that was it. I didn't even need to use the yhat.require or yhat.transform functions.

Getting Your Recommendations

To make recommendations on the web, I wrote a quick app with Heroku and Flask that consumes the Yhat API. You can see some of that JavaScript below, or you can check out the standalone app here.

Beer Recommender

Final Thoughts

A great resource for building recommender systems is Programming Collective Intelligence by Toby Segaran. The book is a few years old, but it's a phenomenal introduction to some of the basics in machine learning. Chapter 2 gives a great overview of recommendation systems and how you can use them. Another good read is Machine Learning for Hackers by Drew Conway and John Myles White. Check out chapter 10 for recommender systems.


Yhat (pronounced Y-hat) provides data science solutions that let data scientists deploy and integrate predictive models into applications without IT or custom coding.