The Yhat Blog

machine learning, data science, engineering

Python Multi-armed Bandits (and Beer!)

by Eric Chiang |

There are many ways to evaluate different strategies for solving different prediction tasks. In our last post, for example, we discussed calibration and discrimination, two measurements which assess the strength of a probabilistic prediction. Measurements like accuracy, error, and recall among others are useful when considering whether random forest "works better" than support vector machines on a problem set. Common sense tells us that knowing which analytical strategy “does the best” is important, as it will impact the quality of our decisions downstream. The trick, therefore, is in selecting the right measurement, a task which isn't always obvious.

There are many prediction problems where choosing the right accuracy measurement is particularly difficult. For example, what’s the best way to know whether this version of your recommendation system is better than the prior version? You could - as was the case with the Netflix Prize - try to predict the number of stars a user gave to a movie and measure your accuracy. Another (simpler) way to vet your recommender strategy would be to roll I out to users and watch before and after behaviors.

So by the end of this blog post, you (the reader) will hopefully be helping me improve our beer recommender through your clicks and interactions.

The final application which this blog will explain can be found at The original post explaining beer recommenders can be found here.

A/B testing and the multi-armed bandit

A/B testing is by far the most common way to gage user preference. Whenever a new user enters your site randomly assign them to one of the model versions you want to test. If I have two different beer recommenders I can run a controlled experiment, producing different recommendations for different users, then later switch to the recommender that performs the best on some metric like number of clicks. Cool stuff.

Bandit algorithms on the other hand, are a slightly more complicated spin on A/B testing. The name is based on the one-arm bandit, aka a slot machine, so called because of the single lever on its side (and propensity to take your money).

The term multi-armed bandit comes from a theoretical problem where instead of one slot machine lever, you have a number of them, say three like in the image above. If you know the odds and payouts, it's trivial to determine the lever with the highest expected value. But imagine you didn't know the odds (like not knowing which model preforms best). A bandit algorithm attempts to solve this problem and maximize profit by systematically testing different levers and remembering the rewards.

So what does this have to do with beer recommenders? What the analogy of a multi-armed slot machine captures well is it costs to test your hypotheses. A slot machine takes a coin to play. A website costs you a visit. Testing gains you knowledge, but will sometimes necessitate supplying your customers with an inferior product. In response, bandit algorithms are automated to move away from low scoring options and inherently prefer those that preform better.

Of course you can do this with an A/B test. And if you're only going to run one or two test over a short period of time which you track frequently, an A/B test will often perform just as well as a bandit algorithm. But as the number and length of hypotheses and tests increase, the advantage of an automatic feedback loop allows the designer to take a more hands off approach.


Despite its simplicity, the epsilon-greedy algorithm does a good job of encapsulating the spirit of bandit algorithms. Every time the algorithm has to choose an option (also referred to as an arm), it first considers two possibilities; explore or exploit. Explore corresponds to testing, and if epsilon-greedy takes this path it simply chooses an arm at random. The exploit path on the other hand leads only to the arm with the best performance so far, the safe bet.

The "best arm" is considered the option (in our case a recommendation model) with the highest probability of reward. I'll define my feedback system later, but for now think of reward as akin to \(clicks/views\). Below is a probabilistic diagram which expresses this behavior. Notice that once the system lands on the "testing" node, it's synonyms with generalized A/B testing.

Epsilon-greedy will choose to test with a frequency of epsilon (hence the name). An epsilon value of \(1.0\) will cause it to always test, while a value of \(0.0\) will cause it to always exploit the best preforming arm.

Here's a Python implementation of epsilon-greedy which will be used in the final application.

In [1]:
__filename__ = ""

import numpy as np

class EpsilonGreedy(object):
    def __init__(self,n_arms,epsilon_decay=50):
        self.counts = [0] * n  # example: number of views
        self.values = [0.] * n # example: number of clicks / views
        self.decay = epsilon_decay
        self.n = n_arms

    def choose_arm(self):
        """Choose an arm for testing"""
        epsilon = self.get_epsilon()
        if np.random.random() > epsilon:
            # Exploit (use best arm)
            return np.argmax(self.values)
            # Explore (test all arms)
            return np.random.randint(self.n)

    def update(self,arm,reward):
        """Update an arm with some reward value""" # Example: click = 1; no click = 0
        self.counts[arm] = self.counts[arm] + 1
        n = self.counts[arm]
        value = self.values[arm]
        # Running product
        new_value = ((n - 1) / float(n)) * value + (1 / float(n)) * reward
        self.values[arm] = new_value

    def get_epsilon(self):
        """Produce epsilon"""
        total = np.sum(arm_counts)
        return float(self.decay) / (total + float(self.decay))

While epsilon is normally a fixed value, I've implemented a method called get_epsilon() which produces a variable value based on the total amount of feedback. Let's take a look at what the embedded equation looks like using ggplot.

In [2]:
from ggplot import *
%matplotlib inline

ggplot(pd.DataFrame({'x':[0,500]}),aes(x='x')) + \
    stat_function(fun = lambda x: 50. / (x + 50.)) + \
    ggtitle("Epsilon decay over time") + \
    xlab("amount of feedback") + ylab("epsilon (probability of testing)")
<ggplot: (288370449)>

As the amount of feedback (example: number of views) increases epsilon gets smaller and smaller. Since epsilon represents the likelihood of testing, this function implies that the longer we leave the bandit algorithm running the more likely it will be to exploit rather than test. This makes sense doesn't it? The more feedback we receive, the more confident we can be that we've found the correct recommender, and over time we should faze out the testing of low scoring versions.

Three different beer recommenders

Our previous blog on recommenders already goes into quite a bit of depth so I'm going to be brief here. In the Python example below, I've created a data frame where each row represents a beer, each column represents a user and each cell holds the review score a user gave a beer (with zeros if they haven't reviewed it).

In [3]:
import pandas as pd
import numpy as np

df = pd.read_csv("beer_reviews.csv")
# Limit to the top 250 beers
top_n = df.beer_name.value_counts().index[:250]
df = df[df.beer_name.isin(top_n)]

# Create a matrix of beer vs. reviews; values are overall review score
df_wide = pd.pivot_table(df, values=["review_overall"],
        rows=["beer_name", "review_profilename"],
df_wide = df_wide.fillna(0)

The main emphasis of our recommendation system is the distance between beers. By distance, I mean the distance between the two rows representing each beer. This eventually boils down to comparing two vectors and their "similarity". In simple terms, beers that are closer by some metric should be more related and therefore a better recommendation.

But in multidimensional space there are a lot of ways to measure the distance and similarity between two points. Here I pick three measures: Euclidean distance, distance correlation, and cosine similarity. From there, I create three different distance matrices and define a helper function.

In [4]:
from sklearn.metrics.pairwise import (
    euclidean_distances, cosine_similarity, pairwise_distances

# Calculate euclidean distance between each beer
eucl_dists = euclidean_distances(df_wide)
eucl_dists = pd.DataFrame(eucl_dists, columns=df_wide.index)
eucl_dists.index = eucl_dists.columns

# Calculate cosine similarity
cos_dists = cosine_similarity(df_wide)
cos_dists = pd.DataFrame(cos_dists, columns=df_wide.index)
cos_dists.index = cos_dists.columns

# Calculate distance correlation
corr_dists = pairwise_distances(df_wide,metric='correlation')
corr_dists = pd.DataFrame(corr_dists, columns=df_wide.index)
corr_dists.index = corr_dists.columns

# Use distance matrix to determine similarity
def get_sims(products, dists, larger_is_closer=False):
    """Return similarity matrix"""
    p = dists[products].apply(lambda row: np.sum(row), axis=1)
    p = p.order(ascending=larger_is_closer)
    return p.index[p.index.isin(products)==False]

As shown in this next snippet, different similarity measurements will produce different results for the same set of products. This is what I'll be using the bandit algorithm to test for.

In [5]:
products = ["Sierra Nevada Pale Ale", "120 Minute IPA", "Coors Light"]

eucl_prods = get_sims(products,eucl_dists)[:10]
cos_prods = get_sims(products,cos_dists,larger_is_closer=True)[:10]
corr_prods = get_sims(products,corr_dists)[:10]

print "Products similar to:", ', '.join(products)
pd.DataFrame({'Euclidean Distance': eucl_prods,
        "Cosine Similarity":cos_prods,
        "Distance Correlation":corr_prods})
Products similar to: Sierra Nevada Pale Ale, 120 Minute IPA, Coors Light
Cosine Similarity Distance Correlation Euclidean Distance
0 Vanilla Porter The Abyss Pliny The Elder
1 New Holland Dragons Milk Oak Barrel Ale Vanilla Porter 90 Minute IPA
2 Supplication Supplication Old Rasputin Russian Imperial Stout
3 The Abyss New Holland Dragons Milk Oak Barrel Ale Bells Hopslam Ale
4 Terrapin Coffee Oatmeal Imperial Stout Terrapin Coffee Oatmeal Imperial Stout Founders Breakfast Stout
5 Founders Porter Founders Porter Two Hearted Ale
6 Raging Bitch Belgian-Style IPA Furious Sierra Nevada Celebration Ale
7 Furious Sculpin India Pale Ale Duvel
8 Founders Backwoods Bastard Founders Backwoods Bastard La Fin Du Monde
9 Dark Lord Imperial Stout Raging Bitch Belgian-Style IPA Stone Ruination IPA

10 rows × 3 columns

Finally time to have some fun with the yhat library. By subclassing YhatModel I can throw three different versions of my recommendation system on our public cloud to be called later. Notice that the only difference is the line that includes get_sims, where each function calls a different similarity matrix.

In [6]:
from yhat import Yhat, YhatModel, preprocess

class EuclideanBeerRec(YhatModel):
    @preprocess(in_type=dict, out_type=dict) 
    def execute(self, data):
        beers = data.get("beers",[])
        suggested_beers = get_sims(beers,eucl_dists)[:10]
        result = []
        for beer in suggested_beers:
            result.append({"beer": beer})
        return result

class CosineBeerRec(YhatModel):
    @preprocess(in_type=dict, out_type=dict) 
    def execute(self, data):
        beers = data.get("beers",[])
        suggested_beers = get_sims(beers,cos_dists,larger_is_closer=True)[:10]
        result = []
        for beer in suggested_beers:
            result.append({"beer": beer})
        return result

class CorrelationBeerRec(YhatModel):
    @preprocess(in_type=dict, out_type=dict) 
    def execute(self, data):
        beers = data.get("beers",[])
        suggested_beers = get_sims(beers,corr_dists)[:10]
        result = []
        for beer in suggested_beers:
            result.append({"beer": beer})
        return result

yh = Yhat("__username__","__apikey__","")

The web app

Ultimately, the reason I'm using yhat is because I don't just want to train my bandit algorithm on my laptop, I want to build a web application so you can help. Here's what the setup needs to look like. Every time a new user enters the system, epsilon-greedy will instruct the application to call one of my three deployed models. The application then should produce a set of predictions using that recommender and display them to the user. The user does some stuff which results in feedback for epsilon-greedy.

I've modified our original beer recommender flask app ( to do just that. The POST request handles a user's request for beer recommendations. After consulting with epsilon-greedy, it calls the chosen model through yhat simply by switching out the name of the arm.

The full application can be found on our github.

In [7]:
__filename__ = ""

from flask import Flask, request, render_template, url_for, Response, json
from yhat import Yhat
from uuid import uuid4

app = Flask(__name__)

if __name__ == '__main__':
    arms = ["EuclieanBeerRec","CosineBeerRec","CorrelationBeerRec"]
    # epsilon-greedy is stateful and will sit in the webapp
    from bandits import EpsilonGreedy
    bandit = EpsilonGreedy(len(arms))
    # Construct a yhat object
    yh = Yhat("__username__", "__apikey__", "")
    ids = {}
    # start the app

@app.route('/', methods=['GET', 'POST'])
def index():
    if request.method == 'POST':
        """Generate recommendations for the user"""
        # Choose an arm
        arm = bandit.choose_arm()
        arm_name = arms[arm]
        # Predict that arm through yhat
        pred = yh.predict(arm_name, {"beers": request.json['beers']})
        # Some uuid stuff so the user doesn't know which arm they're on
        u_id = str(uuid4())
        ids[u_id] = { 'arm':arm, 'arm_name':arm_name }
        # Pass the results to the user
        return Response(json.dumps({'result':pred['result'],
    elif request.method == 'GET':
        """ Render page for user """

And that's it. Go and head over to and help me train a live epsilon-greedy algorithm. With your interactions, the best preforming of my three models should rise to the top. And since it doesn't require my involvement, I'm off to lunch.

Defining a reward metric

Oh right, how do user clicks translate to "reward" for each arm of the bandit algorithm? Every time a user gets a recommendation, they'll be asked to pick the most relevant beer to their query. The ranking of the chosen beer determines the reward for the model. If the selected beer is the top item, the algorithm gets a "reward" of 1. If the beer is the last item, it gets a reward of 0. Beers in-between generate rewards as show by the graph below.

In [10]:
x = np.arange(10) # Possible position of beer
y = (1 - (x / 9.)) ** 2

ggplot(pd.DataFrame({'item rank':x,'reward':y}),aes(x='item rank',y='reward')) + \
    geom_line() + ggtitle("Chosen item rank vs. bandit alogrithm reward")
<ggplot: (289036181)>

Final Thoughts

This post was inspired by John Myles White's Bandit Algorithms for Website Optimization. Wonderfully short and concise, I'm now convinced that all O'Reilly publications should have an alternative version that's under 100 pages. Definitely go check it out if you haven't already.

Our Products

Rodeo: a native Python editor built for doing data science on your desktop.

Download it now!

ScienceOps: deploy predictive models in production applications without IT.

Learn More

Yhat (pronounced Y-hat) provides data science solutions that let data scientists deploy and integrate predictive models into applications without IT or custom coding.