# ML Pitfalls: Measuring Performance (Part 1)

#### by Eric

Unfortunately, analysis lives and dies by self-reported metrics.

Is feature A better than feature B? Is this classifier better than another? How much confidence can I have in this financial report? From development to consumption, almost every decision regarding analytics inherently asks "How good is this model?"

"How good" can mean a lot of things, and it varies across domains and problem sets. But it is the developer's responsibility to provide a fair measurement in the first place. That task is surprisingly easy to mess up.

So before you use all sorts of fancy software to run on your biggest computing cluster, make sure your "How good" is accurate.

### Getting started

There's no data analysis without data, so let's grab some and get started. You can download a copy here.

```
import numpy as np
import pandas as pd
df = pd.read_csv("heights_and_weights.csv")[:100]
df.head()
```

| | HHX | FMX | FPX | SEX | BMI | SLEEP | educ | height | weight |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 16 | 1 | 2 | 1 | 33.36 | 8 | 16 | 74 | 260 |
| 1 | 20 | 1 | 1 | 1 | 26.54 | 7 | 14 | 70 | 185 |
| 2 | 69 | 1 | 2 | 2 | 32.13 | 7 | 9 | 61 | 170 |
| 3 | 87 | 1 | 1 | 1 | 26.62 | 8 | 14 | 68 | 175 |
| 4 | 88 | 1 | 1 | 2 | 27.13 | 8 | 13 | 66 | 168 |

For these examples we'll concentrate on analytical methods, so the data we're working with isn't that important. In practice this is the wrong approach to take. Bad assumptions about your data can corrupt perceived accuracy as much as anything else, but we'll save that for part 2.

A good place to start is to separate the data into "predictive features" and "thing to predict." In this case the thing we'll be trying to predict is "weight", and the rest of the columns will be the things we'll use to try to predict it. Each row is an observation or a single patient.

```
feats = df.drop("weight", axis=1) # drop the weight from the feature matrix
# Create a good old X and y combination. Our features and our targets.
X = feats.values
y = df["weight"].values
```

Let's fit a model. Because of the nice, clean data we started with (and scikit-learn's standard interfaces), this is super straightforward.

```
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
X_scaled = StandardScaler().fit_transform(X) # always standardize your features
lr = LinearRegression()
lr.fit(X_scaled, y)
# LinearRegression(copy_X=True, fit_intercept=True, normalize=False)
```

Okay, moment of truth. How good is this model?

### A simple measure: $R^2$

There are a lot of measurements we can use, but fundamentally they all ask the same question: how close is the model's prediction to the truth? In our case, how close is the predicted weight of a person to their actual weight?

Let's make a prediction on our observation set with our trained model `lr`.

```
y_pred = lr.predict(X_scaled)
```

Next we have to pick a metric. I rolled a die and decided to go with $R^2$. This metric uses some fancy calculations with squares and means, but the biggest trait to note is that, because errors are squared, it penalizes outliers more heavily than other methods. It produces a single number: a value of $1.0$ for a perfectly performing model, with lower values being worse.
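For intuition, the score is simple to compute by hand. Here's a minimal numpy sketch of the standard definition, $R^2 = 1 - SS_{res}/SS_{tot}$ (the `r_squared` helper and the toy numbers are mine, purely for illustration):

```
import numpy as np

def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot: the fraction of variance explained."""
    ss_res = np.sum((y_true - y_pred) ** 2)           # residual sum of squares
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)  # total sum of squares
    return 1.0 - ss_res / ss_tot

y_true = np.array([160.0, 175.0, 190.0])
y_pred = np.array([165.0, 170.0, 195.0])
print(r_squared(y_true, y_pred))  # ≈ 0.833
```

The squared residuals in the numerator are exactly why outliers get punished so heavily.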

```
from sklearn.metrics import r2_score
r2_score(y, y_pred)
# 0.734755
# NOTE: Plotting functions are defined at the bottom of this article.
plot_r2(y, y_pred, "Performance of Linear Regression")
```

The $x$ axis on this plot shows the actual "weight" of each person, and the $y$ axis shows what the LinearRegression predicts. The blue line represents a perfect prediction. Seems this model does alright, but more importantly we have a number to compare against future attempts to improve.

Now, serious analysis calls for serious models. Let's import a much more serious algorithm, K-Nearest Neighbors, and see how it does.

We'll use the same method for evaluating Nearest Neighbors as Linear Regression.

```
from sklearn.neighbors import KNeighborsRegressor
# a very serious algorithm
knn = KNeighborsRegressor(n_neighbors=1)
knn.fit(X_scaled, y)
# how good is k-nearest neighbors?
y_pred = knn.predict(X_scaled)
plot_r2(y, y_pred, "Performance of KNN (K=1)")
```

As shown in the graph above, K-Nearest Neighbors regression does much better, predicting the weight of each individual perfectly.

### Please don't overfit

So.... what just happened?

In short, we overfit. By training and predicting on the same data, it's easy for a model to seem much better than it actually is. This makes sense, doesn't it? We showed the model the weight of an individual during training, then asked it the weight of the same person. It's like knowing the test answers a priori.
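To see the memorization directly, here's a tiny sketch (toy data, not the article's dataset): a 1-nearest-neighbor regressor asked about its own training points simply returns the labels it stored.

```
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

X_toy = np.array([[1.0], [2.0], [3.0]])
y_toy = np.array([10.0, 20.0, 30.0])

knn = KNeighborsRegressor(n_neighbors=1).fit(X_toy, y_toy)
# Each training point's nearest neighbor is itself, so the "prediction"
# is just the memorized label: a perfect (and meaningless) fit.
print(knn.predict(X_toy))  # [10. 20. 30.]
```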

What we'd like to do is fit our model on one set of data, then measure accuracy with another set. We need some holdout data.

A holdout set is exactly what it sounds like: prior to training our model, we set aside a subset of the data, and after the model is trained, we test it on that subset. scikit-learn's bag of tricks comes to the rescue with a simple `train_test_split` function ("test" is equivalent to "holdout").
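There's no magic in the split itself. Here's a rough numpy sketch of the idea (not scikit-learn's actual implementation; `manual_train_test_split` is a hypothetical name of my own):

```
import numpy as np

def manual_train_test_split(X, y, test_size=0.3, seed=0):
    """A rough sketch of a random holdout split."""
    rng = np.random.RandomState(seed)
    idx = rng.permutation(len(y))      # shuffle the row indices
    n_test = int(len(y) * test_size)   # size of the holdout set
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)
X_tr, X_te, y_tr, y_te = manual_train_test_split(X_demo, y_demo)
print(len(y_tr), len(y_te))  # 7 3
```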

```
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
print("Train set has %d observations" % len(y_train))
# Train set has 70 observations
print("Holdout set has %d observations" % len(y_test))
# Holdout set has 30 observations
```

The important thing about this is that train and test have no observations in common, so we can't cheat the metric by knowing the answer ahead of time. Let's train another model, but this time without the testing data.

```
t = StandardScaler().fit(X_train)
X_train_scaled = t.transform(X_train) # always standardize your features
lr = LinearRegression()
lr.fit(X_train_scaled, y_train)
```

Finally, we can actually make a prediction using the testing data.

```
X_test_scaled = t.transform(X_test) # use the StandardScaler from training
y_test_pred = lr.predict(X_test_scaled)
r2_score(y_test, y_test_pred)
# 0.42967366162304799
```

And similarly, for our K-Neighbors Regression:

```
knn = KNeighborsRegressor(n_neighbors=1).fit(X_train_scaled, y_train)
r2_score(y_test, knn.predict(X_test_scaled))
# -0.12588851798679546
# yes, rsquared can be negative
```

Note: If you run the code above, you'll likely get different $R^2$ values because of the random split of training and holdout data.
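About that negative score: $R^2 = 0$ corresponds to always predicting the mean of the targets, so any model that does worse than the mean goes negative. A tiny sketch with toy numbers of my own:

```
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([1.0, 2.0, 3.0])

# Always predicting the mean scores exactly 0.0 ...
print(r2_score(y_true, np.full(3, y_true.mean())))  # 0.0
# ... so anything worse than the mean goes negative.
print(r2_score(y_true, np.array([3.0, 3.0, 3.0])))  # -1.5
```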

As expected, both our Linear and K-Neighbors regressions' $R^2$ scores have dropped significantly.

### Cross-Validation: fancy holdouts

Of course, as we shrink the training sample our model gets less robust, and as we shrink the test sample our results get less reliable. If we train on 70% of our data then test on the remaining 30%, we've lost some precision on both ends. Shifting the split can improve either our model or our metric, but not both.

A clever statistician might randomly split the data a few times to determine the reliability of the accuracy measurement. If the splits are done correctly, we might even be able to produce measurements for each observation rather than just 30%.

K-Fold cross validation is the formalization of this idea. It strategically performs train-test splits so every observation is in the test set exactly once, thereby ensuring we get an uncorrupted prediction for each observation.
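We can check that "exactly once" property directly; a quick sketch (assuming a recent scikit-learn where `KFold` lives in `model_selection`), with dummy data standing in for the real features:

```
import numpy as np
from sklearn.model_selection import KFold

n = 20
kf = KFold(n_splits=5, shuffle=True, random_state=0)
# Collect the test indices from every fold ...
seen = np.concatenate([test_idx for _, test_idx in kf.split(np.zeros((n, 1)))])
# ... and together they cover each observation exactly once.
print(np.array_equal(np.sort(seen), np.arange(n)))  # True
```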

Yet again, scikit-learn to the rescue.

```
from sklearn.model_selection import KFold
# Always use shuffle=True to produce random folds. If you need non-random splits, you're probably doing something wrong.
kf = KFold(n_splits=10, shuffle=True)
```

By choosing 10 folds, we'll run 10 train-test splits and actually train 10 separate models, one for each split. Each model then predicts on its own unique holdout set. This means each model gets to train on 90% of the original data, rather than just 70%.

```
y_pred = np.zeros(len(y))  # where we'll accumulate predictions
lr = LinearRegression()
# train_index and test_index never share values, and the test sets never overlap
for train_index, test_index in kf.split(X):
    # each iteration of the loop performs one train-test split
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    t = StandardScaler()
    X_train = t.fit_transform(X_train)  # fit the scaler on the training data only
    lr.fit(X_train, y_train)  # train on the training data
    X_test = t.transform(X_test)
    y_pred[test_index] = lr.predict(X_test)  # predict on the held-out fold and store in y_pred
print(r2_score(y, y_pred))
# 0.71816746048
```

Because this metric comes from models trained on a larger share of the data, and takes every observation into account, we can assume it's much more accurate than the original "single split" metric.

Finally, we have a somewhat accurate measurement.
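As an aside, the manual loop above is common enough that newer scikit-learn wraps it up: `cross_val_predict` combined with a `Pipeline` re-fits the scaler inside every training fold, so nothing leaks from the test folds. A sketch with toy stand-ins for `X` and `y`:

```
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy stand-ins for the article's X and y.
rng = np.random.RandomState(0)
X = rng.randn(100, 3)
y = X @ np.array([2.0, -1.0, 0.5]) + rng.randn(100) * 0.1

# The pipeline re-fits the scaler inside each training fold.
model = make_pipeline(StandardScaler(), LinearRegression())
kf = KFold(n_splits=10, shuffle=True, random_state=0)
y_pred = cross_val_predict(model, X, y, cv=kf)  # one out-of-fold prediction per row
print(y_pred.shape)  # (100,)
```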

### Wrapping things up: Part 1

Hold-out sets and cross-validation are great ways to confirm your model is actually as robust as you think. Some analytics teams take this idea even further and split hold-out sets across departments: models are trained and tested by one group of analysts and then tested again on a hold-out set by an entirely different department.

This might sound a bit bureaucratic, but it's just another way of ensuring models aren't overfit or assumed to be more robust than they actually are. If you're letting financial decisions be driven by a model, it's worth ensuring you've done this correctly.

In part 2 of this post, we'll cover the terrifyingly subtle ways cross validation can go wrong. Stay tuned!

### Appendix: The Plotting Code

My ugly `pyplot` code for those who want it.

```
%matplotlib inline
from matplotlib import pyplot as plt
from sklearn.metrics import r2_score

def plot_r2(y, y_pred, title):
    plt.figure(figsize=(10, 6))
    plt.grid()
    plt.scatter(y, y_pred, marker='.')
    plt.xlabel("Actual Target")
    plt.ylabel("Predicted Target")
    plt.title(title)
    xmn, xmx = plt.xlim()
    ymn, ymx = plt.ylim()
    mx = max(xmx, ymx)
    buff = mx * .1
    plt.text(xmn + buff, mx - buff, "R2 Score: %f" % r2_score(y, y_pred), size=15)
    plt.plot([0., mx], [0., mx])
    plt.xlim(xmn, mx)
    plt.ylim(ymn, mx)
```