# The Code Behind Building a FiveThirtyEight post

#### by Greg |

About a week ago I came across the Don’t Be Fooled By Baseball’s Small-Budget Success Stories on FiveThirtyEightSports. It was featured on the ESPN home page so it was pretty hard to miss. I thought it was a great post and it left me wondering: How hard would it be to replicate?

Asking the right question, in this case "Do MLB payrolls matter?" (*spoiler: they do*), is definitely the
difficult part, but the analysis looked familiar enough that I thought I might be
able to recreate it using my trusty friend: R. In addition, the article didn't
show any part of the data analysis (as it shouldn't, that sort of thing doesn't
appeal to the audience they're trying to reach), so I thought replicating it here would
be a good way to show people just how you might go about it!

So in this post I'll be showing you how to recreate the *Don’t Be Fooled By Baseball’s Small-Budget Success Stories* article
from the ground up: all the way from sourcing the data and computing the statistics to making your line charts have the team's actual colors.

### The Data

As mentioned by the author, the data comes from Baseball-Reference.com. It's my favorite source for baseball data as well. They've got records going back as far as 1871. That's before the MLB even existed.

Lucky for me, I had actually already collected all of the data in a previous post. All I needed to do was fire up R and run my script again to refresh the dataset. You can download the dataset here.

```
> df <- read.csv("./mlb-standings-and-payroll.csv", stringsAsFactors=FALSE)
# the FiveThirtyEight post only looks at the 1985 season to present
> df <- df[df$year >= 1985,]
> head(df)
tm year w wins_losses est_payroll
1 ANA 1997 84 0.518 31135472
2 ANA 1998 85 0.525 41791000
3 ANA 1999 70 0.432 55633166
4 ANA 2000 82 0.506 52664167
5 ANA 2001 75 0.463 47735167
6 ANA 2002 99 0.611 61721667
```

### Munging and Merging

Of course no matter how clean you think your data is, it's never *just right*.
There's always a column that needs normalizing or a field that needs filling.

The first thing I noticed was that I was going to need a way to reconcile all of the team abbreviations. For example, the Washington Nationals (WSN) were actually the Montreal Expos (MON) until 2004. I created a mapping file that mapped abbreviations for old teams and relocated teams to their modern counterparts.

```
> team.lookups <- read.csv("./team-lookups.csv")
> df <- merge(df, team.lookups, by.x="tm", by.y="historic_team", stringsAsFactors=FALSE)
> head(df[df$tm=="MON",])
tm year w wins_losses est_payroll modern_team
905 MON 1969 52 0.321 NA WSN
906 MON 1970 73 0.451 NA WSN
907 MON 1971 71 0.441 NA WSN
908 MON 1972 70 0.449 NA WSN
909 MON 1973 79 0.488 NA WSN
910 MON 1974 79 0.491 NA WSN
```

### Colors

One of the most professional parts of the FiveThirtyEight charts is that they use the team's actual colors. And for good reason, MLB brands are a valuable asset. The New York Yankees' brand was estimated at a value of \$398 million in 2012 [1]. Unfortunately ggplot doesn't have a baked in MLB color mapping so I had to roll up my sleeves and do this the old fashioned way: using an Excel file.

Good news is that there's already a website that has the corresponding hex color codes for most pro sports teams. From there it's just a matter of grabbing the right ones and plopping them into a file that maps each team name to their primary color.

### Standardizing Payroll

Ok hard stuff done (seriously, the longest part about writing this post was creating
the team-colors.csv mapping file). In the Kansas City Royals chart, the author
states that the x-axis is the **"standard deviations from average salary"**.
I took this to mean that the x-axis is the number of standard
deviations a team's salary is for a given year when compared to the rest of
the league's salaries for that year. A bit of a mouthful, but it's a decent
way to quantify how abnormally high or low a team's salary is in a given year
while also accounting for things like inflation, growth of professional sports
in the USA, and owners' willingness to spend.

So to come up with this "standardized salary" number we need the following:

- mean payroll for each year
- payroll standard deviation for each year

Once we have these numbers, we can compare them to the payrolls for each team on a yearly basis. From there we calculate how many standard deviations away from the mean each team's payroll is.

So for example, in 2014 the Oakland A's payroll was \$89,160,900 and the league average payroll was \$115,652,839 with a standard deviation of \$47,884,348. In this circumstance, the A's standardized payroll would be:

```
($89,160,900 - $115,652,839) / $47,884,348 = -0.55
```

Great! This means the A's were well below average payroll when compared to the rest of the league. Just for a visual, here's what the rest of the teams come out to in the 2014 season.

Calculating this for every year is relatively simple. We'll use `plyr`

to split
our data by each year, calculate the mean and standard deviation, and then merge
it back into the original dataset to compute the `standardized_payroll`

values
for each team/year.

### Calculating the curves

I'm not 100% sure about what techniques the FiveThirtyEight authors are using for
fitting curves, but I have a good enough understanding that I can make a decent
approximation (read that as, my curves may look *slightly* different for some teams).

It looks like the FiveThirtyEight folks are using a polynomial regression
model to estimate the smoothed curve. This is really easy to do with ggplot. You can
define "smoothing functions" within the `stat_smooth`

function. For a polynomial
regression with degree 2 (which I'm pretty sure is what's being used here), you use
the following formula:

```
y ~ poly(x, 2)
```

When combined with ggplot, your function looks something like this:

```
df.mets <- subset(df, tm=="NYM")
p <- ggplot(subset(df, tm=="NYM"), aes(x=standardized_payroll, y=wins_losses)) +
geom_point(alpha=0.75, size=3) +
stat_smooth(size=1.5, method="lm", formula = y ~ poly(x, 2), se=FALSE) +
scale_x_continuous(name="Standardized Salary\n(# of standard deviations from yearly mean)",
breaks=c(-2, 0, 2), limit=c(-2.5, 2.5), labels=c("-2", "0", "+2")) +
scale_y_continuous(name="Win/Loss %", breaks=seq(0.3, 0.7, 0.1), limit=c(0.2, 0.8)) +
ggtitle("New York Mets\nWin Percentage vs. stgandard deviations from average salary")
print(p)
```

We can overlay the equivalent fitted curve, but for the entire league by using the following code:

```
p + stat_smooth(data=df, color="grey", size=.5, method="lm", formula = y ~ poly(x, 2), se=FALSE)
```

The polynomial regression method is a decent enough way to approach things but what's a little concerning is that the quality of the fits aren't indicated.

For example, take a look at the New York Mets' chart. The (R^2) value for that curve is 0.17. That's not very good. So while the curves might make these charts a little easier to look at, don't take them as fact.

### Styling the plots

The `ggthemes`

is particularly awesome. It
can make your ggplots look like charts in Excel, Wall Street Journal, or even
Tableau. As luck would have it, there's even a theme for FiveThirtyEight--`theme_fivethirtyeight()`

.
This means that our life just got *a lot* easier!

We're going to build subplots for each MLB division. So each of the 6 divisions will have their own .png file at the end (this makes it easier to read).

To do this we're going to use the same code as before, except we'll loop through a list of divisions and filter each plot so that it only plots data for the relevant division.

Combine each division's image plot using something like Preview on OSX or Paint
on Windows. If you're a command-line aficionado, you can use the `convert`

function provided by Imagemagick.

### The Final Product

And voilà! You're ready to go start writing for FiveThirtyEight.

### Final Thoughts

Since there's a lot of code and data, I created a repository on GitHub if you want to download everything at once.

FiveThirtyEight did a great job with their post and I think it's encouraging to know that it's not impossible to produce the same high quality content on your own (an Adobe Illustrator license wouldn't hurt either). In the future, I'd love to see more statistical accountability in posts like these--Note that the data from some posts is published on the FiveThirtyEight's GitHub, unfortunately this wasn't one of them.

If you liked this post, here are some other resources you might find helpful or interesting:

- Baseball-Reference.com: MLB Stats, Standings, Scores, and History
- Team Colors: Find and copy color values from your favorite sports teams.
- Original FiveThirtyEight post
- Another other fun ggplot themes
- Combining images progromatically on OSX
- USA Today MLB Salary Database

[1] Statistica: http://www.statista.com/graphic/1/254950/major-league-baseball-teams-ranked-by-brand-value.jpg