The Code Behind Building a FiveThirtyEight post

by Greg


About a week ago I came across the Don’t Be Fooled By Baseball’s Small-Budget Success Stories post on FiveThirtyEight Sports. It was featured on the ESPN home page, so it was pretty hard to miss. I thought it was a great post, and it left me wondering: how hard would it be to replicate?

Asking the right question, in this case "Do MLB payrolls matter?" (spoiler: they do), is definitely the difficult part, but the analysis looked familiar enough that I thought I might be able to recreate it using my trusty friend: R. In addition, the article didn't show any part of the data analysis (as it shouldn't; that sort of thing doesn't appeal to the audience they're trying to reach), so I thought replicating it here would be a good way to show people just how you might go about it!

So in this post I'll be showing you how to recreate the Don’t Be Fooled By Baseball’s Small-Budget Success Stories article from the ground up: all the way from sourcing the data and computing the statistics to making your line charts have the team's actual colors.

The Data

As mentioned by the author, the data comes from Baseball-Reference.com. It's my favorite source for baseball data as well. They've got records going back as far as 1871. That's before the MLB even existed.

Lucky for me, I had actually already collected all of the data in a previous post. All I needed to do was fire up R and run my script again to refresh the dataset. You can download the dataset here.

> df <- read.csv("./mlb-standings-and-payroll.csv", stringsAsFactors=FALSE)
# the FiveThirtyEight post only looks at the 1985 season to present 
> df <- df[df$year >= 1985,]
> head(df)
    tm year  w wins_losses est_payroll
 1 ANA 1997 84       0.518    31135472
 2 ANA 1998 85       0.525    41791000
 3 ANA 1999 70       0.432    55633166
 4 ANA 2000 82       0.506    52664167
 5 ANA 2001 75       0.463    47735167
 6 ANA 2002 99       0.611    61721667

Munging and Merging

Of course no matter how clean you think your data is, it's never just right. There's always a column that needs normalizing or a field that needs filling.

The first thing I noticed was that I was going to need a way to reconcile all of the team abbreviations. For example, the Washington Nationals (WSN) were actually the Montreal Expos (MON) until 2004. I created a mapping file that mapped abbreviations for old teams and relocated teams to their modern counterparts.

> team.lookups <- read.csv("./team-lookups.csv")
> df <- merge(df, team.lookups, by.x="tm", by.y="historic_team", stringsAsFactors=FALSE)
> head(df[df$tm=="MON",])
      tm year  w wins_losses est_payroll modern_team
 905 MON 1969 52       0.321          NA         WSN
 906 MON 1970 73       0.451          NA         WSN
 907 MON 1971 71       0.441          NA         WSN
 908 MON 1972 70       0.449          NA         WSN
 909 MON 1973 79       0.488          NA         WSN
 910 MON 1974 79       0.491          NA         WSN

Big Head Teddy Roosevelt is a far cry from Youppi.
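Downstream, the point of the modern_team column is to let a franchise's historic seasons roll up under its current abbreviation. A minimal sketch of how that might look (the franchise column name is just something I'm introducing here for illustration):

# treat every historic abbreviation as its modern franchise, e.g. MON -> WSN
df$franchise <- df$modern_team
head(df[df$franchise=="WSN", c("tm", "year", "w", "franchise")])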

Colors

One of the most professional touches in the FiveThirtyEight charts is that they use the teams' actual colors. And for good reason: MLB brands are a valuable asset. The New York Yankees' brand was estimated at a value of $398 million in 2012 [1]. Unfortunately ggplot doesn't have a baked-in MLB color mapping, so I had to roll up my sleeves and do this the old-fashioned way: using an Excel file.

The good news is that there's already a website with the corresponding hex color codes for most pro sports teams. From there it's just a matter of grabbing the right ones and plopping them into a file that maps each team name to its primary color.
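As a rough sketch of how that mapping might plug into ggplot later on (I'm assuming a two-column team-colors.csv with team and primary_color fields; those column names are mine, not necessarily what's in the actual file):

# read the team -> hex color mapping and turn it into a named vector
team.colors <- read.csv("./team-colors.csv", stringsAsFactors=FALSE)
color.map <- setNames(team.colors$primary_color, team.colors$team)

# hand the mapping to ggplot with scale_color_manual(values=color.map)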


Standardizing Payroll

OK, hard stuff done (seriously, the longest part of writing this post was creating the team-colors.csv mapping file). In the Kansas City Royals chart, the author states that the x-axis is the "standard deviations from average salary". I took this to mean the number of standard deviations a team's payroll sits from the league average for that year. A bit of a mouthful, but it's a decent way to quantify how abnormally high or low a team's salary is in a given year while also accounting for things like inflation, the growth of professional sports in the USA, and owners' willingness to spend.

So to come up with this "standardized salary" number we need the following:

  • mean payroll for each year
  • payroll standard deviation for each year

Once we have these numbers, we can compare them to the payrolls for each team on a yearly basis. From there we calculate how many standard deviations away from the mean each team's payroll is.

So for example, in 2014 the Oakland A's payroll was $89,160,900 and the league average payroll was $115,652,839 with a standard deviation of $47,884,348. In this circumstance, the A's standardized payroll would be:

($89,160,900 - $115,652,839) / $47,884,348 = -0.55

Great! This means the A's payroll was well below the league average. Just for a visual, here's where the rest of the teams come out in the 2014 season.

Calculating this for every year is relatively simple. We'll use plyr to split our data by each year, calculate the mean and standard deviation, and then merge it back into the original dataset to compute the standardized_payroll values for each team/year.
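A sketch of that step with plyr's ddply might look something like this (the mean_payroll and sd_payroll column names are just ones I'm making up here):

library(plyr)

# per-year league payroll mean and standard deviation
payroll.by.year <- ddply(df, .(year), summarise,
                         mean_payroll=mean(est_payroll, na.rm=TRUE),
                         sd_payroll=sd(est_payroll, na.rm=TRUE))

# merge the yearly stats back in and compute each team's z-score for that year
df <- merge(df, payroll.by.year, by="year")
df$standardized_payroll <- (df$est_payroll - df$mean_payroll) / df$sd_payroll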

Calculating the curves

I'm not 100% sure what technique the FiveThirtyEight authors are using to fit their curves, but I have a good enough understanding to make a decent approximation (read: my curves may look slightly different for some teams).

It looks like the FiveThirtyEight folks are using a polynomial regression model to estimate the smoothed curve. This is really easy to do with ggplot. You can define "smoothing functions" within the stat_smooth function. For a polynomial regression with degree 2 (which I'm pretty sure is what's being used here), you use the following formula:

y ~ poly(x, 2)

When combined with ggplot, your function looks something like this:

df.mets <- subset(df, tm=="NYM")
p <- ggplot(df.mets, aes(x=standardized_payroll, y=wins_losses)) + 
  geom_point(alpha=0.75, size=3) + 
  stat_smooth(size=1.5, method="lm", formula = y ~ poly(x, 2), se=FALSE) +
  scale_x_continuous(name="Standardized Salary\n(# of standard deviations from yearly mean)",
                     breaks=c(-2, 0, 2), limit=c(-2.5, 2.5), labels=c("-2", "0", "+2")) +
  scale_y_continuous(name="Win/Loss %", breaks=seq(0.3, 0.7, 0.1), limit=c(0.2, 0.8)) +
  ggtitle("New York Mets\nWin Percentage vs. standard deviations from average salary")
print(p)

We can overlay the equivalent fitted curve for the entire league using the following code:

p + stat_smooth(data=df, color="grey", size=.5, method="lm", formula = y ~ poly(x, 2), se=FALSE)

The polynomial regression method is a decent enough way to approach things, but what's a little concerning is that the quality of the fits isn't indicated.

For example, take a look at the New York Mets' chart. The R² value for that curve is 0.17. That's not very good. So while the curves might make these charts a little easier to look at, don't take them as fact.
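If you want to sanity-check a fit yourself, you can mirror ggplot's smoother with a plain lm call. A sketch using the Mets subset from above (dropping rows with a missing payroll first, since poly() won't accept NAs):

# fit the same degree-2 polynomial that stat_smooth draws above
mets.complete <- subset(df.mets, !is.na(standardized_payroll))
fit <- lm(wins_losses ~ poly(standardized_payroll, 2), data=mets.complete)
summary(fit)$r.squared  # roughly 0.17, the value quoted above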

Need help interpreting this output? Check out our post on Fitting and Interpreting Linear Models in R.

Styling the plots

The ggthemes package is particularly awesome. It can make your ggplots look like charts from Excel, the Wall Street Journal, or even Tableau. As luck would have it, there's even a FiveThirtyEight theme: theme_fivethirtyeight(). This means our life just got a lot easier!
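Applying it to the Mets plot from earlier is a one-liner (assuming you have ggthemes installed):

library(ggthemes)

# restyle the existing plot with the FiveThirtyEight look
print(p + theme_fivethirtyeight())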

We're going to build subplots for each MLB division, so each of the 6 divisions will have its own .png file at the end (this makes the charts easier to read).

To do this we're going to use the same code as before, except we'll loop through a list of divisions and filter each plot so that it only plots data for the relevant division.
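A rough sketch of that loop might look something like this (here division is a column I'd have to add from another lookup file, and color.map is the named color vector from the colors sketch above; the names in color.map would need to match the tm values):

# one .png per division
for (div in unique(df$division)) {
  df.div <- subset(df, division == div)
  p <- ggplot(df.div, aes(x=standardized_payroll, y=wins_losses, color=tm)) +
    geom_point(alpha=0.75) +
    stat_smooth(method="lm", formula = y ~ poly(x, 2), se=FALSE) +
    scale_color_manual(values=color.map) +
    facet_wrap(~ tm) +
    theme_fivethirtyeight() +
    ggtitle(div)
  ggsave(paste0(div, ".png"), p, width=10, height=6)
}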

Breakout of the AL East

Combine each division's plot into a single image using something like Preview on OS X or Paint on Windows. If you're a command-line aficionado, you can use the convert command provided by ImageMagick.

The Final Product

And voilà! You're ready to go start writing for FiveThirtyEight.


Final Thoughts

Since there's a lot of code and data, I created a repository on GitHub if you want to download everything at once.

FiveThirtyEight did a great job with their post, and I think it's encouraging to know that it's not impossible to produce the same high-quality content on your own (an Adobe Illustrator license wouldn't hurt either). In the future, I'd love to see more statistical accountability in posts like these. Note that the data behind some posts is published on FiveThirtyEight's GitHub; unfortunately this wasn't one of them.


[1] Statista: http://www.statista.com/graphic/1/254950/major-league-baseball-teams-ranked-by-brand-value.jpg


