About a week ago I came across the Don’t Be Fooled By Baseball’s Small-Budget Success Stories on FiveThirtyEightSports. It was featured on the ESPN home page so it was pretty hard to miss. I thought it was a great post and it left me wondering: How hard would it be to replicate?
Asking the right question, in this case "Do MLB payrolls matter?" (spoiler: they do), is definitely the difficult part, but the analysis looked familiar enough that I thought I might be able to recreate it using my trusty friend: R. In addition, the article didn't show any part of the data analysis (as it shouldn't, that sort of thing doesn't appeal to the audience they're trying to reach), so I thought replicating it here would be a good way to show people just how you might go about it!
So in this post I'll be showing you how to recreate the Don’t Be Fooled By Baseball’s Small-Budget Success Stories article from the ground up: all the way from sourcing the data and computing the statistics to making your line charts have the team's actual colors.
As mentioned by the author, the data comes from Baseball-Reference.com. It's my favorite source for baseball data as well. They've got records going back as far as 1871. That's before the MLB even existed.
> df <- read.csv("./mlb-standings-and-payroll.csv", stringsAsFactors=FALSE) # the FiveThirtyEight post only looks at the 1985 season to present > df <- df[df$year >= 1985,] > head(df) tm year w wins_losses est_payroll 1 ANA 1997 84 0.518 31135472 2 ANA 1998 85 0.525 41791000 3 ANA 1999 70 0.432 55633166 4 ANA 2000 82 0.506 52664167 5 ANA 2001 75 0.463 47735167 6 ANA 2002 99 0.611 61721667
Munging and Merging
Of course no matter how clean you think your data is, it's never just right. There's always a column that needs normalizing or a field that needs filling.
The first thing I noticed was that I was going to need a way to reconcile all of the team abbreviations. For example, the Washington Nationals (WSN) were actually the Montreal Expos (MON) until 2004. I created a mapping file that mapped abbreviations for old teams and relocated teams to their modern counterparts.
> team.lookups <- read.csv("./team-lookups.csv") > df <- merge(df, team.lookups, by.x="tm", by.y="historic_team", stringsAsFactors=FALSE) > head(df[df$tm=="MON",]) tm year w wins_losses est_payroll modern_team 905 MON 1969 52 0.321 NA WSN 906 MON 1970 73 0.451 NA WSN 907 MON 1971 71 0.441 NA WSN 908 MON 1972 70 0.449 NA WSN 909 MON 1973 79 0.488 NA WSN 910 MON 1974 79 0.491 NA WSN
One of the most professional parts of the FiveThirtyEight charts is that they use the team's actual colors. And for good reason, MLB brands are a valuable asset. The New York Yankees' brand was estimated at a value of \$398 million in 2012 . Unfortunately ggplot doesn't have a baked in MLB color mapping so I had to roll up my sleeves and do this the old fashioned way: using an Excel file.
Good news is that there's already a website that has the corresponding hex color codes for most pro sports teams. From there it's just a matter of grabbing the right ones and plopping them into a file that maps each team name to their primary color.
Ok hard stuff done (seriously, the longest part about writing this post was creating the team-colors.csv mapping file). In the Kansas City Royals chart, the author states that the x-axis is the "standard deviations from average salary". I took this to mean that the x-axis is the number of standard deviations a team's salary is for a given year when compared to the rest of the league's salaries for that year. A bit of a mouthful, but it's a decent way to quantify how abnormally high or low a team's salary is in a given year while also accounting for things like inflation, growth of professional sports in the USA, and owners' willingness to spend.
So to come up with this "standardized salary" number we need the following:
- mean payroll for each year
- payroll standard deviation for each year
Once we have these numbers, we can compare them to the payrolls for each team on a yearly basis. From there we calculate how many standard deviations away from the mean each team's payroll is.
So for example, in 2014 the Oakland A's payroll was \$89,160,900 and the league average payroll was \$115,652,839 with a standard deviation of \$47,884,348. In this circumstance, the A's standardized payroll would be:
($89,160,900 - $115,652,839) / $47,884,348 = -0.55
Great! This means the A's were well below average payroll when compared to the rest of the league. Just for a visual, here's what the rest of the teams come out to in the 2014 season.
Calculating this for every year is relatively simple. We'll use
plyr to split
our data by each year, calculate the mean and standard deviation, and then merge
it back into the original dataset to compute the
for each team/year.
Calculating the curves
I'm not 100% sure about what techniques the FiveThirtyEight authors are using for fitting curves, but I have a good enough understanding that I can make a decent approximation (read that as, my curves may look slightly different for some teams).
It looks like the FiveThirtyEight folks are using a polynomial regression
model to estimate the smoothed curve. This is really easy to do with ggplot. You can
define "smoothing functions" within the
stat_smooth function. For a polynomial
regression with degree 2 (which I'm pretty sure is what's being used here), you use
the following formula:
y ~ poly(x, 2)
When combined with ggplot, your function looks something like this:
df.mets <- subset(df, tm=="NYM") p <- ggplot(subset(df, tm=="NYM"), aes(x=standardized_payroll, y=wins_losses)) + geom_point(alpha=0.75, size=3) + stat_smooth(size=1.5, method="lm", formula = y ~ poly(x, 2), se=FALSE) + scale_x_continuous(name="Standardized Salary\n(# of standard deviations from yearly mean)", breaks=c(-2, 0, 2), limit=c(-2.5, 2.5), labels=c("-2", "0", "+2")) + scale_y_continuous(name="Win/Loss %", breaks=seq(0.3, 0.7, 0.1), limit=c(0.2, 0.8)) + ggtitle("New York Mets\nWin Percentage vs. stgandard deviations from average salary") print(p)
We can overlay the equivalent fitted curve, but for the entire league by using the following code:
p + stat_smooth(data=df, color="grey", size=.5, method="lm", formula = y ~ poly(x, 2), se=FALSE)
The polynomial regression method is a decent enough way to approach things but what's a little concerning is that the quality of the fits aren't indicated.
For example, take a look at the New York Mets' chart. The (R^2) value for that curve is 0.17. That's not very good. So while the curves might make these charts a little easier to look at, don't take them as fact.
Styling the plots
ggthemes is particularly awesome. It
can make your ggplots look like charts in Excel, Wall Street Journal, or even
Tableau. As luck would have it, there's even a theme for FiveThirtyEight--
This means that our life just got a lot easier!
We're going to build subplots for each MLB division. So each of the 6 divisions will have their own .png file at the end (this makes it easier to read).
To do this we're going to use the same code as before, except we'll loop through a list of divisions and filter each plot so that it only plots data for the relevant division.
Combine each division's image plot using something like Preview on OSX or Paint
on Windows. If you're a command-line aficionado, you can use the
function provided by Imagemagick.
The Final Product
Since there's a lot of code and data, I created a repository on GitHub if you want to download everything at once.
FiveThirtyEight did a great job with their post and I think it's encouraging to know that it's not impossible to produce the same high quality content on your own (an Adobe Illustrator license wouldn't hurt either). In the future, I'd love to see more statistical accountability in posts like these--Note that the data from some posts is published on the FiveThirtyEight's GitHub, unfortunately this wasn't one of them.
If you liked this post, here are some other resources you might find helpful or interesting:
- Baseball-Reference.com: MLB Stats, Standings, Scores, and History
- Team Colors: Find and copy color values from your favorite sports teams.
- Original FiveThirtyEight post
- Another other fun ggplot themes
- Combining images progromatically on OSX
- USA Today MLB Salary Database
 Statistica: http://www.statista.com/graphic/1/254950/major-league-baseball-teams-ranked-by-brand-value.jpg