The Yhat Blog


machine learning, data science, engineering


Scraping and Analyzing Baseball Data with R

by Greg


We get a lot of emails from people who are interested in analyzing sports data. The usual suspects are Moneyball types: sabermetrics enthusiasts with a love of baseball and a penchant for R. Luckily for us, baseball data is very accessible. MLB even goes as far as making low-level details on every pitch publicly available.

In this post, I'm going to show you how you can scrape your own baseball data in R and then use it to determine how important winning is for teams' game attendance.

The Data

The data we're using comes from baseball-reference.com. Baseball Reference has data going all the way back to the 1800s for just about every baseball-related stat you can think of. What's also great (especially for this exercise) is that the data is all in tabular format, so it's easy to translate into a data.frame.

We're going to be scraping attendance and standings data in an attempt to determine how much winning impacts attendance.

Scraping with R

Scraping is really easy with R. The simplest and most effective package I've used is XML. Just pass the XML::readHTMLTable function a URL and it will download the page and return any tables it finds.
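Here's a minimal sketch of that call. To keep it runnable offline, it parses a small inline HTML string instead of a live page; the table contents are made-up placeholders, not real attendance figures. With a live page you would pass the URL string to readHTMLTable in place of the parsed document. One caveat, stated as an assumption: the XML package's built-in downloader may not handle HTTPS pages, in which case you would fetch the page text first and parse it as shown here.

```r
library(XML)

# Placeholder HTML standing in for a downloaded page.
# The team codes and numbers are made up for illustration only.
html <- '
<html><body>
<table id="example">
  <thead><tr><th>Tm</th><th>Attendance</th></tr></thead>
  <tbody>
    <tr><td>AAA</td><td>1,234,567</td></tr>
    <tr><td>BBB</td><td>2,345,678</td></tr>
  </tbody>
</table>
</body></html>'

# Parse the text into an HTML document
doc <- htmlParse(html, asText = TRUE)

# readHTMLTable returns a list with one data.frame per <table> it finds
tables <- readHTMLTable(doc, stringsAsFactors = FALSE)

# Select the table we want from the list
attendance <- tables[[1]]
print(attendance)
```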

Wow that was really easy! All we need to do is select the right table from the list returned and we're good to go.

Grabbing multiple years

It would be pretty tedious if we had to run this on each year individually. Luckily for us the URL structure is very consistent (and obvious). I'm going to write a function that parameterizes the year.
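A sketch of what such a function could look like. The URL pattern and the names attendance_url and fetch_attendance are assumptions for illustration, as is the choice of the first table on the page; check the site for its current structure before relying on either.

```r
# Build the page URL for one season.
# The URL pattern is an assumption, not confirmed against the live site.
attendance_url <- function(year) {
  paste0("http://www.baseball-reference.com/leagues/MLB/", year, "-misc.shtml")
}

# Download one season and tag every row with its year.
# Assumes the table of interest is the first one on the page.
fetch_attendance <- function(year) {
  tables <- XML::readHTMLTable(attendance_url(year), stringsAsFactors = FALSE)
  season <- tables[[1]]
  season$year <- year
  season
}

# Only builds the URL string; nothing is downloaded until
# fetch_attendance() is called.
attendance_url(2010)
```

Adding the year as a column inside the function matters later: it is what lets the seasons be stacked together and joined to other tables by team and year.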

Bam! Now here's the extremely satisfying and fun part. We're going to use ldply from the plyr package to download data from 1950 to 2010 and aggregate it all together.
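A rough sketch of the ldply call. So that it runs without hitting the network, it uses a stand-in function (fetch_season_stub, a hypothetical name) that returns one placeholder row per year; in practice you would pass your real per-year scraping function instead.

```r
library(plyr)

# Stand-in for the real per-year scraper so the example runs offline.
# It returns a single placeholder row per year; these are not real data.
fetch_season_stub <- function(year) {
  data.frame(tm = "AAA", attendance = NA_real_, year = year,
             stringsAsFactors = FALSE)
}

# ldply calls the function once per year and row-binds the results
# into a single data.frame, printing a text progress bar as it goes.
all_years <- ldply(1950:2010, fetch_season_stub, .progress = "text")

nrow(all_years)
```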

Check out that progress bar! Pretty gratifying, right?

Casting columns

One thing that is a little annoying is that some numerical columns contain punctuation marks like "," and "$". We'll need to clean these up a little so that R will recognize them as numbers. To do this, I'll write a quick function called make_numeric that will do exactly that.
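A minimal version of make_numeric might look like this. It only handles the two characters mentioned above; the sample inputs are made up.

```r
# Strip "," and "$" from a column, then convert it to numeric.
make_numeric <- function(x) {
  x <- as.character(x)              # also handles factor columns
  as.numeric(gsub("[,$]", "", x))   # inside [], "$" is a literal character
}

make_numeric(c("1,234,567", "$89.50"))
```

You would then apply it column by column, for example attendance$attendance <- make_numeric(attendance$attendance), assuming your data frame and column are named that way.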

Analyzing the data

OK, things are shaping up here. I did the same scraping exercise for standings data as well (code omitted here but can be found on GitHub), so I've got two data frames that are primed and ready to rip.

The first thing I'll do is merge them together. That way I can have wins, losses, and attendance all in the same place. This is like doing a join in SQL. I'm going to be merging (or joining) using the team name and the year.

df <- merge(standings, attendance, by=c("tm", "year"))

Great! Now let's compare wins and attendance.

ggplot(df, aes(x=w, y=attendance)) + geom_point()

It looks like there might be something there. Instead of looking at total attendance, let's look at average attendance per game. This accounts for seasons that predate the 162-game schedule.

Still looks OK. What about trying to fit a basic linear model?
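For reference, this is the general shape of such a fit. The data here is synthetic, generated only so the example runs on its own; it is not the scraped data set, and its coefficients mean nothing. The column names w and attend_per_game follow the ones used elsewhere in this post.

```r
# Synthetic data for illustration only (not the scraped data set)
set.seed(42)
toy <- data.frame(w = sample(50:100, 30, replace = TRUE))
toy$attend_per_game <- 5000 + 300 * toy$w + rnorm(30, sd = 2000)

# Regress average attendance per game on wins
fit <- lm(attend_per_game ~ w, data = toy)

# The coefficient on w is the estimated change in fans per game per extra win
coef(fit)["w"]
summary(fit)
```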

The simple model tells us that one win is worth an additional 297.9 fans per game. My gut tells me this seems reasonable. For example, this would say that the 57-win Pittsburgh Pirates would draw 11,916 fewer fans per game than the 97-win Philadelphia Phillies.
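The arithmetic behind that comparison, using only the numbers quoted above:

```r
fans_per_win <- 297.9   # slope from the simple model
win_gap <- 97 - 57      # Phillies wins minus Pirates wins

# Predicted difference in fans per game between the two teams
attendance_gap <- fans_per_win * win_gap
attendance_gap
```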

But this brings up an interesting point: shouldn't we be controlling for teams as well, especially when you consider that each stadium can accommodate a different number of fans? As a quick and dirty way to investigate this, I added team as a variable in the model. You can also see I added 0, which just means that the regression won't have an intercept. This might seem weird, but it's actually handy. Each team will now be assigned its own coefficient, which we can use as a sort of offset for stadium capacity.

summary(lm(attend_per_game ~ 0 + w + tm, data=df))
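To see what dropping the intercept does, here is a small sketch on synthetic data (three made-up teams with made-up baselines, not the scraped data). With 0 in the formula, R gives every level of tm its own coefficient instead of folding one team into the intercept.

```r
# Synthetic data for illustration only; team codes and numbers are made up
set.seed(1)
toy <- data.frame(
  tm = rep(c("AAA", "BBB", "CCC"), each = 10),
  w  = sample(50:100, 30, replace = TRUE),
  stringsAsFactors = FALSE
)
baseline <- c(AAA = 1000, BBB = 6000, CCC = 3000)
toy$attend_per_game <- unname(baseline[toy$tm]) + 270 * toy$w +
  rnorm(30, sd = 1500)

# Same formula shape as above: no intercept, wins plus a team term
fit <- lm(attend_per_game ~ 0 + w + tm, data = toy)

# One coefficient for w and one per team; no "(Intercept)" entry
names(coef(fit))
```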

You can see that our estimate for how much a win is worth doesn't change much: 267.7. In addition, you can see that the team coefficients serve as adjustments for how many people each team will draw. For example, the Houston Astros have a coefficient of 1221.5, which means that the expected attendance per game for the Astros is

Attendance = 1221.5 + (# of wins)*267.7

What this means is if the Astros have a good year and win 100 games, they will average about 27,992 fans a game. And if they have a terrible season and win 50 games, they'll average about 14,606 fans per game. Sadly the latter scenario has been the case for the past few years (though I think they've finally turned the corner).

Let's take the 2014 season as an example. This year the Astros won 70 games and averaged 21,627 fans per game. Our model predicted 19,961 fans per game. Not too shabby!
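These predictions can be reproduced directly from the two coefficients quoted above. The function name predict_attendance is just a label for this sketch.

```r
# Coefficients quoted in the text
astros_offset <- 1221.5   # Astros team coefficient
fans_per_win  <- 267.7    # coefficient on wins

# Expected fans per game for a given number of wins
predict_attendance <- function(wins) astros_offset + wins * fans_per_win

# Good year, terrible year, and the 2014 season
predictions <- predict_attendance(c(100, 50, 70))
predictions

# 2014: reported average minus the model's prediction
actual_2014 <- 21627
miss_2014 <- actual_2014 - predict_attendance(70)
miss_2014
```

The model comes in a bit under 1,700 fans per game low for 2014.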

Final Thoughts

Unfortunately our data didn't come with the stadium name. It would have been nice to know when teams move to stadiums with different capacities. But maybe that's a topic for a future post!


