# Fast summary statistics in R with data.table

#### by Jeff |

Calculating summary measures (e.g. sums, counts, means, min, max, percentiles, etc.) across groups of data is one of the most common tasks in data analysis. Unless you're among the poor souls stuck with Hadoop, the right tool for the job could be a SQL GROUP BY, an Excel Pivot Table, or, if you're like me, a scripting language!

This is a post about `data.table`

, a popular package for summarizing data in `R`

.

#### Installation

Like most `R`

packages, installation is a snap. Just use `install.packages`

.

`install.packages("data.table")`

### How to make a data.table?

Ok so you've got `data.table`

installed. What can you do with it? Well for starters, you should probably know how to create a `data.table`

. I usually make a `data.table`

from a `data.frame`

, but you can also create a `data.table`

from vectors. It's pretty much the same things as making a `data.frame`

.

library(data.table) # Making a data.frame from vectors df.data <- data.frame(a=1:5,b=1:5) # Making a data.table from vectors dt.data <- data.table(a=1:5,b=1:5) # Making a data.table from a data.frame dt.data <- data.table(df.data) dt.data # a b #1 1 1 #2 2 2 #3 3 3 #4 4 4 #5 5 5

### Usage & Examples

#### Data Set

First step is to pull in some data. For this example, we'll be using two data sets from Basketball Reference: boxscore.csv and team_game_count.csv. If you want to throw it into `MySQL`

, you can use this loadscript. The data comes from boxscores from the 2004-2013 NBA season. For simplicity, I'm only going to use the fields: `date`

, `team`

, `player`

, `minutes`

, `field goals made (fgm)`

, `field goals attempted (fga)`

, and `points`

. Note that I've converted the minutes field to decimal form, so 1.5 minutes is 1 minute and 30 seconds. Since I threw my data into `MySQL`

, I used `RMySQL`

to pull it out.

library(RMySQL) con <- dbConnect(drv="MySQL", dbname="basketball", user="{USERNAME}", password="{PASSWORD}") strQuery <- "SELECT date, team, player, minutes, fgm, fga, points FROM basketball.boxscore" df.boxscore <- dbGetQuery(con, strQuery) df.boxscore$date <- as.Date(df.boxscore$date, '%Y-%m-%d') head(df.boxscore) #date team player minutes fgm fga points #1 2012-10-30 WAS A.J. Price 29.40 2 13 7 #2 2012-10-30 WAS Emeka Okafor 24.58 4 10 10 #3 2012-10-30 WAS Trevor Ariza 24.58 3 8 9 #4 2012-10-30 WAS Bradley Beal 21.55 2 8 8 #5 2012-10-30 WAS Trevor Booker 16.75 2 9 4 #6 2012-10-30 WAS Jordan Crawford 28.70 4 13 11

One other data point I wanted to get was the number of games a team had played up to a certain point in the season. Since my boxscore data has team and date, this had all of the data I needed.

strQuery <- "SELECT * FROM basketball.team_game_count" df.games <- dbGetQuery(con, strQuery) df.games$date <- as.Date(df.games$date, '%Y-%m-%d')

`df.games`

is a `data.frame`

of every NBA game and tells me how many games a team has played up to a certain point in the season. `game_num`

will be 1 for a team's first game of the season and 82 for its last game (assuming the team plays 82 games in a season).

head(df.games) season date team game_num 1 2005-2006 2005-11-01 DAL 1 2 2005-2006 2005-11-01 DEN 1 3 2005-2006 2005-11-01 MIL 1 4 2005-2006 2005-11-01 NOK 1 5 2005-2006 2005-11-01 PHI 1 6 2005-2006 2005-11-01 PHO 1

#### Comparing plyr vs. data.table

Now I have my data, so let's dig into it a bit and show some examples of where `data.tables`

can replace the `plyr`

package. First, let's get team-level stats instead of player-level stats. Easy, I can just group df.boxscore by date + team and sum the fgm, fga, and points variables. This is easy to accomplish using `ddply`

from the `plyr`

package, but it's also easy to do using `data.tables`

.

Our first file has player-specific stats like the number of points made by per game per player.

Let's compute team-level stats; first we'll use `ddply`

.

library(plyr) system.time( df.statsTeam <- ddply(df.boxscore, ~ date + team, summarize, points=sum(points), fgm=sum(fgm), fga=sum(fga) ) ) user system elapsed 29.990 18.164 48.174

Ugh, that was slow. This operation is equally simple to perform w/ `data.table`

.

Create a `data.table`

by passing a `data.frame`

to the `data.table`

function.

dt.boxscore <- data.table(df.boxscore)

Then I set its key -- this is something I haven't mentioned yet, but you can find more information about keys on the data.table intro page. As you might expect, the key sorts the data.table so that R can access the data more efficiently. I'll set the key to the variables I want to group by -- `date`

and `team`

.

setkey(dt.boxscore, date, team)

```
system.time(
dt.statsTeam <- dt.boxscore[,list(points=sum(points),
fgm=sum(fgm),
fga=sum(fga)),
by=list(date,team)
]
)
# user system elapsed
# 0.095 0.003 0.098
```

Wow, that was fast! `data.table`

took 0.098 seconds while ddply took 48.2 seconds. But is the output the same? I'll have to convert `dt.statsTeam`

to a `data.frame`

in order to compare it to df.statsTeam using the identical() function. That is easily accomplished by calling `data.frame( {the data table} )`

.

identical(df.statsTeam, data.frame(dt.statsTeam)) # [1] TRUE

Yep, they're the same! Ok great, so we can use a `data.table`

as a near drop in replacement for `ddply`

.

### A Basic Merge

Say I only want to find the boxscores from the Houston Rockets' last game of the 2012/2013 regular season. Our first step is to just subset our dataset so that we're only looking at Rockets' data. Let's see what that would look like in regular `R`

.

df.games.houston <- df.games[df.games$team == 'HOU' & df.games$game_num == 82 & df.games$season == '2012-2013',] df.games.houston # season date team game_num #19179 2012-2013 2013-04-17 HOU 82 ## Merge of the two data frames system.time( df.merge <- merge(df.boxscore, df.games.houston, by=c('date','team')) ) # user system elapsed # 3.392 0.201 3.664

Now to merge on `data.tables`

we can either use the merge() function on the `data.tables`

or we can use a unique data.table function.

## First let's just try merging on the data tables. dt.games.houston <- data.table(df.games.houston); dt.boxscore <- data.table(df.boxscore) system.time( dt.merge <- merge(dt.boxscore, dt.games.houston, by=c('date','team')) ) # user system elapsed # 0.092 0.013 0.106

### Even Faster Merges

Using `merge`

was pretty fast, but let's try setting keys on our data before we merge.

setkey(dt.games.houston, date, team) setkey(dt.boxscore, date, team) system.time( dt.merge <- merge(dt.boxscore, dt.games.houston, by=c('date','team')) ) # user system elapsed # 0.018 0.002 0.022

Even faster! Finally, I'll use the internal data.table function for merging. It's a bit funky, but you can do the same as an inner, left, or right join. Since we've already set the keys for both data tables, it knows what fields we want to merge on (date and team). Here is an inner join.

system.time( dt.merge <- dt.boxscore[dt.games.houston] ) # user system elapsed # 0.008 0.001 0.011

### Some Summary Statistics

Okay, now that I've shown you some of the great things about data tables, let's calculate some interesting stats with the data. I'll be using my original `data.table`

here called `dt.boxscore`

.

I'll normalize the points variable by calculating points per 48 minutes (the number of minutes in an NBA game) and then calculate some summary statistics for the 2012-2013 season for each player -- data tables will be useful here.

dt.boxscore$pp48 <- dt.boxscore$points/dt.boxscore$minutes*48 dt.boxscore.subset <- dt.boxscore[dt.boxscore$date >= '2012-09-01',] dt.statsPlayer <- dt.boxscore.subset[,list(minutes=sum(minutes), mean=mean(pp48), min=min(pp48), lower=quantile(pp48, .25, na.rm=TRUE), middle=quantile(pp48, .50, na.rm=TRUE), upper=quantile(pp48, .75, na.rm=TRUE), max=max(pp48)), by='player'] dt.statsPlayer player minutes mean min lower middle upper max 1: Daequan Cook 476.37 9.810285 0.00000 0.00000 0.00000 18.11817 66.97674 2: Robert Sacre 207.72 5.735073 0.00000 0.00000 0.00000 10.68816 38.01980 3: Jared Jeffries 349.84 5.619368 0.00000 0.00000 0.00000 11.64374 23.70370 4: Jared Cunningham 25.70 20.313846 0.00000 0.00000 0.00000 39.60131 75.39267 5: Kevin Murphy 52.39 7.397563 0.00000 0.00000 0.00000 0.00000 42.29075 --- 466: James Harden 3227.72 32.310256 12.57485 26.12344 32.28516 38.15972 54.24410 467: LeBron James 3837.39 33.111033 16.88159 28.30721 33.19206 37.20936 53.63128 468: Kevin Durant 3603.38 34.830615 18.31624 30.20330 34.91109 38.75855 51.85185 469: Kobe Bryant 3012.59 33.665142 0.00000 25.67263 36.18276 40.76094 53.66972 470: Carmelo Anthony 2962.78 36.730475 10.66140 30.92032 36.38086 43.00658 62.82723

Let's only look at players who played over 1,000 minutes in the season (no special reason for 1,000 minutes...it just seemed like a good number to me). And then let's order by median (field is called `middle`

) descending and show the top 10 players. If you follow basketball at all, the top 10 players here make sense.

dt.statsPlayer.1000 <- dt.statsPlayer[dt.statsPlayer$minutes >= 1000, ] dt.top10 <- head(dt.statsPlayer.1000[order(-dt.statsPlayer.1000$middle),], 10) dt.top10 player minutes mean min lower middle upper max 1: Carmelo Anthony 2962.78 36.73048 10.661402 30.92032 36.38086 43.00658 62.82723 2: Kobe Bryant 3012.59 33.66514 0.000000 25.67263 36.18276 40.76094 53.66972 3: Kevin Durant 3603.38 34.83062 18.316241 30.20330 34.91109 38.75855 51.85185 4: LeBron James 3837.39 33.11103 16.881594 28.30721 33.19206 37.20936 53.63128 5: James Harden 3227.72 32.31026 12.574850 26.12344 32.28516 38.15972 54.24410 6: Russell Westbrook 2928.55 32.12473 13.445378 26.45569 32.22371 36.74335 55.01433 7: Kyrie Irving 2047.88 30.70178 9.076584 22.59503 30.85990 38.09608 51.90094 8: Tony Parker 2938.78 28.76046 3.772102 23.31282 29.57983 34.06920 53.08598 9: Brook Lopez 2516.70 30.51248 16.520076 25.24544 28.80000 36.10256 53.43570 10: Stephen Curry 3479.12 28.29065 7.902535 22.89263 28.04535 34.24703 54.00000

### Other resources

Hopefully this gives you a brief overview of how great `data.tables`

are. I'll follow-up with another post that provides some more interesting NBA analysis. Let us know if there is any specific analysis you would be interested in seeing! If you're interested in reading more `data.tables`

, check out these resources:

- Introduction to the data.table package in R by Matthew Dowle
- Elegant & fast data manipulation with data.table by Carl Boettiger
- Data.table rocks! Data manipulation the fast way in R by Markus Gesmann
- Transforming subsets of data in R with by, ddply and data.table by Markus Gesmann
- Comparison of ave, ddply and data.table by Tal Galili
- Analyzing Subsets of Data in R by Charlie Gibbons