The Yhat Blog


machine learning, data science, engineering


Grouping in pandas

by Ross |


Intro

Grouping data is an integral part of many data analysis projects. The functionality for grouping in pandas is vast, but can be tough to grasp initially. Have no fear...we will get through a short introduction together using some data from NYC's beloved bike share program, Citi Bike.

Why would I want to group data?

Most of the time, you have a set of data that lends itself to being categorized or grouped. As a general example, let's say we have data on a wide variety of people. We may perform an analysis where we compare groups in the data based on age, gender, birth month, shoe size, or birth city; the options are as numerous as the data points!

The pandas groupby functionality draws from the Split-Apply-Combine method as described by Hadley Wickham from the land of R. It's a great approach to solving data analysis problems, and his paper on the subject is worth a read (it's linked in the resources section).

To summarize, he states that a common methodology for analyzing data comes from splitting the data into categories or groups based on some criteria, applying some aggregation function to each group (sum, mean, count), then combining the results for analysis, visualization or other means of better understanding.

Here's a graphic I came across illustrating the process:

From Data Analysis in Python by Wes McKinney

Sounds handy, but how do I do it in pandas?

Codes and Stuff

Just so we're on the same page, I’m running pandas 0.18.1.

I was looking around for an intriguing dataset and came across this data from Citi Bike, which is the NYC bike share program. It's pretty medium data at ~250MB CSV for one month's worth of data, and there was potential for some compelling findings with data grouping. Let's start down the rabbit hole...

df = pd.read_csv('data.csv')
df.head()
tripduration starttime stoptime start station id start station name start station latitude start station longitude end station id end station name end station latitude end station longitude bikeid usertype birth year gender
0 538 5/1/2016 00:00:03 5/1/2016 00:09:02 536 1 Ave & E 30 St 40.741444 -73.975361 497 E 17 St & Broadway 40.737050 -73.990093 23097 Subscriber 1986.0 2
1 224 5/1/2016 00:00:04 5/1/2016 00:03:49 361 Allen St & Hester St 40.716059 -73.991908 340 Madison St & Clinton St 40.712690 -73.987763 23631 Subscriber 1977.0 1
2 328 5/1/2016 00:00:14 5/1/2016 00:05:43 301 E 2 St & Avenue B 40.722174 -73.983688 311 Norfolk St & Broome St 40.717227 -73.988021 23049 Subscriber 1980.0 1
3 1196 5/1/2016 00:00:20 5/1/2016 00:20:17 3141 1 Ave & E 68 St 40.765005 -73.958185 237 E 11 St & 2 Ave 40.730473 -73.986724 19019 Customer NaN 0
4 753 5/1/2016 00:00:26 5/1/2016 00:13:00 492 W 33 St & 7 Ave 40.750200 -73.990931 228 E 48 St & 3 Ave 40.754601 -73.971879 16437 Subscriber 1981.0 1

It looks like there is a good opportunity to break the data down into groups to look for some interesting trends. Some ideas are:

  • Group on the gender column and see if there are more male or female riders.
  • Do specific stations get used more than others? We can group on the station start or finish id.
  • Group the data on the day of the week, to see if there is more utilization for a particular day, on average.

How about a few examples?

If we want to group by just the gender, then we pass this key (column name) to the groupby function as the sole argument. This example is the simplest form of grouping, so please check out the docs to get all the options!

groupedGender = df.groupby('gender')
print groupedGender
<pandas.core.groupby.DataFrameGroupBy object at 0x1133854d0>

The output shows that groupby returns a pandas DataFrameGroupBy object. Pandas has just made some internal calculations about the new gender groups and is ready to apply some operation on each of these groups.

We can take a look at the available methods with the docstring/tab complete functionality of Rodeo!

Counts of groups

Getting back to the data, if we use the count method, we can see the total number of entries for each gender group. For reference, here's what the website says for the gender codes - "Gender (Zero=unknown; 1=male; 2=female)"

groupedGender.size()
gender
0    178710
1    783723
2    249847

# look at the size as a percentage of the whole (using the trip)
total = df.gender.count()
groupedGender.size() / total * 100
gender
0    14.741644
1    64.648679
2    20.609678

It looks like males make up the majority of Citi Bike riders (~65%). I was pretty surprised to see that male riders outnumbered female riders 3 to 1. I wonder if that's true of commuters in general, or if there's some other factor, like females tending to own their own bikes. A question for another post...

Mean and Std Dev of groups

We can use a single column from the DataFrameGroupBy object and apply some aggregation function on it - how about the mean and standard deviation of the trip durations for all three groups?

groupedGender['tripduration'].mean() / 60.
gender
0    35.923658
1    13.778720
2    16.198230

# Don't have to use the bracket notation
groupedGender.tripduration.std() / 60.
gender
0    193.417686
1     94.884313
2     91.675397

Although males make up the majority of Citi Bike riders, there's not much of a difference in their trip durations. Interestingly, gender unknown riders take 2x as long of rides on average. These riders are likely single-use customers (when you purchase a one time pass at a Citi Bike kiosk you are not asked for your gender).

More summary statistics

So there are some summary statistics for these groups (as an aside, you can use the describe function to get these statistics and more in one call). That's a whole lot of spread around the mean, which probably means there are some outliers in the data (maybe people that kept the bike for days). Just a brief look at this even though it's outside the scope, because I'm sure you were all interested 😊

df[df.tripduration > 10000].tripduration.count()
5110

Our suspicions are confirmed - there are many bike rentals outside 2:45 even though the "max" is supposed to be 30 minutes (or 45 if you're a Citi Bike member).

Brief aside / public service announcement, it costs up to $1200 to replace a Citi Bike. Don't be an outlier!

A quick (gg)plot

Okay, back on track...the plot, just because we can (using ggplot of course):

df_short = df[df.tripduration < 10000]
df_short.tripduration = df_short.tripduration / 60.
ggplot(df_short, aes(x='tripduration')) + geom_histogram(bins=30) + xlab("Trip Duration (mins)") + ylab("Count")

Grouping by station name

One last example is looking at which are the five favorite start and end stations. We'll group the data based on the start and end station names, apply the count function, and sort the values is descending order. Here's the code for that:

groupedStart = df.groupby('start station name')
groupedStart['start station name'].count().sort_values(ascending=False)[:5]

start station name
Pershing Square North    12775
West St & Chambers St    10128
Lafayette St & E 8 St     9246
W 21 St & 6 Ave           9220
E 17 St & Broadway        9036

groupedEnd = df.groupby('end station name')
groupedEnd['end station name'].count().sort_values(ascending=False)[:5]

end station name
Pershing Square North    12511
West St & Chambers St    10189
Lafayette St & E 8 St     9459
E 17 St & Broadway        9273
W 21 St & 6 Ave           9268

Wrapping Up

Hopefully, the above examples helped introduce some basic uses for the grouping process in pandas to help enhance your analysis and whet your appetite for more! What ideas do you have for further analysis on this dataset? Can you conquer the last idea of looking at the days of the week? Please, take a look at the resources linked below for further investigation!

More Resources



Our Products


Rodeo: a native Python editor built for doing data science on your desktop.

Download it now!

ScienceOps: deploy predictive models in production applications without IT.

Learn More

Yhat (pronounced Y-hat) provides data science solutions that let data scientists deploy and integrate predictive models into applications without IT or custom coding.