Grouping data is an integral part of many data analysis projects. The functionality for grouping in pandas is vast, but can be tough to grasp initially. Have no fear...we will get through a short introduction together using some data from NYC's beloved bike share program, Citi Bike.
Why would I want to group data?
Most of the time, you have a set of data that lends itself to being categorized or grouped. As a general example, let's say we have data on a wide variety of people. We may perform an analysis where we compare groups in the data based on age, gender, birth month, shoe size, or birth city; the options are as numerous as the data points!
groupby functionality draws from the
Split-Apply-Combine method as described by Hadley Wickham from the land of R. It's a great approach to solving data analysis problems, and his paper on the subject is worth a read (it's linked in the resources section).
To summarize, he states that a common methodology for analyzing data comes from splitting the data into categories or groups based on some criteria, applying some aggregation function to each group (sum, mean, count), then combining the results for analysis, visualization or other means of better understanding.
Here's a graphic I came across illustrating the process:
Sounds handy, but how do I do it in pandas?
Codes and Stuff
Just so we're on the same page, I’m running pandas
I was looking around for an intriguing dataset and came across this data from Citi Bike, which is the NYC bike share program. It's pretty medium data at ~250MB CSV for one month's worth of data, and there was potential for some compelling findings with data grouping. Let's start down the rabbit hole...
df = pd.read_csv('data.csv') df.head()
|tripduration||starttime||stoptime||start station id||start station name||start station latitude||start station longitude||end station id||end station name||end station latitude||end station longitude||bikeid||usertype||birth year||gender|
|0||538||5/1/2016 00:00:03||5/1/2016 00:09:02||536||1 Ave & E 30 St||40.741444||-73.975361||497||E 17 St & Broadway||40.737050||-73.990093||23097||Subscriber||1986.0||2|
|1||224||5/1/2016 00:00:04||5/1/2016 00:03:49||361||Allen St & Hester St||40.716059||-73.991908||340||Madison St & Clinton St||40.712690||-73.987763||23631||Subscriber||1977.0||1|
|2||328||5/1/2016 00:00:14||5/1/2016 00:05:43||301||E 2 St & Avenue B||40.722174||-73.983688||311||Norfolk St & Broome St||40.717227||-73.988021||23049||Subscriber||1980.0||1|
|3||1196||5/1/2016 00:00:20||5/1/2016 00:20:17||3141||1 Ave & E 68 St||40.765005||-73.958185||237||E 11 St & 2 Ave||40.730473||-73.986724||19019||Customer||NaN||0|
|4||753||5/1/2016 00:00:26||5/1/2016 00:13:00||492||W 33 St & 7 Ave||40.750200||-73.990931||228||E 48 St & 3 Ave||40.754601||-73.971879||16437||Subscriber||1981.0||1|
It looks like there is a good opportunity to break the data down into groups to look for some interesting trends. Some ideas are:
- Group on the gender column and see if there are more male or female riders.
- Do specific stations get used more than others? We can group on the station start or finish id.
- Group the data on the day of the week, to see if there is more utilization for a particular day, on average.
How about a few examples?
If we want to group by just the gender, then we pass this key (column name) to the
groupby function as the sole argument. This example is the simplest form of grouping, so please check out the docs to get all the options!
groupedGender = df.groupby('gender') print groupedGender <pandas.core.groupby.DataFrameGroupBy object at 0x1133854d0>
The output shows that
groupby returns a pandas
DataFrameGroupBy object. Pandas has just made some internal calculations about the new gender groups and is ready to apply some operation on each of these groups.
We can take a look at the available methods with the docstring/tab complete functionality of Rodeo!
Counts of groups
Getting back to the data, if we use the
count method, we can see the total number of entries for each gender group. For reference, here's what the website says for the gender codes - "Gender (Zero=unknown; 1=male; 2=female)"
groupedGender.size() gender 0 178710 1 783723 2 249847 # look at the size as a percentage of the whole (using the trip) total = df.gender.count() groupedGender.size() / total * 100 gender 0 14.741644 1 64.648679 2 20.609678
It looks like males make up the majority of Citi Bike riders (~65%). I was pretty surprised to see that male riders outnumbered female riders 3 to 1. I wonder if that's true of commuters in general, or if there's some other factor, like females tending to own their own bikes. A question for another post...
Mean and Std Dev of groups
We can use a single column from the DataFrameGroupBy object and apply some aggregation function on it - how about the mean and standard deviation of the trip durations for all three groups?
groupedGender['tripduration'].mean() / 60. gender 0 35.923658 1 13.778720 2 16.198230 # Don't have to use the bracket notation groupedGender.tripduration.std() / 60. gender 0 193.417686 1 94.884313 2 91.675397
Although males make up the majority of Citi Bike riders, there's not much of a difference in their trip durations. Interestingly, gender unknown riders take 2x as long of rides on average. These riders are likely single-use customers (when you purchase a one time pass at a Citi Bike kiosk you are not asked for your gender).
More summary statistics
So there are some summary statistics for these groups (as an aside, you can use the
describe function to get these statistics and more in one call). That's a whole lot of spread around the mean, which probably means there are some outliers in the data (maybe people that kept the bike for days). Just a brief look at this even though it's outside the scope, because I'm sure you were all interested 😊
df[df.tripduration > 10000].tripduration.count() 5110
Our suspicions are confirmed - there are many bike rentals outside 2:45 even though the "max" is supposed to be 30 minutes (or 45 if you're a Citi Bike member).
Brief aside / public service announcement, it costs up to $1200 to replace a Citi Bike. Don't be an outlier!
A quick (gg)plot
Okay, back on track...the plot, just because we can (using ggplot of course):
df_short = df[df.tripduration < 10000] df_short.tripduration = df_short.tripduration / 60. ggplot(df_short, aes(x='tripduration')) + geom_histogram(bins=30) + xlab("Trip Duration (mins)") + ylab("Count")
Grouping by station name
One last example is looking at which are the five favorite start and end stations. We'll group the data based on the start and end station names, apply the count function, and sort the values is descending order. Here's the code for that:
groupedStart = df.groupby('start station name') groupedStart['start station name'].count().sort_values(ascending=False)[:5] start station name Pershing Square North 12775 West St & Chambers St 10128 Lafayette St & E 8 St 9246 W 21 St & 6 Ave 9220 E 17 St & Broadway 9036 groupedEnd = df.groupby('end station name') groupedEnd['end station name'].count().sort_values(ascending=False)[:5] end station name Pershing Square North 12511 West St & Chambers St 10189 Lafayette St & E 8 St 9459 E 17 St & Broadway 9273 W 21 St & 6 Ave 9268
Hopefully, the above examples helped introduce some basic uses for the grouping process in pandas to help enhance your analysis and whet your appetite for more! What ideas do you have for further analysis on this dataset? Can you conquer the last idea of looking at the days of the week? Please, take a look at the resources linked below for further investigation!