The Yhat Blog

machine learning, data science, engineering

pandas & google analytics

by yhat

Google Analytics is a super important tool for understanding web traffic. It's great for supplementing and validating all sorts of analysis, but going to your web browser and logging into Google all the time makes it a pain to tie GA data into your analysis. Take basic conversion funnel analysis for example. You probably have access to users/signups, sales, and other conversion activities in your database, but the raw visitor data critical to understanding the top of the funnel is in Google Analytics.

pandas has a great interface on top of the Google Analytics API. It's pretty simple to get up and running, and it plays nicely with pandas io, aggregation, and time series features.

This is a post about the pandas Google Analytics module (pandas.io.ga) and the Google Analytics Data API.

API Access & Configuration

You need to (a) request access to the Google Analytics API, (b) create a client ID, and (c) download your client ID "secrets" file, which contains your API authentication details.

Request Access to the Google Analytics API

Visit the Google API Console. This will take you to your dashboard. In the left column, click on Services and make sure Google Analytics is set to "on" like so:

Create a client ID

Now click API Access in the left column of the API Console window. Click on the button that says "Create a client ID" (if you already have one, it may say "create another client ID").

Choose the "Installed Application" radio button. To be honest, I'm not really sure why you choose this, but it seems to work for me.

Now that you have a client ID, you need to click on "Download JSON" in the right column of the console UI to download your API credentials.

It'll ask you where you want to save the file. Ultimately, this file needs to be stored in the same directory as the pandas.io.ga module. You can try to navigate to that directory in the browse window or move the file there manually.

mv ~/Desktop/client_secrets.json /Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/io/
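If you're not sure where pandas lives on your machine, you can ask Python rather than guess the path. Here's a minimal sketch using only the standard library. Note that package_dir is a helper name made up for this example, not something pandas provides, and the result will differ from the path above depending on your install.

```python
import importlib.util
import os


def package_dir(package_name):
    """Return the directory of an importable package, or None if it can't be found.

    package_dir is a hypothetical helper for illustration only.
    """
    try:
        # For dotted names like "pandas.io", find_spec imports the parent package.
        spec = importlib.util.find_spec(package_name)
    except ImportError:
        return None
    if spec is None or spec.origin is None:
        return None
    return os.path.dirname(spec.origin)


# Prints the folder to move client_secrets.json into, or None if pandas is missing.
print(package_dir("pandas.io"))
```

Whatever directory this prints is the destination to use in the mv command.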


Google makes this part mostly painless, and with pandas things are even easier. You can read the formal specifications for authenticating with the Google Analytics API in the docs. Or you can skip to the fun part and run a few lines of pandas-powered code.

import pandas.io.ga as ga
df = ga.read_ga(metrics, dimensions, start_date)

When you run this line, pandas will look in the module's directory for a file called "client_secrets.json". This is why it was important to save that file in exactly the right place. Your web browser should open and bring up a Google authentication screen, which will prompt you to authorize your new API client (i.e. your laptop).
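Before kicking off the browser flow, it can save some head-scratching to confirm the secrets file is where you think it is and looks sane. The sketch below is one way to do that with the standard library. check_client_secrets is a made-up helper name, and it assumes the "Installed Application" download keeps its credentials under a top-level "installed" key with client_id and client_secret fields, so adjust it if your file looks different.

```python
import json
import os


def check_client_secrets(path):
    """Return a list of problems found with a client_secrets.json file (empty if none).

    Hypothetical helper for illustration; not part of pandas or Google's libraries.
    """
    if not os.path.isfile(path):
        return ["file not found: " + path]
    try:
        with open(path) as fh:
            data = json.load(fh)
    except ValueError:
        return ["file is not valid JSON"]
    # Assumption: "Installed Application" credentials sit under an "installed" key.
    section = data.get("installed") if isinstance(data, dict) else None
    if not isinstance(section, dict):
        return ['missing top-level "installed" section']
    return ["missing field: " + key
            for key in ("client_id", "client_secret")
            if not section.get(key)]
```

Call it with the full path to your client_secrets.json. An empty list means the file exists and has the fields this sketch checks for; it does not prove Google will accept the credentials.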

If all goes according to plan, you should see something like this:

If you have trouble authenticating, you'll most likely see an error that looks like this:

HttpError 403 
when requesting returned "Access Not Configured"

To resolve this, you can pass your account details to ga.read_ga like so:

df = ga.read_ga(metrics, dimensions, start_date, account_id="12345678")

If you don't know your account_id, no sweat. You can get that information in a couple of ways.

How to find your Google Analytics account_id or web_property_id

Your account ID can be found in the Admin section of your Google Analytics account. It will look something like this:

The "Property ID" refers to your Web Property ID. The number in the middle of the web property ID is your account ID.

You can also find the same details by checking the JavaScript on your website. View source on a page which has the Google Analytics script installed and look for a line like this:

    ['_require', 'inpage_linkid', pluginUrl]
  , ['_setAccount', 'UA-34340646-1']
  , ['_setDomainName', '']
  , ['_setAllowLinker', true]
  , ['_trackPageview']

Again, the _setAccount portion refers to your web_property_id, and the number in the middle of this web property ID is your account_id (in this case 34340646).
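If you'd rather not eyeball it, pulling the account ID out of a web property ID is a one-liner. Here's a small sketch; account_id_from_property is a name invented for this example, and it assumes the ID follows the UA-<account>-<property number> pattern shown above.

```python
import re


def account_id_from_property(web_property_id):
    """Return the middle (account) portion of a UA-style web property ID."""
    match = re.match(r"^UA-(\d+)-\d+$", web_property_id)
    if match is None:
        raise ValueError("not a UA-style web property ID: %r" % web_property_id)
    return match.group(1)


print(account_id_from_property("UA-34340646-1"))  # prints 34340646
```

The function returns a string, which is also the form account_id takes in the read_ga call earlier in this post.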

If you're still having problems, you can dig into the ga module, where you'll find a few more functions for accessing other administrative account information.

ac = ga.GAnalytics()
profile = ac.get_profile(account_id="34340646", web_property_id="UA-34340646-1")
ac.get_account(name="Example", id="34340646")
{u'childLink': {u'href': u'',
  u'type': u'analytics#webproperties'},
 u'created': u'2012-12-18T03:21:11.375Z',
 u'id': u'34340646',
 u'kind': u'analytics#account',
 u'name': u'Example',
 u'selfLink': u'',
 u'updated': u'2013-01-22T05:09:06.673Z'}

Remember that your account ID and web property ID are public on the web. They're necessary for authenticating your requests to Google, but none of this works without the client_secrets.json file, which you should keep secure.

Making requests

Making requests to the Google Analytics API is super easy with pandas. It works just like most of the pandas io functions.

read_ga takes:

  • metrics: a list of measures
  • dimensions: a list of dimensions
  • start_date: a string, date, or datetime describing your query start date

You can optionally pass an end date or a list of filters for specific hostnames, pages, geographies, and so on. All of that is covered pretty well in the Google API docs, so you should explore those to learn about everything you can do. I'm pretty sure you can fetch data from Custom Segments that you set up as well, though I haven't explored that yet.
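To make the arguments concrete, here's a rough sketch of how you might put together a query for the last 30 days. A few things here are assumptions rather than gospel: last_n_days_query is a made-up helper, the metric names "visits" and "pageviews" are just illustrative, and end_date is assumed to be the keyword for the optional end date. Also keep in mind that pandas.io.ga shipped with older pandas releases, so the read_ga call itself is left as a comment.

```python
from datetime import date, timedelta


def last_n_days_query(n_days, today=None):
    """Build keyword arguments for a read_ga call covering the last n_days.

    Hypothetical helper; metric names and the end_date keyword are assumptions.
    """
    today = today or date.today()
    return {
        "metrics": ["visits", "pageviews"],  # illustrative metric names
        "dimensions": ["date"],              # one row per day
        "start_date": today - timedelta(days=n_days),
        "end_date": today,
    }


query = last_n_days_query(30)
print(query["start_date"], query["end_date"])

# With pandas.io.ga imported as ga (as elsewhere in this post):
# df = ga.read_ga(**query)
```

Building the arguments as a dict keeps the date arithmetic in one place, and you can add things like account_id to it before passing it along.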

Valid Measures & Dimension Combos

One important thing to know is that not all dimensions and measures can be queried together; only certain dimension/measure combos are valid in the same query. You can find out if a metrics-dimension combo is valid here.


Now that you're configured, let's plot some shit.

Use pandas read_ga.

Series plotting recipe:

Download code from this post as an IPython notebook.

