The Yhat Blog


machine learning, data science, engineering


Analysing your e-commerce funnel with R

by Justin Marciszewski


This post is by Justin Marciszewski, Founding Partner at Harbor Island Analytics, an analytics consultancy specializing in e-commerce, digital marketing, and user behavior strategy. Harbor Island helps clients use data to identify new opportunities, reach audiences more effectively, and make stickier apps.

Find Justin on LinkedIn, Twitter, or Github, or reach him by email at justin [at] harborislandanalytics [dot] com.

Intro

Optimizing on-site or in-app sales is one of the most common problems in online retail, if not the most common. I thought it'd make a great topic for a post since many folks are familiar with the problem and will, hopefully, find the discussion productive and useful.

This post explores methods we use at Harbor Island when working with clients to boost conversion rates and improve checkout funnels on their sites and e-commerce web apps. Our goal is to answer this question:

How do we measure the results of our actions with some degree of statistical confidence? In other words, how do we know that any change in conversion rate is the result of our actions and NOT of random chance?

One note: this post doesn't talk about A/B testing. A/B testing is a valuable practice, but running experiments with two or more simultaneous checkout flows isn't always practical for small and medium e-retailers. This post covers alternative ways to ask and answer questions about a sales conversion funnel.

Dataset

Working with different online retailers has given me ample opportunity to get familiar with lots of different web analytics services.

From Mixpanel and Heap to Shopify and Magento, there are dozens of platforms for tracking and measuring what users are doing on your site or in your app. The most common one in my experience is Google Analytics, so we'll use GA for this post.

Packages

To collect data from Google Analytics, we'll use the rga package for R, which you can find on Github: skardhamar/rga - an R package for querying Google Analytics

Installing and using rga for R

First, we'll install and load the necessary packages:
# install.packages("devtools")
library(devtools)

# install_github("skardhamar/rga")
library(rga)

# Save the GA instance locally. it'll open your browser
# and prompt for auth once. reuse the local auth file to
# avoid re-authenticating later.
rga.open(instance="ga", where="ga.rga")

Setup GA Parameters

Next, let's set up the parameters in a way that we can easily find and change them for future re-runs of the analysis:

Note: A full list of Google Analytics parameters can be found here: Google Analytics Dimensions & Metrics Reference

start_date <- "2013-01-01"
end_date <- "2014-07-20"
metrics <- "ga:goal1Starts,ga:goal1Completions,ga:goal1ConversionRate"
sort <- "ga:date"
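
Since the date range is now a parameter, one quick sanity check is to work out how many daily rows the query should return. This is only a sketch, and it assumes the API returns exactly one row per calendar day in the range; `expected_days` is a name introduced here for illustration.

```r
# Sketch: how many daily rows should this date range produce?
# Assumes one row per calendar day, with both endpoints included.
start_date <- "2013-01-01"
end_date <- "2014-07-20"

# difftime() gives the gap in days; add 1 so both endpoints are counted.
gap <- difftime(as.Date(end_date), as.Date(start_date), units = "days")
expected_days <- as.numeric(gap) + 1
expected_days
```

Comparing this number with the row count reported by str() is a cheap way to spot missing or duplicated days.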

Query GA API & Collect Data

Now we can query and collect data from Google Analytics.

# swap `XXXXXX` for your profile_id
src <- ga$getData(
    "ga:XXXXXX",
    start.date = start_date,
    end.date = end_date,
    metrics = metrics,
    sort = sort,
    batch = TRUE
)

Cleanup

Let's check out the data:

str(src)
## 'data.frame':    566 obs. of  4 variables:
##  $ date               : Date, format: "2013-01-01" "2013-01-02" ...
##  $ goal1Starts        : num  45 62 56 43 38 46 66 47 60 67 ...
##  $ goal1Completions   : num  28 36 28 33 26 27 48 31 35 38 ...
##  $ goal1ConversionRate: num  0.563 0.63 0.391 0.576 0.51 ...
summary(src)
##       date             goal1Starts    goal1Completions goal1ConversionRate
##  Min.   :2013-01-01   Min.   :  0.0   Min.   :  0.0    Min.   :0.000
##  1st Qu.:2013-05-22   1st Qu.: 28.0   1st Qu.: 14.0    1st Qu.:0.260
##  Median :2013-10-10   Median : 46.0   Median : 26.0    Median :0.435
##  Mean   :2013-10-10   Mean   : 53.3   Mean   : 29.8    Mean   :0.466
##  3rd Qu.:2014-02-28   3rd Qu.: 68.0   3rd Qu.: 39.0    3rd Qu.:0.641
##  Max.   :2014-07-20   Max.   :384.0   Max.   :186.0    Max.   :1.374

Looks good! So let's go ahead and 'group' the data by the site design (we know that the new checkout page design went live on June 3, 2014):

Aggregate by checkout page designs

In this example, the client had recently made a substantial change to the design/layout of the checkout page on their site in early June.

ga <- src

# Label each day by the checkout design that was live on that date.
# The new design launched on 2014-06-03; every earlier date is "Old".
launch_date <- as.Date("2014-06-03")
ga$design <- factor(ifelse(ga$date >= launch_date, "New", "Old"))

# Save Data Locally (just in case)
write.csv(ga, "ga_cr_data.csv", row.names = FALSE)
summary(ga)
##       date             goal1Starts    goal1Completions goal1ConversionRate
##  Min.   :2013-01-01   Min.   :  0.0   Min.   :  0.0    Min.   :0.000
##  1st Qu.:2013-05-22   1st Qu.: 28.0   1st Qu.: 14.0    1st Qu.:0.260
##  Median :2013-10-10   Median : 46.0   Median : 26.0    Median :0.435
##  Mean   :2013-10-10   Mean   : 53.3   Mean   : 29.8    Mean   :0.466
##  3rd Qu.:2014-02-28   3rd Qu.: 68.0   3rd Qu.: 39.0    3rd Qu.:0.641
##  Max.   :2014-07-20   Max.   :384.0   Max.   :186.0    Max.   :1.374
##  design
##  New: 48
##  Old:518  

Great, looks good!

Note: I don't typically talk to myself this much when I'm doing the work on my own...

Visualize

library(ggplot2)
library(scales)

daily_CR_plot <- ggplot(ga,
    aes(date,goal1ConversionRate, colour = design)) +
  geom_point() +
  stat_smooth() +
  ggtitle("Site Design & Conversion Rates\nJan 2013 - Jul 2014") +
  xlab("date") +
  ylab("Conversion Rate") +
  scale_y_continuous(labels = percent, limits=c(0, 1))

daily_CR_plot 

If you're noticing that some days appear to have conversion rates of over 100%, so did I. I wasn't able to figure out what may have triggered that, and the client was unaware of it. This type of issue crops up all the time, and datasets are rarely ultra clean. In any event, we'll treat these as outliers: the plots in this post cap the y-axis at 100%, which drops those days from the charts.
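
One way to list the suspicious days is to recompute the rate from the raw counts and flag anything above 100%. The snippet below is a minimal sketch on made-up numbers (the `toy` data frame and the `rate` and `suspect` columns are stand-ins, not the client's data), and it assumes the rate is simply completions divided by starts.

```r
# Toy stand-in for `ga`; all values are invented for illustration.
toy <- data.frame(
  date = as.Date("2014-01-01") + 0:4,
  goal1Starts = c(40, 50, 8, 60, 0),
  goal1Completions = c(20, 30, 11, 24, 0)
)

# Recompute the daily rate from the raw counts.
# Days with zero starts get NA instead of a division-by-zero result.
toy$rate <- ifelse(toy$goal1Starts > 0,
                   toy$goal1Completions / toy$goal1Starts,
                   NA)

# Flag days where completions exceed starts (rate above 100%).
toy$suspect <- !is.na(toy$rate) & toy$rate > 1

# Show only the flagged days.
toy[toy$suspect, ]
```

Having the flagged dates in hand makes it easier to ask the client what happened on those specific days.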

Here we'll group by site design and compute the median conversion rate for each version of the funnel.

library(plyr)
library(scales)

medianCR <- ddply(ga, .(design), summarise,
                  medianCR = median(goal1ConversionRate))

total_CR_boxplot <- ggplot(ga, aes(design, goal1ConversionRate, fill = design)) +
  geom_boxplot() +
  geom_text(data = medianCR,
            aes(design, medianCR, label = percent(medianCR)),
            size = 3, vjust = -0.5) +
  theme(legend.position = "none") +
  ggtitle("Checkout Funnel Conversion Rate by Site Design\nJan 1, 2013 - Jul 20, 2014") +
  xlab("Site Design") +
  ylab("Checkout Funnel Conversion Rate") +
  scale_y_continuous(labels = percent, limits=c(0, 1))

total_CR_boxplot

Yikes! Changing the site's checkout page layout and design appears to have decreased conversion rates. This calls for deeper investigation to determine what might be going on.
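
The median above is taken over daily rates, so a slow day counts as much as a busy one. A complementary view pools starts and completions per design and divides the totals. This is a sketch on made-up numbers rather than the client's data; `toy`, `totals`, and `pooledCR` are names introduced for illustration.

```r
# Toy stand-in for `ga`; all values are invented for illustration.
toy <- data.frame(
  design = c("Old", "Old", "Old", "New", "New"),
  goal1Starts = c(100, 10, 50, 80, 20),
  goal1Completions = c(50, 8, 20, 30, 5)
)

# Sum starts and completions within each design.
totals <- aggregate(cbind(goal1Starts, goal1Completions) ~ design,
                    data = toy, FUN = sum)

# Pooled rate: total completions over total starts, so busy days weigh more.
totals$pooledCR <- totals$goal1Completions / totals$goal1Starts
totals
```

If the pooled rate and the median of daily rates tell different stories, that usually points to a few low-traffic days with extreme rates.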

Are the results significant?

Going back to the goal we defined at the beginning, here's what we need to do first:

  1. Establish a baseline conversion rate. As calculated above, ours is 46%.
  2. Decide on the minimum % change we want to be able to detect. We'll say that we want to detect a minimum 10% (up OR down) change based on our actions.
  3. Ensure our sample size (# of visitors) is large enough. I'm going to use Evan Miller's great Sample Size Calculator for this.

  • Total Visits needed (minimum sample size): 1,561
  • Total Visits to new design: 1,705
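
A similar calculation can be sketched in R with power.prop.test() from the built-in stats package. Several things here are assumptions rather than settings taken from the analysis above: the 10% is treated as a relative change from the 46% baseline, and the test is two-sided with 5% significance and 80% power. The online calculator may use a different formula or different defaults, so the number this returns need not match the figure quoted above.

```r
# Sketch of a sample-size calculation for comparing two proportions.
# Assumptions (not taken from the original analysis):
#   - the 10% minimum detectable change is relative to the baseline
#   - two-sided test, 5% significance level, 80% power
baseline <- 0.46
rel_change <- 0.10

res <- power.prop.test(
  p1 = baseline,
  p2 = baseline * (1 + rel_change),
  sig.level = 0.05,
  power = 0.80
)

# power.prop.test() reports the sample size needed in EACH group.
n_per_group <- ceiling(res$n)
n_per_group
```

Shrinking `rel_change` makes the required sample grow quickly, which is why it pays to decide on the minimum change before looking at the results.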

Testing Significance with a t-test

t.test(goal1ConversionRate~design, ga, var.equal=TRUE)
##
##  Two Sample t-test
##
## data:  goal1ConversionRate by design
## t = -3.395, df = 564, p-value = 0.0007336
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.21316 -0.05692
## sample estimates:
## mean in group New mean in group Old
##            0.3425            0.4776

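New Mean: ~34%

Old Mean: ~48%

p-value: 0.0007336

Finally, and most importantly, we can see that the p-value (0.0007336) is < .05. Simply put, a difference this large between the two designs' daily conversion rates would be very unlikely if random variation were the only thing at work. Since this is a before-and-after comparison rather than a randomized experiment, other things that changed over the same period (seasonality or traffic mix, for example) can't be ruled out, but the design change is the obvious suspect.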

So it looks like we're good to go and proclaim our results significant! (even though they are not as good as we'd hoped)
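
The t-test above treats each day as one observation and assumes equal variances in both groups. Two common variations are sketched below on simulated data: Welch's t-test (R's default when var.equal is left out) and a two-sample proportion test on pooled counts via prop.test(). The data, the "true" rates, and the names `toy`, `welch`, and `pooled` are all made up for illustration. This is not a rerun of the client analysis.

```r
# Simulated stand-in for `ga`: 60 "Old" days and 20 "New" days.
set.seed(42)
toy <- data.frame(
  design = rep(c("Old", "New"), times = c(60, 20)),
  goal1Starts = rpois(80, 50) + 1   # +1 guarantees at least one start per day
)

# Invented underlying conversion rates for each design.
true_rate <- ifelse(toy$design == "Old", 0.48, 0.34)
toy$goal1Completions <- rbinom(80, size = toy$goal1Starts, prob = true_rate)
toy$goal1ConversionRate <- toy$goal1Completions / toy$goal1Starts

# Variation 1: Welch's t-test on daily rates (no equal-variance assumption).
welch <- t.test(goal1ConversionRate ~ design, data = toy)

# Variation 2: proportion test on totals, so busy days carry more weight.
totals <- aggregate(cbind(goal1Completions, goal1Starts) ~ design,
                    data = toy, FUN = sum)
pooled <- prop.test(x = totals$goal1Completions, n = totals$goal1Starts)

welch$p.value
pooled$p.value
```

When group sizes are as lopsided as they are here, the Welch version is usually the safer default, since it doesn't rely on both groups having the same spread.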

While the difference is quite obvious in this instance, in other cases the change in the checkout funnel conversion rate can be less clear. In those cases you'd need a more advanced method, like the one in Nathan Yau's blog post on Detecting and Plotting Sequence Changes.
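
To give a flavor of what detecting a change looks like when the launch date isn't known, here is a deliberately simple sketch. It is not the method from the post mentioned above. It just tries every possible split point in a series and keeps the one where the two resulting segments are most internally consistent (smallest total squared deviation from their own means). The series is simulated, and `split_cost` and `best_k` are hypothetical names.

```r
# Simulated daily rates: 40 days around 48%, then 20 days around 34%.
set.seed(1)
rates <- c(rnorm(40, mean = 0.48, sd = 0.03),
           rnorm(20, mean = 0.34, sd = 0.03))

# Cost of splitting after position k: squared deviations of each
# segment from its own mean, added together.
split_cost <- function(x, k) {
  left <- x[seq_len(k)]
  right <- x[(k + 1):length(x)]
  sum((left - mean(left))^2) + sum((right - mean(right))^2)
}

# Try every split that leaves at least two points on each side.
candidates <- 2:(length(rates) - 2)
costs <- vapply(candidates, function(k) split_cost(rates, k), numeric(1))

# The best split is the last index of the first segment.
best_k <- candidates[which.min(costs)]
best_k
```

This only looks for a single shift in the mean, so series with trends, seasonality, or several changes need something more capable.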

Automation

The work described above is fairly generalizable across clients since most have Google Analytics and most are e-commerce sites that think about the sales funnel. I use ScienceBox to make onboarding new clients, jump-starting new projects, and rerunning earlier reports easier and faster.

These are probably my favorite ScienceBox features:

  1. Zero setup for R environments, and it comes with RStudio Server.
  2. I can rerun scripts on a weekly or monthly schedule to generate up-to-date reports for each of my clients easily.
  3. It's easy to sync projects between my Mac and AWS.

And a super nice bonus feature that I love is that ScienceBox will put itself to sleep and wake itself up to run jobs that I've scheduled. That's a great one since it saves me from running up big AWS bills each month!

And another thing...

If you haven't checked out Plot.ly, you definitely should.

It's an awesome app/framework that I've been getting a lot out of and having a lot of fun with too!!

library(plotly)
py <- plotly()

ga_py <- ga # Create new data set for plot.ly graph
ga_py$date <- as.POSIXct(paste(ga_py$date, " 00:00:00", sep = ""))

daily_CR_plot2 <- ggplot(ga_py, aes(date,goal1ConversionRate, colour = design)) +
  geom_line() +
  ggtitle("Daily Checkout Funnel Conversion Rate by Site Design (Jan 1, 2013 - Jul 20, 2014)") +
  xlab("date") +
  ylab("Conversion Rate (%)") +
  scale_y_continuous(labels = percent, limits=c(0, 1))

py$ggplotly(daily_CR_plot2,
            kwargs = list(
              filename = "Blogging/Checkout Funnel Conversion Rate Comparison - Time Series",
              fileopt = "overwrite"
            ))

It plays well with R, among other languages, and generates great visuals.
