The Yhat Blog


machine learning, data science, engineering


Introduction to Split Testing Models

by Yhat


A Quick A/B Testing Primer

It goes by many names: "A/B Testing", "Split Testing", "Segmentation Testing". They're all talking about the same thing and in principle it's just a simple experiment:

How does A perform compared to B?

In the web-world you might have heard of companies such as Optimizely that help businesses experiment and track the results of variants to their applications (among other things). So while this topic is nothing new, there isn't a lot of talk about how it applies to predictive models.

Why should I split test my models?

In the same way that you might test whether a green button converts better than a blue button, you might want to do the same thing for the recommendation engine you built using NNMF and the other one you built using k-nearest neighbors. While the NNMF version might perform better on new customers, the k-nearest neighbors model might do better on customers with a longer purchase history.

Well, the only way to know which one is best is to test them out! The process is really no different than an A/B test you'd run on your website. You've still got the same basic ingredients:

  • 2 treatments for a given population
  • A random split between both groups
  • A metric (or metrics) to indicate which treatment performs better
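The ingredients above can be sketched in a few lines of Python. This is a minimal, hypothetical example (the function names and the trivial treatments are made up for illustration): two treatments, a random split, and a metric averaged per group.

```python
import random

def run_split_test(population, treat_a, treat_b, metric, split=0.5):
    """Randomly split a population between two treatments and score each.

    treat_a / treat_b are callables (e.g. two recommender models) and
    metric scores a single outcome. Returns the mean metric per group.
    """
    scores = {"A": [], "B": []}
    for subject in population:
        # The random split: each subject lands in A with probability `split`
        if random.random() < split:
            scores["A"].append(metric(treat_a(subject)))
        else:
            scores["B"].append(metric(treat_b(subject)))
    # Average the metric for each treatment group
    return {k: sum(v) / len(v) if v else 0.0 for k, v in scores.items()}
```

Whichever group ends up with the better average metric is your winner (subject, of course, to having collected enough data to trust the difference).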

This might be a good time to note that while this seems similar to cross-validation, or split/train/test, split testing models is less about determining the quality of the model you've built and more about comparing the effectiveness of a new model against an existing one.

Champion / Challenger

In predictive analytics and particularly risk analytics, model split testing is often referred to as "Champion / Challenger". Again, this is the same concept: comparing an established model or strategy (the existing champion) with a new strategy (the challenger) and determining if the challenger performs better!

So before you start testing your NNMF recommender vs. your k-nearest neighbors recommender, consider these things:

  • Patience is key: Depending on the feedback loop, or the amount of time between making a decision and determining the outcome of that decision, your test could take anywhere from 5 minutes (serving display ads) to 12 months (credit card payback rates).
  • If you're small, test BIG: Testing small changes or alterations to your models and strategies is going to yield small improvements. So if you're performing a test on a decision making process that generates $50M a year, then a 1% bump in conversion might be a big deal. But if you're testing smaller, less optimized decisions you should be taking big swings. Look for ways that you can see 5%, 10%, even 50% improvements!
  • Coordinate across teams: If you're testing a model that is integrated with another application (say for instance your company's website), you'll want to make sure that everyone is on the same page. Suppose you decide to start randomized trials on your pricing engine but don't tell the web team. You could find yourself in a situation where your end-users are getting different prices depending on when they visit the website--this doesn't typically go over well.

The "hashing trick"

A handy hack I've used in the past is the "hashing trick". To ensure that data sent to my models received the same treatment if it happened to be sent over multiple times, I used the following criteria for my random split.

First, run the incoming data through a 1-way hashing function. This will turn something like this: {"name": "Greg"}, into this: a93bfdd373704747170c56b4c8b401b9.

This might seem silly at first, but it's super handy. MD5 hashing is consistent, so that means ANY time I hash {"name": "Greg"}, I'll get back a93bfdd373704747170c56b4c8b401b9. The resulting digest then determines which treatment the data is sent to: since hex digests contain only the characters 0-9 and a-f, you might send anything that starts with 0-7 to Treatment A and anything else to Treatment B. Pretty simple, but it works, and it has the added benefit that you can compute which treatment a given datapoint will receive (it's not just based on a Math.random() call somewhere).

You can also use something like a customer ID for this, just make sure beforehand that it's truly random!
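The trick above can be sketched in a few lines of Python with the standard library's hashlib. This is a minimal sketch (the function name is made up for illustration); note that MD5 hex digests only contain the characters 0-9 and a-f, so it splits on the leading hex digit to get an even 50/50 allocation:

```python
import hashlib

def assign_treatment(payload):
    # MD5 is deterministic: the same input always yields the same
    # 32-character hex digest, so repeat requests get the same treatment.
    digest = hashlib.md5(payload.encode("utf-8")).hexdigest()
    # Hex digits are 0-9 and a-f; digests starting 0-7 go to A,
    # those starting 8-f go to B, splitting the space evenly.
    return "A" if digest[0] in "01234567" else "B"
```

Because the assignment is a pure function of the input, you can recompute which treatment any datapoint received long after the fact, with no lookup table required.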

Split Testing in ScienceOps

In the latest release of ScienceOps we launched initial support for Split Testing in Python. Now it's super easy to create your own split tested models. Here's how it works:

  • Create 2 (or more) versions of the same decision making endpoint to test
  • Define a SplitTestModel using the ScienceOps Python client
  • Define Variants using the setup_variants method. For each treatment, or "Variant", this will indicate the label, the method to execute, and the percent of traffic to allocate to a given variant.
  • Deploy your model!

That's all there is to it. ScienceOps will handle the hot deployment, randomization, and labeling of each resulting datapoint for you. Now it's time to sit back, relax, and let your models duke it out!

Final Thoughts

There'll be more to come in the way of split testing in ScienceOps. The initial release allows for fairly basic types of split testing and reporting, but stay tuned for enhancements and improvements in the near future!

If you'd like to learn more about ScienceOps, check out the 2-min video below or schedule a demo with our sales and engineering teams.


