The Yhat Blog


machine learning, data science, engineering


Using data science to build better products

by Colin Ristig


Data science as a field of study is growing at an epic pace. There are competitions to build the best predictive algorithms, tons of data blogs and tutorials, and a number of fast-growing (and hugely successful) professional education platforms for teaching data science skills (Insight Data Science, Zipfian Academy, General Assembly, Coursera, Udacity). Data science has even taken a seat at the big kids' table, with many of the most prestigious colleges and universities now offering undergraduate and graduate-level degrees in data science.

The reason? Well, part of it is that the demand for advanced analytics has never been higher. Companies of all industries and sizes are finding new ways to use data to streamline processes, reach audiences more effectively, and build more useful, personal, and customized products, services, and experiences. From the most exciting new startups to the biggest and most timeless brands, companies must now communicate with and serve customers in an omnichannel world where the bar for superior customer experience is raised every day.

This is a post on how data science is creating some of the coolest and most useful products ever.

Making data insights useful to "normies" is tough

Data science is about extracting knowledge from data and creating practical, actionable insights to improve some facet of a business (e.g. optimize a process, reduce risk, improve a user experience, make a feature more useful or fun, etc.).

Making sure that data insights are useful to people who don't think about machine learning all day is super important, since the beneficiary of data science work is often a front-line employee, a customer/user or another non-technical stakeholder. For this reason--at least for us at Yhat--data science and product-building go hand-in-hand. Sure, we're data junkies and enjoy walking the parameter space as much as the next guy or girl. But the kicker for us in any data analysis project is the "why".

Raw byproducts of data science (scripts, plots, code, prose, pickle files, whatever else) are interesting in an academic sort of way, but there's nothing more motivating than Spotify Radio serving up 10 winners in a row or using iTranslate's English-to-Finnish interpreter feature to communicate with an Airbnb host successfully (really...magic).

Integrating a predictive model into the day-to-day

How companies go about using work produced by data scientists is key. Common sense hopefully tells us that an R script that spits out a credit score isn't tremendously useful to someone applying for a card online.

The ability to go from raw data science byproducts like scripts and plots to a polished final product ready for use by decision makers or customers is, understandably, also quite important.

Efficient means of moving from prototype to production

Yhat ScienceOps is designed to tackle the challenges involved in integrating data scientists' work into apps used by employees and customers quickly, reliably, and without much (sometimes any) coding beyond your team's existing R and Python scripts.

The rest of this post will walk through an example that illustrates how ScienceOps works, highlighting a few features that make the system well suited for rapidly building and shipping data products.

Example

I've decided to build a housing price predictor, something you might expect real estate companies like Zillow/Trulia or Redfin to use. The data used to train our model comes from the UC Irvine ML repository and contains details on Boston housing prices in the 1970s. You can find it here.

The ML algorithm used in this demo is pretty simple: a regular OLS model. I've done this intentionally to really drive this point home: this blog post is about deploying algorithms into production, not building algorithms. I encourage you to fine-tune (or overhaul) the algorithm yourself (after you've deployed it, of course).

Building our model

I'll be building our algorithm in Python; however, this is all easily replicated in R as well.

Below, we specify required packages, import our data, and select the features to train on.

import pandas as pd
from sklearn import linear_model
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Load the housing data and keep every column except the target
df = pd.read_csv('housing_data.csv', sep=',')
features = df.columns[df.columns != "MEDVALUE"]

Next, we'll set our target variable, MEDVALUE, and train our model.

target = "MEDVALUE"
y = df[target]
X = df.drop(target, 1)
X_train, X_test, y_train, y_test = train_test_split(X, y)

clf = linear_model.LinearRegression()
clf.fit(X_train,y_train)

y_pred = clf.predict(X_test)
print r2_score(y_test, y_pred)

An R² value of 0.74, not too shabby!
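By the way, if you want to take up that fine-tuning challenge from earlier, here's one minimal sketch of a starting point: swapping the plain OLS model for a cross-validated ridge regression. The alpha grid below is an arbitrary choice for illustration, not anything from the original analysis.

import numpy as np
from sklearn.linear_model import RidgeCV

# Ridge regression with the regularization strength chosen by
# cross-validation; the alpha grid is just an illustrative guess
ridge = RidgeCV(alphas=np.logspace(-3, 3, 13))
ridge.fit(X_train, y_train)

print(ridge.alpha_)                             # the chosen regularization strength
print(r2_score(y_test, ridge.predict(X_test)))  # compare against the OLS score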

Deploying our Model to ScienceOps

Finally, we'll deploy our model to ScienceOps.

Note: If you plan on deploying your own model, you'll need to change the username/API key.

Go here to create a username and API key

from yhat import Yhat, YhatModel, preprocess, df_to_json

class HousePred(YhatModel):
    @preprocess(in_type=pd.DataFrame, out_type=pd.DataFrame)
    def execute(self, data):
        result = clf.predict(data[features])
        df = pd.DataFrame(data={'predicted_price': result})
        return df

yh = Yhat("USERNAME","API_KEY","http://cloud.yhathq.com/")
yh.deploy("HouseValuePredictor", HousePred, globals())
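Before moving on, it's worth a quick sanity check that the deployed model responds. Here's a minimal sketch from the same Python session; the all-1.0 input row is made up purely for illustration, and it assumes the Python client exposes the same predict call the Node client uses later in this post.

# A single made-up input row, every feature set to 1.0, just to
# confirm the deployed model returns a prediction
row = {feature: [1.0] for feature in features}

print(yh.predict("HouseValuePredictor", row))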

Speccing Out the Application

I've decided to build our user-facing app in Node.js, a server-side JavaScript runtime, but we could easily use Java/Spring, Ruby on Rails, iOS or whatever else.

Let's lay out our feature/engineering requirements before we get rolling:

1: User should be able to enter data into fields detailing a home's characteristics

2: The web server should take the user's inputs and send them to the prediction server (ScienceOps)

3: The prediction server should predict a price with the "HouseValuePredictor" algorithm (detailed above)

4: The web server should then get passed the predicted price and return it to the user

Sounds simple enough!

Here's a schematic for how this should all work:


[Schematic: the user's browser posts form data to the web server, which forwards it to ScienceOps for a prediction and returns the predicted price to the user]


Building the app

We'll start with building the UI and then create the backend server functions.

1: User should be able to enter data into fields detailing a home's characteristics:

To do this, we'll need to create a form for the user to input the characteristics of their house. The most important parts of this form are the action and method attributes. They make the submit button trigger a POST request to our web server's /predict route, which will in turn ask ScienceOps for a prediction.

Below is an abbreviated version of the form:

<form class="form-horizontal" role="form" action="/predict" method="post">
        <label class="col-sm-3 control-label" for="Bedrooms">Bedrooms:</label>
        <input class="form-control" type="number" name="Bedrooms" value="1">
        <button type="submit" class="btn btn-success"><h3>Run the Model</h3></button>
</form>

Now that we've got our front-end HTML page wrapped up, let's build the server side to send the page to the user.

I won't get into too many of the details of Node.js and Express, but you can think of them as a web server that accepts and sends requests.

For starters, we'll need to install the Yhat Node client. You can do this by running the code below:

$ npm install yhat

For our app, we'll want to first send the user an HTML page called index, which is where our form lives:

var express = require('express');
var app = express();

app.get('/', function(req, res) {
    res.render('index');
});

Sending the POST request to Yhat ScienceOps

This is the exciting part.

We'll combine parts 2-4 into one single function. Crazy, right?

2: The web server should take the user's inputs and send them to the prediction server (Yhat ScienceOps)

3: The prediction server should predict a price with the "HouseValuePredictor" algorithm (detailed above)

4: The web server should then get passed the predicted price and return it to the user

We have to capture the form data and send it to the server.


First, we'll authenticate our app to access our algorithm. This uses the Yhat npm package we installed earlier.

var yhat = require('yhat');
var yh = yhat.init("YOUR_USERNAME", "API_KEY", "http://cloud.yhathq.com/");

Next, we'll build a function that does a few things, namely:

  1. Parse the HTML form for the data when "Run the Model" is clicked.

  2. Send the data to the HouseValuePredictor algorithm hosted on ScienceOps.

  3. Receive the POST response from ScienceOps.

  4. Render a new page with the prediction.

The function below does just this:

var accounting = require('accounting');  // for formatting the prediction as $USD

// When a POST to '/predict' occurs, run this function
// (assumes body-parser middleware is set up so req.body is populated)
app.post('/predict', function(req, res) {

    // Parse the request body for the form data
    var data = {
        "Bedrooms": [parseFloat(req.body.bedrooms)]
        , "Bathrooms": [parseFloat(req.body.bathrooms)]
        , "TotalSquareFeet": [parseFloat(req.body.totalsqft)]
        , "Neighborhood": [parseFloat(req.body.neighborhood)]
    };

    // Get the prediction from the HouseValuePredictor algorithm!
    yh.predict("HouseValuePredictor", data, function(err, rsp) {
        if (err) {
            console.log("Error connecting to server: " + err);
            return res.status(500).send("Prediction failed");
        }
        console.log(rsp.result);

        // Format the prediction into $USD
        var formatted_price = accounting.formatMoney(rsp.result.predicted_price * 1000);

        // Render the new page with the prediction
        res.render('response', {formatted_price: formatted_price});
    });
});

I realize that last code snippet was... not exactly a snippet, so let's pause for a sec and see what this actually looks like in production.

You can visit the app here to try it out.


Why is this so great?

For starters, we removed all dependencies between our web application toolset and our data science toolset.

And while we somewhat glossed over it, ScienceOps hosts models and makes them accessible through several standards-compliant APIs. Deploy your models to ScienceOps and then query them to make predictions from any other application. ScienceOps also eliminates the overhead needed to maintain models running in production, allowing analysts to focus on new problems rather than on manual maintenance of existing models.
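And because hosted models are just HTTP endpoints, you aren't limited to the official clients. Here's a rough sketch of querying the model with plain Python and requests; the endpoint path and basic-auth scheme shown here are assumptions for illustration, so check your ScienceOps server for the actual URL format.

import requests

# Hypothetical endpoint shape for a hosted model; substitute the URL
# your ScienceOps instance actually exposes
url = "http://cloud.yhathq.com/USERNAME/models/HouseValuePredictor/"

payload = {
    "Bedrooms": [3],
    "Bathrooms": [2],
    "TotalSquareFeet": [1600],
    "Neighborhood": [1],
}

# Authenticate with your username and API key (assumed HTTP basic auth)
rsp = requests.post(url, json=payload, auth=("USERNAME", "API_KEY"))
print(rsp.json())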

Building machine learning models is cool, but products that actually use machine learning are 1000x cooler, more useful, and more valuable.

Code from this post

All the code used to fit our model and build the web app is on GitHub.

Housing Predictor Model

Housing App

Try ScienceOps Free

You can check out ScienceOps yourself for free! Give it a whirl and let us know if you have questions.

Try ScienceOps Free

ScienceOps for Enterprise

We ship an enterprise version of ScienceOps, so if you're interested in learning more about how it'd work at your company, please get in touch!

Learn About ScienceOps for Enterprise

Email me

Seriously, I'd love to hear from you!

c [at] yhathq [dot] com



Our Products


Rodeo: a native Python editor built for doing data science on your desktop.

Download it now!

ScienceOps: deploy predictive models in production applications without IT.

Learn More

Yhat (pronounced Y-hat) provides data science solutions that let data scientists deploy and integrate predictive models into applications without IT or custom coding.