How can data science improve products?
What are predictive models?
How do you go from insight to prototype to production application?
This is an excerpt from "Applied Data Science," a Yhat whitepaper about data science teams and how companies apply their insights to the real world. You’ll learn how successful data science teams are composed, how they operate, and which tools and technologies they use.
We discuss the byproducts of data science and their implications beyond analysts’ laptops, and answer the question of what to do with predictive models once they’re built. Lastly, we inspect the post-model-building process to highlight the most common pitfalls companies encounter when applying data science work to live data problems in day-to-day business functions and applications.
Describing data science
In an increasingly digital economy, businesses are racing to build operational knowledge around the vast amounts of data they produce each day. And with data now at the center of almost every business function, developing practices for working with data is critical regardless of your company’s size or industry.
“Data science,” one of many recently popularized terms floating amid the buzzwords and big data hoopla, is a field concerned with the extraction of knowledge from data. Practitioners, aptly named “data scientists,” are charged with solving complex problems related to data, usually employing a diverse blend of scientific and technical tools as well as deep business and domain expertise.
“What distinguishes data science itself from the tools and techniques is the central goal of deploying effective decision-making models to a production environment.” -John Mount & Nina Zumel, Practical Data Science with R
The central goal of data science
As is the case with any analytical project, the central goal in data science is to produce practical and actionable insights to improve the business. That is to say, data scientists overcome complexities involved in data to empower businesses to make better operational decisions, optimize processes, and improve products and services used by customers and non-technical employees.
Profile of a typical data science project
Project scope and definition
In broad strokes, a data science project begins with some question, need, or goal in mind, defined with varying degrees of focus. Accordingly, a data scientist’s primary task at the start of a new project is to refine the goal and develop concrete project objectives.
Analysts will first conduct a preliminary survey of the data, applying domain knowledge to develop a clear and succinct problem definition to serve as the principal object of study.
Identify relevant data sets
With a narrow and expressive definition of the problem, data scientists can begin to evaluate different data sets to identify which variables are likely to be relevant to the problem they are trying to solve. Evaluating which data sets should be used for the project, however, is not an activity performed in isolation. Most companies have numerous data sets, each highly diverse in shape, composition and size. Analysts may or may not be familiar with a particular data source, how to query it, where it comes from, what it describes or even that it exists.
For these reasons, quantitative analysts are usually working in proximity to or in direct collaboration with engineers, marketers, operations teams, product managers, and other stakeholders to gain a robust and intimate understanding of the data sources at their disposal.
Collaboration at this stage is valuable not only for identifying which data are relevant to a problem but also for ensuring the ultimate viability of any resulting solution. Hybrid teams composed of stakeholders in separate functions produce a deeper collective understanding of both the problem and the data at the center of any project. How a data set is created and stored, how often it changes, and how reliable it is are critical details that can make or break the feasibility of a data product.
For example, consider a new credit-scoring algorithm that is more accurate than previous methods but relies on data no longer sold by the credit bureau. Such circumstances are common today given that data sets are so diverse and subject to frequent change. By incorporating interdepartmental expertise in the early stages of model development, companies dramatically reduce the risk of pursuing unanswerable questions and ensure data scientists are focusing attention on the most suitable data sets.
After firming up the project’s definition and completing a preliminary survey of the data, analysts enter the model-building phase of the analytics lifecycle. The notion of a “model” is often obscure and can be difficult to define, even for those well versed in data science vocabulary.
A statistical model, in short, is an abstract representation of some relationship between variables in data. In other words, a model describes how one or more independent (explanatory) variables relate to one or more dependent variables. A simple linear regression model might, for example, describe the relationship between years of education (X) and personal income (y).
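To make the education and income example concrete, here is a minimal sketch of a simple linear regression fitted by ordinary least squares in plain Python. The data points are made up purely for illustration, and the function names (fit_simple_linear_regression, predict_income) are hypothetical helpers, not part of any particular library.

```python
# Made-up illustrative data: years of education (X) and annual income in $1,000s (y).
years_education = [10, 12, 12, 14, 16, 16, 18, 20]
income_thousands = [28, 35, 33, 42, 50, 55, 61, 72]


def fit_simple_linear_regression(x, y):
    """Return (intercept, slope) of the least-squares line y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # The least-squares slope is covariance(x, y) divided by variance(x).
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


intercept, slope = fit_simple_linear_regression(years_education, income_thousands)


def predict_income(years):
    """Predict income (in $1,000s) for a given number of years of education."""
    return intercept + slope * years


print(f"Each extra year of education adds about {slope:.2f}k to predicted income")
print(f"Predicted income at 15 years: {predict_income(15):.1f}k")
```

The fitted slope is the model's summary of the relationship: the estimated change in income associated with one additional year of education, under the assumption that the relationship is roughly linear.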
But linear regression is far from the only way to represent the relationships in data, and identifying the right algorithms and machine learning methods for your problem is largely an exploratory exercise. Data scientists apply knowledge of the business and advanced research skills to identify those algorithms and methods most likely to be effective for solving a problem. Many and perhaps most data science studies are bound up with solving some combination of clustering, regression, classification, and/or ranking problems. And within each of these categories are numerous algorithms that may or may not be suitable for tackling a given problem.
To that end, the model-building phase is characterized by rigorous testing of different algorithms and methods drawing from one or more of these problem classes (i.e. clustering, regression, classification, and ranking) with the ultimate goal being to identify the “best” way to model some underlying business phenomenon. “Best,” importantly, will take on a different meaning depending on the problem, the data, and the situational nuances tied to the project. For example, the “best” way to model the quality of the Netflix recommendation system is very different from the “best” way to model the quality of a credit-scoring algorithm.
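The point that “best” depends on the problem can be sketched in a few lines. The example below scores two candidate models on the same holdout data under two common error metrics. The targets, predictions, and model names are hypothetical, chosen only to show that the winner can change with the metric.

```python
import math

# Hypothetical holdout targets and predictions from two candidate models.
actual = [10.0, 20.0, 30.0, 40.0]
predictions = {
    "model_a": [11.0, 21.0, 31.0, 41.0],  # consistently off by 1
    "model_b": [10.0, 20.0, 30.0, 43.0],  # exact three times, one miss of 3
}


def mean_absolute_error(y_true, y_pred):
    """Average size of the errors; treats all misses proportionally."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def root_mean_squared_error(y_true, y_pred):
    """Square root of the average squared error; penalizes large misses more."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def best_model(metric):
    """Return the name of the candidate with the lowest error under the metric."""
    return min(predictions, key=lambda name: metric(actual, predictions[name]))


best_by_mae = best_model(mean_absolute_error)
best_by_rmse = best_model(root_mean_squared_error)

print("Best by mean absolute error:", best_by_mae)
print("Best by root mean squared error:", best_by_rmse)
```

Here model_b wins on mean absolute error while model_a wins on root mean squared error, because the second metric punishes the single large miss. Which metric matters is a business decision: a rare large error may be tolerable for movie recommendations but costly in credit scoring.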
Actionable data science & applications in operations
When a data science project progresses beyond the model-building phase, the core question is how best to take advantage of the insights produced. This is a critical juncture, and one that ultimately determines the practical ROI of your data science investment.
Extracting value from data is like any other value chain. Companies expend resources to convert raw material—in this case data—into valuable products and services suitable for the market.
A data product provides actionable information without exposing decision makers to the underlying data or analytics. Examples include: movie recommendations, weather forecasts, stock market predictions, production process improvements, health diagnoses, flu trend predictions, and targeted advertising. -Mark Herman, et al., Field Guide to Data Science
As is the case with any value chain, a product gains value as it progresses from one lifecycle stage to the next. Therefore, the manner in which activities in the chain are carried out is important as it often impacts the system’s value.
Consider the product recommendations example again—our goal is to increase average order size for shoppers on our website by recommending other products users will find relevant.
Data science lifecycle steps:
- Refine the problem definition
- Survey the raw material and evaluate which data to include in the model
- Rigorously test modeling techniques
- Identify a winning modeling strategy for implementation
- Integrate recommendations into the website to influence customers
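As a rough illustration of what a model behind the last two steps could look like, the sketch below recommends products that were frequently bought together in past orders. This is one simple co-occurrence approach chosen for brevity, not the method the whitepaper prescribes. The order data, product names, and function names are all hypothetical.

```python
from collections import Counter, defaultdict
from itertools import permutations

# Hypothetical past orders; each order is the set of product IDs bought together.
past_orders = [
    {"tent", "sleeping_bag", "lantern"},
    {"tent", "sleeping_bag"},
    {"tent", "stove"},
    {"lantern", "batteries"},
    {"sleeping_bag", "pillow"},
]


def build_co_purchase_counts(orders):
    """Count, for each product, how often every other product shares an order with it."""
    counts = defaultdict(Counter)
    for order in orders:
        for product, other in permutations(order, 2):
            counts[product][other] += 1
    return counts


def recommend(cart, counts, top_n=2):
    """Suggest the products most often bought with the cart's items."""
    scores = Counter()
    for product in cart:
        for other, count in counts[product].items():
            if other not in cart:  # never recommend what is already in the cart
                scores[other] += count
    # Sort by score (descending), then by name so ties resolve deterministically.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [product for product, _ in ranked[:top_n]]


co_purchase_counts = build_co_purchase_counts(past_orders)
print(recommend({"tent"}, co_purchase_counts))  # ['sleeping_bag', 'lantern']
```

A function like recommend is only valuable once the website actually calls it while a shopper is browsing, which is the integration step the list ends with.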
Common sense indicates that progressing through step four without achieving step five falls short of the objective. But, sadly, this is a common scenario among companies developing data science capabilities. Similarly, it is often the case that hypotheses are disproved only after companies have invested substantial time and effort engineering large-scale analytics implementations for models which later prove to be suboptimal or entirely invalid.
Why building data-driven products is hard
To read the second half of the whitepaper, download below!