The Yhat Blog

machine learning, data science, engineering

Statistical Quality Control in R

by yhat

Quality control and quality assurance are important functions in most businesses, from manufacturing to software development. For most, this means one or more people meticulously inspecting what comes out of the factory, looking for imperfections and validating that requirements for the products and services produced are satisfied. QC and QA are often performed manually by a select few specialists, and determining suitable quality can be extremely complex and error-prone.

This is a post about quality assurance automation using statistics and R.

What is statistical quality control?

Statistical quality control is a quantitative approach to monitoring and controlling a process. The best way to explain it is through an example.

Say you're the manager at a factory that manufactures lug nuts. And let's suppose your 10 mm long lug nuts continue to function within a 10 percent margin of error (i.e. customers have a tolerance of roughly +/- 1 mm in length). As long as you're producing lug nuts measuring between 9 and 11 mm in length, you'd consider your machine to be functioning as designed.

How would you know if your machine has suffered a malfunction? A 9.7 mm lug nut could be a sign that your machine is starting to produce lug nuts that are too small, or it could just be the natural variation you'd expect from a machine that's supposed to make 10 mm lug nuts.

Take a look at the plots below. Can you tell which one has experienced a change in the mean?
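The original plots don't survive here, but a quick simulation shows the kind of comparison the post has in mind. The 0.5 mm standard deviation and the size and timing of the shift are my assumptions for illustration:

```r
# Two simulated runs of 100 lug nuts: both have sd = 0.5 mm, but the
# second one shifts its mean from 10 to 10.5 mm halfway through.
set.seed(42)
steady  <- rnorm(100, mean = 10, sd = 0.5)
shifted <- c(rnorm(50, mean = 10, sd = 0.5),
             rnorm(50, mean = 10.5, sd = 0.5))

# stack the two series vertically for easy visual comparison
par(mfrow = c(2, 1))
plot(steady,  type = "l", ylab = "length (mm)", main = "Machine A")
plot(shifted, type = "l", ylab = "length (mm)", main = "Machine B")
```

A 0.5 mm shift is small relative to the noise, which is exactly why eyeballing the raw series is unreliable and control charts help.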

Framing the Problem

As a smart manager, you're using statistical quality control to identify issues with your machine. You can think of each lug nut as an observation. Since we're trying to make 10 mm lug nuts, we will assume that the mean lug nut length is 10 mm. This means that over time, the average measured length should approach 10 mm. We're also going to assume that our machine's errors are normally distributed, which in our case means that lug nuts are much more likely to measure close to 10 mm than far from it.

The qcc package

So we've come up with a good framework for our problem, now what? Enter the qcc package in R. This magical little library was built by Luca Scrucca for nothing but statistical quality control. It's extremely easy to use: you provide it with data and it tells you which points are considered outliers based on the Shewhart rules. It even color codes them based on how irregular each point is. In the example below you can see that for the last 10 points of the 2nd dataset I shifted the mean of the data from 10 to 11.
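The post's example code was lost in extraction; a minimal reconstruction might look like the following. The data-generating parameters (sd = 0.5, a shift to 11 for the last 10 points) follow the description above, and the chart type "xbar.one" (a chart for individual measurements) is my assumption:

```r
# install.packages("qcc")
library(qcc)

set.seed(42)
# a healthy machine: 100 lug nuts centered at 10 mm
obs  <- rnorm(100, mean = 10, sd = 0.5)
# a broken machine: the last 10 points shift from 10 to 11, as described
obs2 <- c(rnorm(90, mean = 10, sd = 0.5),
          rnorm(10, mean = 11, sd = 0.5))

# "xbar.one" charts individual observations against Shewhart limits;
# out-of-control points are flagged and highlighted on the chart
qcc(obs,  type = "xbar.one")
qcc(obs2, type = "xbar.one")
```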

You can also define a training/test split from within qcc. Simply pass the data you want to calibrate it with as the first parameter, then pass your test data via the newdata parameter.
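A sketch of that calibration/test pattern, with hypothetical train and test series:

```r
library(qcc)

set.seed(42)
train <- rnorm(100, mean = 10, sd = 0.5)   # used to estimate the center line and limits
test  <- rnorm(50,  mean = 10.5, sd = 0.5) # monitored against those fixed limits

# limits are calibrated on `train`; `test` is plotted beyond them as "new data"
qcc(train, type = "xbar.one", newdata = test)
```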

Some processes might not have normally distributed errors, but I've found that there are often ways you can transform your error term to make it behave normally. It all just depends on how creative you are.

Building Your Own Quality Control Charts

As great as qcc is, it doesn't have my favorite type of statistical quality control - The Western Electric Rules (WER). The WER were first used by (you guessed it) the Western Electric Company as a way to standardize how their employees monitored their electric lines. While the Western Electric Co. isn't around anymore, the rules they came up with are still really useful for monitoring processes. In a minute we'll show you how to implement them yourself, but first let's explain how they work...

The WER are remarkably straightforward and intuitive. For a recurring process take a sampling of points and measure the mean and the standard deviation. We'll use the mean as the "center-line". Then create 3 zones above and below the center-line, each 1 standard deviation in width.

Based on these zones, the Western Electric Co. came up with a set of rules to determine if a process is broken:

  1. One point lies beyond Zone +/- 3 (i.e. more than 3 standard deviations from the center-line)
  2. 2 out of 3 consecutive points lie in Zone +/- 3 or beyond (and on the same side of the center-line)
  3. 4 out of 5 consecutive points lie in Zone +/- 2 or beyond (and on the same side of the center-line)
  4. 8 consecutive points lie on the same side of the center-line

Implementing them on your own

Despite how cool the WER are, they aren't in the qcc package. Luckily, with R they shouldn't be too tricky to implement ourselves.

Defining the Zone

The first thing we need to do is define the thresholds for each of the zones. Each zone is one standard deviation wide, and there are 3 zones on each side of the center-line. Since we also want to know where Zone +/- 3 ends, we need 7 boundaries in total, which split the number line into 8 regions. What we end up with is a grid: the numbers in columns 1 and 2 are the boundaries of Zone -3, columns 2 and 3 bound Zone -2, and so on.
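One way to build that grid; the function name zone_grid is mine, and the seven cut points sit at center + k * sd for k = -3..3:

```r
# Hypothetical sketch: a boundary grid for the six zones. Each row repeats
# the seven cut points, so a whole series of points can be compared against
# it in one vectorized step later on.
zone_grid <- function(x) {
  m <- mean(x)
  s <- sd(x)
  bounds <- m + (-3:3) * s                 # 7 boundaries -> 8 regions
  matrix(bounds, nrow = length(x), ncol = length(bounds), byrow = TRUE)
}
```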

Finding Outliers

Since we know the range for each zone, we now need to determine which zone every point falls into. First we compare our points to each boundary using x > zones. This gives us a giant matrix of TRUE/FALSE values. We can then use this to calculate the zone each point falls into by summing the rows (TRUE/FALSE evaluates to 1/0 when summed); rowSums does row-wise summation on a data.frame/matrix. The value for each point is the zone it belongs to plus 4 (the extra 4 is because a sum of 1 maps to Zone -3), so we subtract 4 from the vector and... voila, we have the zone each point falls into.
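Putting that recipe into a function (find_zones is a hypothetical name; the boundary grid is rebuilt inline so the snippet stands alone):

```r
# Zone index for each point: count how many boundaries it exceeds.
# A sum of 1 means only the center - 3*sd boundary was crossed, i.e.
# Zone -3, hence the `- 4` shift. Values run from -4 (below Zone -3)
# up to 3 (at or beyond the top of Zone +3).
find_zones <- function(x) {
  m <- mean(x)
  s <- sd(x)
  bounds <- m + (-3:3) * s
  zones <- matrix(bounds, nrow = length(x), ncol = 7, byrow = TRUE)
  rowSums(x > zones) - 4
}
```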

Once we've determined which zone a given point falls into, we can check each rule over the trailing window of points. If a given point violates a rule, we flag it with +/- 1 (+ if the violation is above the center-line, - if below).

Putting them together

Using the functions we've defined, we can now compute the rules for each point and then assign a color to any violations.
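Collecting the steps into one self-contained sketch: compute_violations follows the rules as described above, but the window logic and zone encoding here are my reconstruction, not the post's original script.

```r
find_zones <- function(x) {
  m <- mean(x); s <- sd(x)
  bounds <- m + (-3:3) * s                      # 7 boundaries -> 8 regions
  zones <- matrix(bounds, nrow = length(x), ncol = 7, byrow = TRUE)
  rowSums(x > zones) - 4                        # -4 .. 3; e.g. 1 - 4 = -3 is Zone -3
}

# Returns +1/-1 per point for a Western Electric rule violation above/below
# the center-line, 0 otherwise. In this encoding: z >= 0 means above center,
# z >= 1 means beyond +1 sd, z >= 2 beyond +2 sd, z >= 3 beyond +3 sd
# (and symmetrically -1, -2, -3, -4 below).
compute_violations <- function(x) {
  z <- find_zones(x)
  n <- length(x)
  v <- integer(n)
  for (i in seq_len(n)) {
    w3 <- z[max(1, i - 2):i]   # trailing window of 3 points
    w5 <- z[max(1, i - 4):i]   # trailing window of 5 points
    w8 <- z[max(1, i - 7):i]   # trailing window of 8 points
    if      (z[i] >= 3)                         v[i] <-  1  # rule 1, upper
    else if (z[i] <= -4)                        v[i] <- -1  # rule 1, lower
    else if (sum(w3 >=  2) >= 2)                v[i] <-  1  # rule 2: 2 of 3 beyond 2 sd
    else if (sum(w3 <= -3) >= 2)                v[i] <- -1
    else if (sum(w5 >=  1) >= 4)                v[i] <-  1  # rule 3: 4 of 5 beyond 1 sd
    else if (sum(w5 <= -2) >= 4)                v[i] <- -1
    else if (length(w8) == 8 && all(w8 >=  0))  v[i] <-  1  # rule 4: 8 on one side
    else if (length(w8) == 8 && all(w8 <= -1))  v[i] <- -1
  }
  v
}
```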

Visualizing It All

With all of our data points, we can now make a quality control chart. We're going to use the original points and overlay them with the zones, then make each point the color of the rule it breaks (if any).
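A minimal version of such a chart; to keep the snippet self-contained, the violation flags here come from a simple beyond-2-sd check rather than the full rule set:

```r
set.seed(1)
# 40 healthy points, then 10 with the mean shifted from 10 to 11
x <- c(rnorm(40, mean = 10, sd = 0.5), rnorm(10, mean = 11, sd = 0.5))
m <- mean(x)
s <- sd(x)

# stand-in flags: +1/-1 where a point sits beyond 2 sd of the center-line
# (a simplification of the full Western Electric check, just for coloring)
flag <- ifelse(x > m + 2 * s, 1, ifelse(x < m - 2 * s, -1, 0))

plot(x, type = "l", ylab = "length (mm)",
     main = "Western Electric control chart")
abline(h = m, lty = 2)                       # center-line
abline(h = m + (-3:3) * s, col = "grey")     # zone boundaries
points(x, pch = 19,
       col = c("red", "black", "red")[flag + 2])  # flagged points in red
```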

You can get the entire script here.

Deploying to Yhat

Deploying this one is really easy. Since we've encapsulated most of the hard part into our helper functions, we just need to call compute_violations on our series of data. We can bypass the model.transform function since we're working with the raw data itself, and we don't have any external dependencies so we don't need to fill out model.require.
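For reference, a deployment sketch based on the historical yhatr client package (Yhat's service is long gone, so this is illustrative only; the model name, credentials, and the compute_violations helper are placeholders):

```r
# install.packages("yhatr")  # historical Yhat client package
library(yhatr)

# model.transform is skipped because we work with the raw data directly,
# and model.require is left empty since there are no external dependencies
model.predict <- function(df) {
  data.frame(violations = compute_violations(df$x))  # assumed helper
}

yhat.config <- c(
  username = "USERNAME",
  apikey   = "APIKEY",
  env      = "http://cloud.yhathq.com/"
)
yhat.deploy("QualityControl")
```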

Final Thoughts

Even though it's an old topic, statistical quality control is still highly relevant. While you might not be working at a lug nut factory, you probably have lots of jobs, processes, logs, or database metrics that you could monitor using control charts.


Yhat (pronounced Y-hat) provides data science solutions that let data scientists deploy and integrate predictive models into applications without IT or custom coding.