The Yhat Blog

machine learning, data science, engineering

Fuzzy Matching with Yhat

by Greg |

The Problem

Ever had to manually comb through a database looking for duplicates? Anyone that's ever had a data entry job probably knows what I'm talking about. It's not fun! In this post I'm going to show you how you can write a simple, yet effective algorithm for finding duplicates in your data.

Some Example Data

Our example data consists of 500 records, each containing an id, 2 names, and 2 addresses. The address only consists of a street address. If you're lucky enough to have a dataset that contains things like city, state, and zip, you can also use those as anchor points for matches (i.e. zipcode based heuristics).

id name1 street1 name2 street2 match
0 1 Karina Stafford P.O. Box 679, 2851 Feugiat. St. Tallulah B. Mccray Ap #684-5041 Malesuada Street 0
1 2 Leila Bentley 8816 Libero. Ave Sylvia F. Jacobs 5949 Massa. Ave 0
2 3 Keely E. Bernard 189-357 Ipsum Road Hiram Reid 940-5809 Amet Street 0
3 4 Garrett, Amber D. P.O. Box 856, 3334 Sed Road Cleo V. Macias 4406 Placerat, Road 0
4 5 Barnes, Isaiah C. 296-6240 Nisl. Av. Lacota Snyder Ap #788-1556 Tellus, Rd. 0

Building your algorithm

There are lots of ways to go about building a record linkage algorithm. My preferred method (which is also relatively simple) is composed of a few steps:

  • Normalize
  • Featureize
  • Classifying
  • Scoring

1) Normalizing your data

Since we're dealing with inherently messy data (otherwise, why would we need to match it!), it is STRONGLY advised that you first standardize your data . Some common operations are making all text upper/lowercase, removing stop words, reconciling synonyms/abbreviations (Ave -> Avenue) get the point.

This is all fairly trivial to do with a scripting language like R or Python. Here's an example in Python:

2) Creating some basic features

Since we want to automate the process, we're going to need a way to turn a name and address into a set of numbers. I've found that by using some basic string distance metrics to calculate the similarity between 2 given names or addresses, you can get a pretty good quantitative representation of the words you're trying to match.

An example might be using something like the Levenshtein distance to calculate similarity between "John B Good" and "Johnny B Goode". The Levenshtein distance calculates the number of changes you need to make to get from one name to the other. In this case it would be X (explanation of steps).

Vladimir Levenshtein

Dr. Vladimir Levenshtein

To calculate a vector, I'm going to use the FuzzyWuzzy package in Python:

3) Building your classifier

Lastly you need a way to classify whether 2 records match one another. This is actually the easiest (and most fun) part. Since we've created a numerical representation of our data, we can select a few algorithms and see how they perform. I'm going to use scikit-learn in Python as an example:

4) Scoring

Many times, it's good to have heuristics associated with the output of your algorithm. For instance, if you're sifting through a database of escaped convicts, you're probably going to want to limit your exposure to Type II errors. Whereas if you're reconciling a e-mail subscription with 25MM+ addresses having a few dupes slip through the cracks won't mean the end of the world.

type i error, type ii error

Training using sb

Above you notice that I only trained on a small dataset. That's because my laptop would have crashed if I tried to train on anything >100000 rows (roughly) . I'm going to use the ScienceBox along with the sb client to train a much larger dataset.

sb run, fuzzy matcher

Since it's going to take a little while, I'm going to tell the ScienceBox to send me an email when it's finished running.

sb run, fuzzy matcher

Final Thoughts

I hope this gave you some insight into how you can develop your own fuzzy matching algorithms without having to spend lots of time in R&D mode. Here are some other helpful resources for reconciling datasets:

And for more information on ScienceBox:

Our Products

Rodeo: a native Python editor built for doing data science on your desktop.

Download it now!

ScienceOps: deploy predictive models in production applications without IT.

Learn More

Yhat (pronounced Y-hat) provides data science solutions that let data scientists deploy and integrate predictive models into applications without IT or custom coding.