Ever had to manually comb through a database looking for duplicates? Anyone who's ever had a data entry job probably knows what I'm talking about. It's not fun! In this post I'm going to show you how to write a simple, yet effective algorithm for finding duplicates in your data.
Some Example Data
Our example data consists of 500 records, each containing an id, 2 names, and 2 addresses. Each address consists of only a street line. If you're lucky enough to have a dataset that contains things like city, state, and zip, you can also use those as anchor points for matches (e.g. zip-code-based heuristics).
|   | id | name_1 | address_1 | name_2 | address_2 | label |
|---|----|--------|-----------|--------|-----------|-------|
| 0 | 1 | Karina Stafford | P.O. Box 679, 2851 Feugiat. St. | Tallulah B. Mccray | Ap #684-5041 Malesuada Street | 0 |
| 1 | 2 | Leila Bentley | 8816 Libero. Ave | Sylvia F. Jacobs | 5949 Massa. Ave | 0 |
| 2 | 3 | Keely E. Bernard | 189-357 Ipsum Road | Hiram Reid | 940-5809 Amet Street | 0 |
| 3 | 4 | Garrett, Amber D. | P.O. Box 856, 3334 Sed Road | Cleo V. Macias | 4406 Placerat, Road | 0 |
| 4 | 5 | Barnes, Isaiah C. | 296-6240 Nisl. Av. | Lacota Snyder | Ap #788-1556 Tellus, Rd. | 0 |
Building your algorithm
There are lots of ways to go about building a record linkage algorithm. My preferred method (which is also relatively simple) is composed of a few steps:
1) Normalizing your data
Since we're dealing with inherently messy data (otherwise, why would we need to match it!), it is STRONGLY advised that you first standardize your data. Some common operations include making all text upper- or lowercase, removing stop words, and reconciling synonyms and abbreviations (Ave -> Avenue)...you get the point.
This is all fairly trivial to do with a scripting language like Python.
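Here's a sketch of what that normalization step might look like. The abbreviation map is illustrative, not an exhaustive list:

```python
import re

# Illustrative abbreviation map -- extend this for your own data.
ABBREVIATIONS = {
    "ave": "avenue",
    "av": "avenue",
    "st": "street",
    "rd": "road",
}

def normalize(text):
    """Lowercase, strip punctuation, and expand common abbreviations."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)  # replace punctuation with spaces
    tokens = [ABBREVIATIONS.get(t, t) for t in text.split()]
    return " ".join(tokens)

print(normalize("8816 Libero. Ave"))  # -> 8816 libero avenue
```

Run every name and address through a function like this before computing any similarity scores, so that cosmetic differences (case, punctuation, "St" vs. "Street") don't drown out the real signal.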
2) Creating some basic features
Since we want to automate the process, we're going to need a way to turn a name and address into a set of numbers. I've found that by using some basic string distance metrics to calculate the similarity between 2 given names or addresses, you can get a pretty good quantitative representation of the words you're trying to match.
An example might be using something like the Levenshtein distance to calculate similarity between "John B Good" and "Johnny B Goode". The Levenshtein distance counts the number of single-character edits (insertions, deletions, substitutions) you need to make to get from one name to the other. In this case it would be 3: inserting "n", "y", and a trailing "e" turns "John B Good" into "Johnny B Goode".
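For reference, here's a plain-Python Levenshtein distance, written as a standard dynamic-programming sketch:

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("John B Good", "Johnny B Goode"))  # -> 3
```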
To calculate a feature vector, I'm going to use the FuzzyWuzzy package in Python.
3) Building your classifier
Lastly, you need a way to classify whether 2 records match one another. This is actually the easiest (and most fun) part. Since we've created a numerical representation of our data, we can try a few algorithms and see how they perform. I'm going to use scikit-learn in Python as an example.
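A minimal sketch of what that looks like, using a random forest; the feature rows and labels below are made up for illustration (in practice they'd come from your labeled record pairs):

```python
from sklearn.ensemble import RandomForestClassifier

# Hypothetical training data: each row is a vector of similarity
# scores (0-100) for a record pair; y marks whether the pair is a dupe.
X = [
    [95, 98, 90, 92],   # near-identical records -> duplicate
    [88, 91, 85, 89],
    [20, 25, 30, 15],   # clearly different records
    [10, 18, 22, 9],
    [92, 95, 88, 90],
    [15, 20, 25, 12],
]
y = [1, 1, 0, 0, 1, 0]

clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X, y)

# Score an unseen pair of records
print(clf.predict([[90, 93, 87, 91]]))  # high scores -> likely a match
```

Swapping in logistic regression, gradient boosting, or an SVM is a one-line change here, which is exactly why the numerical representation pays off.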
Many times, it's good to have heuristics associated with the output of your algorithm. For instance, if you're sifting through a database of escaped convicts, you're probably going to want to limit your exposure to Type II errors (missed matches). Whereas if you're reconciling an e-mail subscription list with 25MM+ addresses, having a few dupes slip through the cracks won't mean the end of the world.
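One simple way to encode that heuristic is to threshold predicted probabilities instead of using the classifier's default 0.5 cutoff. A self-contained sketch (the training data here is again made up for illustration):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy setup so the sketch runs on its own.
X = np.array([[95, 90], [88, 85], [20, 30], [10, 22], [92, 88], [15, 25]])
y = np.array([1, 1, 0, 0, 1, 0])
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

candidates = np.array([[91, 89], [55, 40], [12, 18]])
proba = clf.predict_proba(candidates)[:, 1]  # P(duplicate) for each pair

# High-stakes search (minimize Type II errors / missed matches):
# flag anything with even modest probability for human review.
review_queue = candidates[proba > 0.2]

# Low-stakes dedupe (a few misses are fine): only act on confident matches.
auto_merge = candidates[proba > 0.9]
```

The two thresholds trade false negatives against false positives: the convict search gets the low cutoff plus a human in the loop, the mailing list gets the high one.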
Above you'll notice that I only trained on a small dataset. That's because my laptop would have crashed if I'd tried to train on anything over roughly 100,000 rows. I'm going to use the ScienceBox along with the sb client to train on a much larger dataset. Since it's going to take a little while, I'm going to tell the sb client to send me an email when it's finished running.
I hope this gave you some insight into how you can develop your own fuzzy matching algorithms without having to spend lots of time in R&D mode. Here are some other helpful resources for reconciling datasets:
- Approximate String Matching (Wikipedia)
- Duplicate Detection, Record Linkage, and Identity Uncertainty: Datasets
- CDC Resources
- Probabilistic Record Linkage
And for more information on ScienceBox: