The Yhat Blog

machine learning, data science, engineering

In natural language processing, entity recognition problems are those in which the principal task is to identify irreducible elements like people, places, products, companies, and measurements within a body of text.

"THE ULTIMATE GOAL of research on Natural Language Processing is to parse and understand language."

-Manning & Schütze, Foundations of Statistical Natural Language Processing

Tools for NLP have become extremely accessible via open source projects like Python's Natural Language Toolkit, and there are seemingly endless applications for processing human language given the troves of structured and unstructured text available via the web, social media, email, SMS, digital books, etc.

This is a post about natural language processing using Python's NLTK project.

The Dataset

As an example, we'll look at a dataset of user-submitted recaps of Law and Order episodes, along with other metadata about the show obtained from Wikipedia.

import nltk
import numpy as np
import pandas as pd

df = pd.read_csv('./episodes_and_recaps.txt', sep='|')
print 'nrow: %d' % len(df)
print 'ncol: %d' % len(df.columns)
nrow: 775
ncol: 13

Int64Index: 775 entries, 0 to 774
Data columns (total 13 columns):
directed_by            775  non-null values
no_in_season           775  non-null values
no_in_series           775  non-null values
original_air_date      775  non-null values
production_code        680  non-null values
title                  775  non-null values
us_viewers_millions    428  non-null values
written_by             775  non-null values
nth_season             775  non-null values
show                   775  non-null values
corpus_url             775  non-null values
source                 775  non-null values
corpus                 167  non-null values
dtypes: float64(1), int64(3), object(9)
print 'Number of episodes in the Law and Order franchise'
print df.show.value_counts()

Number of episodes in the Law and Order franchise
original    456
svu         319
dtype: int64

Because these are submitted by users, not every episode will have a recap / corpus, so we'll need to remember to drop rows that are missing the `corpus` column.

text = df[df.corpus.notnull()].corpus.values[9]
print text[:500]

After a woman called Angela Jarrell is discovered dead in front of a bakery, her purse, which the detectives come across two blocks away, is found to contain a bag of ecstasy. From the victim's stomach contents, they place her in a bar, where a waitress finds a credit card slip from a young man who paid for her drinks. This man turns out to be innocent, but he gives the detectives the number of Jarrell's (stolen) cell phone, which leads them to a Mr. Daltry, who claims he was trying to purchase 

Extracting names or places from the text

Suppose we wanted to know the neighborhoods, parks, street corners, and other locations where crimes are most often committed in Law and Order or SVU.

We could read all of the episode summaries and look for the neighborhoods mentioned. But there are 456 episodes spanning 20 seasons in the original Law and Order series and 319+ episodes spanning 15 seasons in Special Victims Unit. That's a lot of reading to do, not to mention that you're likely to miss at least some location names.

Using NLTK and Python

For this example, we'll write a script to extract the named entities from these episode recaps programmatically.

Our general strategy will be to transform our unstructured text data into structured data. To accomplish this, we'll be using several utilities found in Python's NLTK (Natural Language Toolkit) library, a package with lots of great functions and routines for tokenizing text and learning from it.

Tokenize text into sentences

First things first. We need to break the corpus into individual sentences. This can be done using NLTK's sent_tokenize function.


The sent_tokenize function is pretty cool. Given some text, it returns the individual sentences as a Python list.

sentences = nltk.sent_tokenize(text)
print 'Original sentence\n'
print text[ text.index("This man turns out"): text.index("Frances Partell")+len("Frances Partell")+1]
print 'tokenized\n'
for i, sent in enumerate(sentences[2:]):
    print i, sent

Original sentence
This man turns out to be innocent, but he gives the detectives the number of Jarrell's (stolen) cell phone, which leads them to a Mr. Daltry, who claims he was trying to purchase ecstasy from her and calls her an incompetent drug dealer. Daltry points the detectives to a drug dealer known as Taz, whose real name is Frances Partell.

0 This man turns out to be innocent, but he gives the detectives the number of Jarrell's (stolen) cell phone, which leads them to a Mr. Daltry, who claims he was trying to purchase ecstasy from her and calls her an incompetent drug dealer.

1 Daltry points the detectives to a drug dealer known as Taz, whose real name is Frances Partell.While the detectives can find nothing connecting Partell with Jarrell's murder, they discover from his former partner, Mr. Quintana, that he was involved in the murder of a bouncer at a club in the Bronx in 1998--for which another man, Tony Shaeffer, was sentenced after he boasted to his girl-friend of the murder.

2 The detectives re-interview the witnesses of the bouncer murder and arrest Partell for it.Unfortunately, this arrest raises serious interoffice issues with the detectives, prosecutors, and DAs in the Bronx-the only reason the Manhattan office is allowed to prosecute the case in the first place is because the club is 488 yards from the county line.

3 In spite of various witnesses raising doubts about the case, DA Robertson in the Bronx refuses to reconsider the case.

4 The Manhattan DAs take the case to court nevertheless and manage to win an evidentiary hearing.

5 Lewin asks McCoy and Carmichael to strike a deal with Partell in which he admits his guilt to get Shaeffer out of jail, but even after Partell's confession DA Robertson refuses to budge.

6 McCoy takes the case to the Appellate Court and manages to win freedom for Shaeffer.

Notice that NLTK knows that the period in "Mr. Daltry" (first sentence listed above) isn't the end of a sentence. It'll also handle sentences that start with lowercase letters.

As a rule, NLTK does a really good job of tokenization without a lot of fine-tuning. If you do have some specific text that demands special behavior around tokenization, there are a lot of great options for adjusting and overriding the default behavior too.
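One adjustment worth sketching: in the output above, items 1 and 2 each contain two sentences run together ("Frances Partell.While", "arrest Partell for it.Unfortunately") because the recap text has no space after the period. Below is a minimal pre-processing sketch you could run before sent_tokenize. It assumes that a lowercase letter, a period, and a capital letter in a row always mean a missing space, which will misfire on text like "example.Com". The helper name add_missing_spaces is ours, not part of NLTK.

```python
import re

# lowercase letter, period, uppercase letter => assume a missing space
_MISSING_SPACE = re.compile(r"(?<=[a-z])\.(?=[A-Z])")

def add_missing_spaces(text):
    """Insert a space after periods that are jammed against the next sentence."""
    return _MISSING_SPACE.sub(". ", text)

sample = ("whose real name is Frances Partell.While the detectives can find "
          "nothing, a Mr. Daltry claims otherwise.")
cleaned = add_missing_spaces(sample)
print(cleaned)
```

Periods that already have a space after them, like the one in "Mr. Daltry", are left alone, so the tokenizer's abbreviation handling still applies.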

Tokenize sentences into words

Next, we need to tokenize each sentence into its individual words.

tokenized = [nltk.word_tokenize(sentence) for sentence in sentences]
print "\nFirst 20 words of the first sentence\n"
print tokenized[0][:20]

First 20 words of the first sentence

['After', 'a', 'woman', 'called', 'Angela', 'Jarrell', 'is', 'discovered', 'dead', 'in', 'front', 'of', 'a', 'bakery', ',', 'her', 'purse', ',', 'which', 'the']
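To see why punctuation shows up as separate tokens, here is a rough regular-expression approximation of word tokenization. It is not the algorithm word_tokenize uses, and rough_tokenize is a name we made up for this sketch. It happens to agree with the output above on this sentence, but it handles contractions and possessives more crudely (it splits "Jarrell's" into three pieces).

```python
import re

def rough_tokenize(sentence):
    """Split into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)

sentence = ("After a woman called Angela Jarrell is discovered dead in front "
            "of a bakery, her purse, which the detectives come across")
print(rough_tokenize(sentence)[:20])
```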

Label Parts of Speech

Finally, we need to label each word with its part of speech. This will enable us to discern nouns (and proper nouns) from everything else later on.

A lot can be said, and has been said, about part-of-speech tagging. Since many of the details are outside the scope of this blog post, I'll go through some of the basics of POS tagging using NLTK and leave some cool references at the end for anybody interested in reading more.

NLTK's pos_tag function

pos_tag is NLTK's primary off-the-shelf tagger for parts of speech. It relies on the Penn Treebank tagset and returns a list of tuples of the form (token, part_of_speech), one per input token.


If you don't have the Penn Treebank tagset installed, you can get it using NLTK's built-in downloader tool like so:

# import nltk
# nltk.download()

To illustrate how to use this function, let's take the following sentence as an example:

a = "Alan Shearer is the first player to score over a hundred Premier League goals."
a_sentences = nltk.sent_tokenize(a)
a_words     = [nltk.word_tokenize(sentence) for sentence in a_sentences]
a_pos       = [nltk.pos_tag(sentence) for sentence in a_words]

[[('Alan', 'NNP'),
  ('Shearer', 'NNP'),
  ('is', 'VBZ'),
  ('the', 'DT'),
  ('first', 'JJ'),
  ('player', 'NN'),
  ('to', 'TO'),
  ('score', 'VB'),
  ('over', 'IN'),
  ('a', 'DT'),
  ('hundred', 'CD'),
  ('Premier', 'NNP'),
  ('League', 'NNP'),
  ('goals', 'NNS'),
  ('.', '.')]]

Take a look at the use of the word "over" in the above sentence. The 'IN' tag in the tuple ('over', 'IN') indicates that it's being used as a preposition in the phrase "over a hundred."


b = "Hank Mardukas was over-served at the bar last night."
b_sentences = nltk.sent_tokenize(b)
b_words     = [nltk.word_tokenize(sentence) for sentence in b_sentences]
b_pos       = [nltk.pos_tag(sentence) for sentence in b_words]

[[('Hank', 'NNP'),
  ('Mardukas', 'NNP'),
  ('was', 'VBD'),
  ('over-served', 'JJ'),
  ('at', 'IN'),
  ('the', 'DT'),
  ('bar', 'NN'),
  ('last', 'JJ'),
  ('night', 'NN'),
  ('.', '.')]]

This time, "over-served" is tagged as 'JJ' (adjective). The tokenizer keeps the hyphenated word together as a single token, and the tagger treats it as an adjective describing Hank and the potentially embarrassing state in which he found himself at the bar last night.

We can apply this routine to our Law and Order episode recaps like so:

pos_tags  = [nltk.pos_tag(sentence) for sentence in tokenized]
print "First 10 (word, parts of speech) in the first sentence\n"
print pos_tags[0][:10]

First 10 (word, parts of speech) in the first sentence

[('After', 'IN'), ('a', 'DT'), ('woman', 'NN'), ('called', 'VBN'), ('Angela', 'NNP'), ('Jarrell', 'NNP'), ('is', 'VBZ'), ('discovered', 'VBN'), ('dead', 'JJ'), ('in', 'IN')]
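As a quick illustration of what the tags buy us, consecutive proper-noun tags can be glued together into candidate names. This is a deliberately naive sketch, not a replacement for a real chunker, and the helper proper_noun_runs is ours rather than an NLTK function. It uses the tagged tokens printed above as literal data.

```python
from itertools import groupby

# the first 10 tagged tokens printed above
tagged = [('After', 'IN'), ('a', 'DT'), ('woman', 'NN'), ('called', 'VBN'),
          ('Angela', 'NNP'), ('Jarrell', 'NNP'), ('is', 'VBZ'),
          ('discovered', 'VBN'), ('dead', 'JJ'), ('in', 'IN')]

def is_proper_noun(pair):
    """True for (word, tag) pairs tagged as singular or plural proper nouns."""
    return pair[1] in ('NNP', 'NNPS')

def proper_noun_runs(tagged_tokens):
    """Join each run of adjacent proper-noun tokens into one string."""
    runs = []
    for proper, group in groupby(tagged_tokens, key=is_proper_noun):
        if proper:
            runs.append(' '.join(word for word, tag in group))
    return runs

print(proper_noun_runs(tagged))  # ['Angela Jarrell']
```

This says nothing about what kind of thing each name is (a person, a place, a company), which is the gap that named entity chunking fills.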

Here's a list of parts-of-speech abbreviations for reference.

CC Coordinating conjunction
CD Cardinal number
DT Determiner
EX Existential there
FW Foreign word
IN Preposition or subordinating conjunction
JJ Adjective
JJR Adjective, comparative
JJS Adjective, superlative
LS List item marker
MD Modal
NN Noun, singular or mass
NNS Noun, plural
NNP Proper noun, singular
NNPS Proper noun, plural
PDT Predeterminer
POS Possessive ending
PRP Personal pronoun
PRP$ Possessive pronoun
RB Adverb
RBR Adverb, comparative
RBS Adverb, superlative
RP Particle
SYM Symbol
TO to
UH Interjection
VB Verb, base form
VBD Verb, past tense
VBG Verb, gerund or present participle
VBN Verb, past participle
VBP Verb, non-3rd person singular present
VBZ Verb, 3rd person singular present
WDT Wh-determiner
WP Wh-pronoun
WP$ Possessive wh-pronoun
WRB Wh-adverb

Extracting the Entities

NLTK gives us some really powerful methods for isolating entities in text. One of the simplest is the batch_ne_chunk function (renamed ne_chunk_sents in NLTK 3), which takes a list of sentences that have been tagged with parts of speech and returns a list of named entity chunk trees, one per sentence.

named_entity_chunks =  nltk.batch_ne_chunk(pos_tags)
print sentences[0]
print named_entity_chunks[0][:9]
After a woman called Angela Jarrell is discovered dead in front of a bakery, her purse, which the detectives come across two blocks away, is found to contain a bag of ecstasy.
[('After', 'IN'), ('a', 'DT'), ('woman', 'NN'), ('called', 'VBN'), Tree('PERSON', [('Angela', 'NNP'), ('Jarrell', 'NNP')]), ('is', 'VBZ'), ('discovered', 'VBN'), ('dead', 'JJ'), ('in', 'IN')]
print 'List of tagged tokens'
print [nltk.pos_tag(sentence) for sentence in tokenized][:1][0][:6]
print 'List of entity chunks'
print nltk.batch_ne_chunk(pos_tags)[:1][0][:6]
List of tagged tokens
[('After', 'IN'), ('a', 'DT'), ('woman', 'NN'), ('called', 'VBN'), ('Angela', 'NNP'), ('Jarrell', 'NNP')]

List of entity chunks
[('After', 'IN'), ('a', 'DT'), ('woman', 'NN'), ('called', 'VBN'), Tree('PERSON', [('Angela', 'NNP'), ('Jarrell', 'NNP')]), ('is', 'VBZ')]

This is what we're after:

Tree('PERSON', [('Angela', 'NNP'), ('Jarrell', 'NNP')])

NLTK recognizes "Angela" and "Jarrell" as names and groups them into a single PERSON chunk. The two tokens are still stored separately inside the subtree, so to get the full name ("Angela Jarrell") we'll need to join them ourselves. If the default chunker doesn't give you the behavior you're shooting for, there are a few ways to tune or replace it, so look to the NLTK docs for specifics.

Pause for a second

So far we:

  • Took a corpus of text and split it up into sentences using sent_tokenize
  • Split each sentence into word tokens using word_tokenize
  • Tagged each token with its part of speech using pos_tag
  • Converted the tagged tokens into entity chunks using batch_ne_chunk

Let's wrap this routine into reusable functions.

Helper to read `corpus` column without the other columns

Pretty simple. This'll let us read the top n non-null corpuses quickly.

# helper function to read in text corpuses only
def read_texts(f='./episodes_and_recaps.txt', n_samples=5):
    "returns non-null text corpuses for the top n rows"
    df = pd.read_csv(f,sep='|')
    df = df[df.corpus.notnull()]
    corpuses = df.corpus.head(n_samples).tolist()
    return corpuses

Take in raw text. Output tagged entity chunks.

We just copied and pasted the lines we already wrote into one function. This puts a corpus through the four steps we did above: text => sentences => words => parts of speech => entity chunks.

def parts_of_speech(corpus):
    "returns named entity chunks in a given text"
    sentences = nltk.sent_tokenize(corpus)
    tokenized = [nltk.word_tokenize(sentence) for sentence in sentences]
    pos_tags  = [nltk.pos_tag(sentence) for sentence in tokenized]
    return nltk.batch_ne_chunk(pos_tags, binary=True)

Find all the unique named entities.

This one will extract named entities from the entity chunks.

def find_entities(chunks):
    "given list of tagged parts of speech, returns unique named entities"

    def traverse(tree):
        "recursively traverses an nltk.tree.Tree to find named entities"
        entity_names = []
        # only Tree objects have a node label; leaves are (word, tag) tuples
        if hasattr(tree, 'node') and tree.node:
            if tree.node == 'NE':
                entity_names.append(' '.join([child[0] for child in tree]))
            else:
                # not an entity itself, so look inside its children
                for child in tree:
                    entity_names.extend(traverse(child))
        return entity_names

    named_entities = []
    for chunk in chunks:
        entities = sorted(list(set([word for tree in chunk
                            for word in traverse(tree)])))
        for e in entities:
            if e not in named_entities:
                named_entities.append(e)
    return named_entities
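The helpers above are written against the NLTK 2 API (batch_ne_chunk, tree.node). In NLTK 3 the batch function is named ne_chunk_sents and a tree's label is read with tree.label(). Here is a sketch of the same pipeline under that assumption. The names entity_names and find_entities_nltk3 are ours, the sketch keeps first-seen order rather than sorting within each sentence, and it needs the data packages for the tokenizer, tagger, and chunker to be installed through nltk.download().

```python
def entity_names(tree, label="NE"):
    """Collect the text of every subtree carrying `label`, depth first."""
    names = []
    for child in tree:
        # subtrees expose label(); leaves are plain (word, tag) tuples
        if hasattr(child, "label"):
            if child.label() == label:
                names.append(" ".join(word for word, tag in child.leaves()))
            else:
                names.extend(entity_names(child, label))
    return names


def find_entities_nltk3(corpus):
    """text => sentences => words => POS tags => unique entity names."""
    import nltk  # imported here so entity_names stays usable on its own

    sentences = nltk.sent_tokenize(corpus)
    tokenized = [nltk.word_tokenize(sentence) for sentence in sentences]
    tagged = [nltk.pos_tag(tokens) for tokens in tokenized]

    seen = []
    # binary=True labels every entity 'NE' instead of PERSON, GPE, etc.
    for tree in nltk.ne_chunk_sents(tagged, binary=True):
        for name in entity_names(tree):
            if name not in seen:
                seen.append(name)
    return seen
```

Calling find_entities_nltk3(text) on a recap should give a list shaped like the one find_entities returns, though the exact entities can differ between NLTK versions and models.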

Test it out

text = read_texts(n_samples=1)[0][:500]
print text
David Moore, a wealthy Manhattanite, comes rushing into the lobby of his apartment building, carrying his comatose wife Joan in his arms. In the hospital, Joan Moore is diagnosed as suffering from insulin shock, even though she is not a diabetic. On further investigation, Green and Briscoe find out that Joan is actually suffering from Parkinson's disease?and, unbeknownst to her husband, has been receiving treatment from Dr. Richard Shipman?and suspect that her husband is either attempting to
entity_chunks  = parts_of_speech(text)
find_entities(entity_chunks)
['David Moore',
 'Joan Moore',
 'Bertrand Stokes',
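Returning to the question posed earlier (which names come up most often), one way to go from per-recap entity lists to a ranking is to tally them. This is a minimal sketch that assumes entities_per_recap holds one find_entities result per recap. The lists below are hypothetical stand-ins, not real output.

```python
from collections import Counter

# hypothetical per-recap results, shaped like the output of find_entities
entities_per_recap = [
    ["David Moore", "Joan Moore", "Briscoe", "Green"],
    ["Briscoe", "Green", "McCoy"],
    ["McCoy", "Briscoe"],
]

# each recap contributes at most one count per entity
counts = Counter(name
                 for entities in entities_per_recap
                 for name in set(entities))

for name, n in counts.most_common(3):
    print("%s: %d" % (name, n))
```

Because parts_of_speech passes binary=True, every chunk is labeled NE, so a tally like this counts people and places alike. To count only locations you would need the chunker's non-binary labels (NLTK's default label set includes types such as GPE and LOCATION) and a filter on them.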

This is cool. Now what?

Applications for text-based entity extraction are far-ranging, from exploring product conversations on Facebook to uncovering terrorist plots in emails to analyzing free-response answers in customer surveys.

Once you've built and tuned a model for your particular use-case, you can use it to power your app or use it within your CRM via Yhat.

Wrap the code you already wrote in a class

Define a subclass of yhat.BaseModel. Implement require, transform, and predict as usual. Here we require NLTK but not numpy or pandas since Yhat will load those for us by default.

from yhat import Yhat, BaseModel

class NamedEntityFindr(BaseModel):
    def require(self):
        import nltk

    def transform(self, raw):
        "uses the parts_of_speech function we wrote earlier"
        rows = pd.Series(raw['data'])
        rows = rows.apply(parts_of_speech)
        return rows

    def predict(self, chunked):
        "uses the find_entities function we wrote earlier"
        res = chunked.apply(find_entities).values
        return {'entities': res.tolist()[0]} # returns a nice dictionary

Create an instance of that class

One super helpful feature of Yhat is that you can use any helper/utility functions you've written within your class.

If you've referenced any functions from other parts of your script like the two we wrote, parts_of_speech and find_entities, you can pass those to your classifier when you create it by using the udfs argument. UDF is short for user defined function.

clf = NamedEntityFindr(
    udfs=[find_entities, parts_of_speech]
)

This lets you explicitly tell Yhat which functions you want to use in production.

Test it out locally

data = {'data': [text]}
print data
{'data': ["David Moore, a wealthy Manhattanite, comes rushing into the lobby of his apartment building, carrying his comatose wife Joan in his arms. In the hospital, Joan Moore is diagnosed as suffering from insulin shock, even though she is not a diabetic. On further investigation, Green and Briscoe find out that Joan is actually suffering from Parkinson's disease\xe2\x80\x94and, unbeknownst to her husband, has been receiving treatment from Dr. Richard Shipman\xe2\x80\x94and suspect that her husband is either attempting to kill her or helping her commit suicide.Trying to trace the source of the insulin, the detectives run into the psychiatrist Bertrand Stokes, who has previously smuggled that substance into the country. Stokes' wife, however, explains that the insulin is part of a sex game in which a number of men inject their wives with insulin, have intercourse with them, tape the entire event, and swap the tapes among each other. David Moore admits that this is indeed happening, but claims he kept it from the police in order to spare his wife the embarrassment. Because he is accused of murdering Joan, David loses custody of her (and her fortune) to Joan's daughter, Debbie Mann.On further investigation, it turns out that Joan's Parkinson's disease is not natural, but induced by the drug MPTP. Remnants of this drug are subsequently found in Debbie's office, and it turns out that her company is on the brink of bankruptcy and is only kept afloat by an infusion of Joan's money. Green and Briscoe find out that Debbie and Shipman, who previously did research on MPTP, know each other and get Debbie to confess that the two conspired to slowly poison Joan. However, Debbie claims she pulled out of the arrangement at the last minute and that Shipman administered the fatal dosage of MPTP on his own. With the threat of reviving Joan through the drug L-dopa, Shipman finally confesses that this scenario is indeed true."]}

On your local machine, you need to transform your data using the transform function you wrote before you can call predict. In production, Yhat will do the transformation for you.

print 'Results on my local machine'
transformed = clf.transform(data)
results = clf.predict(transformed)
print results
Results on my local machine
{'entities': ['David Moore', 'Joan Moore', 'Bertrand Stokes', 'Briscoe', 'Green', 'Joan', 'Parkinson', 'Richard', 'David', 'Debbie', 'MPTP', 'Shipman']}

Deploy to Yhat

print yh.upload("NamedEntityFindr", clf)
uploading... done!
{u'status': u'success', u'modelname': u'NamedEntityFindr', u'version': 1}
[model for model in yh.show_models()['models'] if model['name'] == "NamedEntityFindr"]
[{u'className': u'NamedEntityFindr',
  u'name': u'NamedEntityFindr',
  u'username': u'austin',
  u'version': 1}]
results_from_server = yh.raw_predict("NamedEntityFindr", 1, data)

And here are the results from the model deployed to Yhat.

{u'execution_time': 1.5051939487457275,
 u'model': u'NamedEntityFindr',
 u'prediction': {u'entities': [u'David Moore',
   u'Joan Moore',
   u'Bertrand Stokes',
 u'run_date': 1372907338,
 u'status': u'success',
 u'user': u'austin',
 u'version': 1}
print 'sanity check.'
print 'results all match => %s' \
    % np.all(np.array(results['entities']) == np.array(results_from_server['prediction']['entities']))
sanity check.
results all match => True

Give it a try!

Episode Recap

  • Show: SVU
  • Season: 11
  • Episode: 3
  • Title: Solitary
  • Corpus URL:

  • Episode Recap Text:

    A young woman, Lily, is kidnapped from her apartment. Her boyfriend comes home and finds out what happens. He assumes their creepy neighbor kidnapped his girlfriend. Elliot and Olivia knock on the guy's door and find a guy named Donovan that Elliot had put into jail decades before. It is eventually revealed that his lawyer is Jessica Walter.Lily is found floating in the Hudson River. She is still alive. She wakes up and can't remember the specifics of what happened to her. She was pretty certain their creepy neighbor is the one that kidnapped her. Things turn around when a women working at a copy store tells the police that she saw Lily arguing with an Asian man around midnight, after she was supposedly taken. It turns out that Lily has a drug problem she wanted to keep a secret from her boyfriend. Elliot went to apologize to Donovan. Donovan was mistaken as to why Elliot went to find him and pushes Elliot off the roof.At the trial, there was a debate over the treatment of the prisoners in solitary confinement. Donovan challenges Elliot to understand what it is like to be completely alone. To refute this, Elliot decides to go into solitary confinement himself. Elliot goes crazy only being in for three days and he now understands what Donovan went through. Olivia is still not convinced and is unwilling to drop the charges.

Final Thoughts

Download Law and Order dataset
Web crawlers to build the dataset yourself on GitHub

Other references
