The Yhat Blog

machine learning, data science, engineering

Naive Bayes in Python

by Greg

Naive Bayes classification is a simple, yet effective algorithm. It's commonly used in things like text analytics and works well on both small datasets and massively scaled out, distributed systems.

How does it work?

Naive Bayes is based on, you guessed it, Bayes' theorem. Think back to your first statistics class. Bayes' theorem was that seemingly counterintuitive lecture on conditional probability.

I've tried to buy this neon sign on multiple occasions, but to no avail :(. Please email me if you know where you can buy it.

Bayes' Theorem

The neon formula above might look intimidating, but it's actually not that complicated. To explain it, instead of using "events A and B", I'm going to use something a little more familiar. Let's say the two events in question are:

A) I watched The Lego Movie today
B) I sat on the couch today

So for my 2 events, let's break it down into its Bayesian components:

I've seen The Lego Movie 10 times in the past 2 months--this is a lot, I know. I've been lucky enough that it's been playing on almost every plane I've been on (as it should be! it's a great movie for both adults and kids). Since I've watched The Lego Movie 10 out of the last 60 days, we'll say that:

P(A) = P(I watched The Lego Movie today) = 10 / 60, or ~0.17

I sit on the couch most days I'm at my apartment. I've traveled 14 days in the past 2 months, so to keep it simple, we'll assume I sat on my couch at least once on every other day (hey, it's pretty comfy).

P(B) = P(I sat on the couch today) = (60 - 14) / 60, or ~0.76

I've seen The Lego Movie 10 times, and 4 of those times were on a plane. I think it's pretty safe to assume the rest of those viewings happened comfortably in my living room. So of my 10 viewings, 6 were from the couch.

P(B|A) = P(I sat on the couch given that I watched The Lego Movie) = 6 / 10 = 0.60

Ok, ready for the magic! Using Bayes' theorem, I now have everything I need to calculate the probability that I watched The Lego Movie today given that I sat on the couch.

P(A|B) = P(B|A) * P(A) / P(B) = (0.60 * 0.17) / 0.76

P(I watched The Lego Movie given that I sat on the couch) = 0.13

And voilà! Given that I sat on the couch today, there is a 13% chance that I also watched The Lego Movie (wow, that's a lot of Lego time).
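The arithmetic above is easy to sanity-check in a few lines of Python (this snippet is mine, not part of the original post's code):

```python
# Check the Lego Movie example with Bayes' theorem
p_a = 10 / 60          # P(A): watched The Lego Movie today
p_b = (60 - 14) / 60   # P(B): sat on the couch today
p_b_given_a = 6 / 10   # P(B|A): sat on the couch, given I watched the movie

p_a_given_b = p_b_given_a * p_a / p_b  # Bayes' theorem
print(round(p_a_given_b, 2))  # 0.13
```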

Now I wonder what the probability of me watching The Lego Movie from a double decker couch would be?

Why should I use it?

Where you see Naive Bayes classifiers pop up a lot is in document classification. Naive Bayes is a great choice for this because it's pretty fast, it can handle a large number of features (i.e. words), and it's actually really effective. Take a look at what happens when you do some basic benchmarking between Naive Bayes and other methods like SVM and RandomForest against the 20 Newsgroups dataset.

Naive Bayes wins! Granted this is a relatively simple approach without much in terms of feature engineering, but in my opinion that's part of the beauty of Naive Bayes!

Code for benchmarking is available here.

Document Classification

For our example we're going to attempt to classify whether a wikipedia page is referring to a dinosaur or a cryptid (an animal from cryptozoology; think the Loch Ness Monster or Bigfoot).

My favorite Yeti, The Bumble, from the stop-motion holiday classic Rudolph the Red-nosed Reindeer

We'll be using the text from each wikipedia article as features. What we'd expect is that certain words like "sighting" or "hoax" would be more commonly found in articles about cryptozoology, while words like "fossil" would be more commonly found in articles about dinosaurs.

We'll do some basic word-tokenization to count the occurrences of each word and then calculate conditional probabilities for each word as it pertains to our 2 categories.

You can find the sample documents I used and the corresponding code here.

Tokenizing and counting

First things first. We need to turn our files full of text into something a little more mathy. The simplest way to do this is to take the bag of words approach. That just means we'll be counting how many times each word appears in each document. We'll also perform a little text normalization by removing punctuation and lowercasing the text (this means "Hello," and "hello" will now be considered the same word).

Once we've cleaned the text, we need a way to delineate words. A simple approach is to just use a good 'ole regex that splits on whitespace and punctuation: \W+.
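A minimal tokenizer along those lines might look like this (the function names are my own; the post's original code is linked above):

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase, then split on runs of non-word characters (\W+),
    # dropping the empty strings the split can leave at the edges
    return [w for w in re.split(r"\W+", text.lower()) if w]

def count_words(words):
    # Bag of words: map each token to its number of occurrences
    return Counter(words)

counts = count_words(tokenize("Hello, hello world!"))
print(counts)  # Counter({'hello': 2, 'world': 1})
```

Note that "Hello," and "hello" collapse to the same token, exactly the normalization described above.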

Calculating our probabilities

So now that we can count words, let's get cooking. The code below is going to do the following:

  • open each document
  • label it as either "crypto" or "dino" and keep track of how many of each label there are (priors)
  • count the words for the document
  • add those counts to the vocab, or a corpus level word count
  • add those counts to the word_counts, for a category level word count
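The steps above can be sketched as follows. To keep the sketch self-contained I'm using tiny in-memory stand-ins for the wikipedia text files (in the real version you'd open and read each file); the variable names `priors`, `vocab`, and `word_counts` follow the list above:

```python
import re
from collections import Counter, defaultdict

def tokenize(text):
    return [w for w in re.split(r"\W+", text.lower()) if w]

# Hypothetical stand-ins for the labeled wikipedia documents
documents = [
    ("crypto", "reported sighting of the creature was later called a hoax"),
    ("crypto", "another alleged sighting near the lake"),
    ("dino",   "the fossil was recovered from cretaceous rock"),
]

priors = defaultdict(int)            # how many docs carry each label
vocab = Counter()                    # corpus-level word counts
word_counts = defaultdict(Counter)   # category-level word counts

for label, text in documents:
    counts = Counter(tokenize(text))
    priors[label] += 1                 # track the label (priors)
    vocab.update(counts)               # corpus-level tally
    word_counts[label].update(counts)  # category-level tally

print(priors["crypto"], word_counts["crypto"]["sighting"])  # 2 2
```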

Classifying a new page

And finally it's time for the math. We're going to use the word counts we calculated in the previous step to calculate the following:

Prior Probability for each category, or for the layman, the percentage of documents that belong to each category. We have 9 crypto docs and 8 dino docs, so that gives us the following:

Prior Prob(crypto) = 9 / (8 + 9) = 0.53

Prior Prob(dino) = 8 / (8 + 9) = 0.47

Ok priors, check. The next thing we need are conditional probabilities for the words in the document we're trying to classify. How do we do that? Well we start by doing a word count on a new document. We'll use the Yeti page as our new document.

Alright, we've got our counts. Now we'll calculate P(word|category) for each word and multiply each of these conditional probabilities together to calculate the P(category|set of words). To prevent computational errors, we're going to perform the operations in logspace: multiplying lots of tiny probabilities together quickly underflows toward zero, so instead we sum the log(probability) values. More on the mystical properties of logs here and here.
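That scoring step might look like the sketch below. The counts are made up for illustration, and I'm using add-one (Laplace) smoothing for unseen words; the original post's code (linked above) may handle smoothing differently:

```python
import math
import re
from collections import Counter

def tokenize(text):
    return [w for w in re.split(r"\W+", text.lower()) if w]

# Hypothetical counts standing in for the trained model
priors = {"crypto": 9, "dino": 8}
word_counts = {
    "crypto": Counter({"sighting": 40, "hoax": 12, "fossil": 2}),
    "dino":   Counter({"fossil": 50, "sighting": 1}),
}
vocab_size = 3  # distinct words seen across both categories

def score(text, category):
    total_docs = sum(priors.values())
    log_prob = math.log(priors[category] / total_docs)  # log prior
    total_words = sum(word_counts[category].values())
    for word in tokenize(text):
        # Add-one smoothing so unseen words don't zero out the product
        count = word_counts[category].get(word, 0) + 1
        log_prob += math.log(count / (total_words + vocab_size))
    return log_prob

new_doc = "a sighting that some called a hoax"
scores = {c: score(new_doc, c) for c in priors}
print(max(scores, key=scores.get))  # crypto
```

Because everything stays in log space, the two scores are safely comparable no matter how many words the document has.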

Since we're slightly bending the rules of Bayes' Theorem, the results are not actual probabilities, but rather are "scores". All you really need to know is which one is bigger. So our suspicions are confirmed, the "Yeti.txt" file is being classified overwhelmingly in favor of crypto (as we would hope).

Bringing it home, the LEGO Yeti!

Final Thoughts

You can find all the code and documents used in this post on GitHub.

Naive Bayes is great because it's fairly easy to see what's going on under the hood. It's a great way to start any text analysis and it can easily scale out of core to work in a distributed environment. There are some excellent implementations in the Python community you can use as well, so if you don't want to roll your own, have no fear! The scikit-learn and nltk versions are great places to start.
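For reference, here's roughly what the same dino/crypto task looks like with scikit-learn's `MultinomialNB` (the toy documents are mine, not from the post's dataset):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny hypothetical training set
docs = [
    "reported sighting later revealed as a hoax",
    "another sighting of the creature",
    "fossil found in cretaceous rock",
    "the fossil record of this dinosaur",
]
labels = ["crypto", "crypto", "dino", "dino"]

vectorizer = CountVectorizer()            # bag-of-words features
X = vectorizer.fit_transform(docs)
clf = MultinomialNB().fit(X, labels)      # smoothed Naive Bayes

pred = clf.predict(vectorizer.transform(["a new sighting was a hoax"]))
print(pred[0])  # crypto
```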

Our Products

Rodeo: a native Python editor built for doing data science on your desktop.


ScienceOps: deploy predictive models in production applications without IT.


Yhat (pronounced Y-hat) provides data science solutions that let data scientists deploy and integrate predictive models into applications without IT or custom coding.