Naive Bayes classification is a simple, yet effective algorithm. It's commonly used in things like text analytics and works well on both small datasets and massively scaled out, distributed systems.
How does it work?
Naive Bayes is based on, you guessed it, Bayes' theorem. Think back to your first statistics class. Bayes' theorem was that seemingly counterintuitive lecture on conditional probability.
The neon formula above might look intimidating, but it's actually not that
complicated. To explain it, instead of using "events
B", I'm going to
use something a little more familiar. Let's say the two events in question are:
A) I watched The Lego Movie today B) I sat on the couch today
So for my 2 events, let's break it down into it's Bayesian components:
I've seen The Lego Movie 10 times in the past 2 months--this is a lot, I know. I've been lucky enough that it's been playing on almost every plane I've been on (as it should be! it's a great movie for both adults and kids). Since I've watched The Lego Movie 10 out of the last 60 days, we'll say that:
P(A) = P(I watched The Lego Movie today) = 10 / 60, or ~0.17
I sit on the couch most days I'm at my apartment. I've traveled 14 days in the past 2 months, so to keep it simple, we'll assume I sat on my couch at least once on every other day (hey, it's pretty comfy).
P(B) = P(I sat on the couch today) = (60 - 14) / 60, or ~0.76
I've seen The Lego Movie 10 times and 4 of those times have been on a plane. I think it's pretty safe to assume the rest of those times I was seated comfortably in my living room. So given that I've had 46 days of couchtime in the past 2 months, we can say that I watched The Lego Movie from my couch 6 / 10 times.
P(B|A) = P(I sat on the couch given that I watched The Lego Movie) = 6 / 10 = 0.60
Ok, ready for the magic! Using Bayes' theorem, I now have everything I need to calculate the Probability that I watched The Lego Movie today given that I sat on the couch.
P(A|B)=P(B|A)*P(A)/P(B) = (0.60 * 0.17) / 0.76
P(I watched The Lego Movie given that I sat on the couch) = 0.13
And voilà! Given that I sat on the couch today, there is a 13% chance that I also watched The Lego Movie (wow, that's a lot of Lego time).
Now I wonder what the probability of me watching The Lego Movie from a double decker couch would be?
Why should I use it?
Where you see Naive Bayes classifiers pop up a lot is in document classification. Naive Bayes is a great choice for this because it's pretty fast, it can handle a large number of features (i.e. words), and it's actually really effective. Take a look at what happens when you do some basic benchmarking between Naive Bayes and other methods like SVM and RandomForest against the 20 Newsgroups dataset.
Naive Bayes wins! Granted this is a relatively simple approach without much in terms of feature engineering, but in my opinion that's part of the beauty of Naive Bayes!
Code for benchmarking is available here.
For our example we're going to be attempting to classify whether a wikipedia page is referring to a dinosaur or a cryptid (an animal from cryptozoology. Think Lochness Monster or Bigfoot).
We'll be using the text from each wikipedia article as features. What we'd expect is that certain words like "sighting" or "hoax" would be more commonly found in articles about cryptozoology, while words like "fossil" would be more commonly found in articles about dinosaurs.
We'll do some basic word-tokenization to count the occurrences of each word and then calculate conditional probabilities for each word as it pertains to our 2 categories.
Tokenizing and counting
First things first. We need to turn our files full of text into something a little more mathy. The simplest way to do this is to take the bag of words approach. That just means we'll be counting how many times each word appears in each document. We'll also perform a little text normalization by removing punctuation and lowercasing the text (this means "Hello," and "hello" will now be considered the same word).
Once we've cleaned the text, we need a way to delineate words. A simple approach
is to just use a good 'ole regex that splits on
whitespace and punctuation:
Calculating our probabilities
So now that we can count words, let's get cooking. The code below is going to do the following:
- open each document
- label it as either "crypto" or "dino" and keep track of how many of each label
there are (
- count the words for the document
- add those counts to the
vocab, or a corpus level word count
- add those counts to the
word_counts, for a category level word count
Classifying a new page
And finally it's time for the math. We're going to use the word counts we calculated in the previous step to calculate the following:
Prior Probability for each category, or for the layman, the percentage of documents that belong to each category. We have 9 crypto docs and 8 dino docs, so that gives us the following:
Prior Prob(crypto) = 9 / (8 + 9) = 0.53
Prior Prob(dino) = 8 / (8 + 9) = 0.47
Ok priors, check. The next thing we need are conditional probabilities for the words in the document we're trying to classify. How do we do that? Well we start by doing a word count on a new document. We'll use the Yeti page as our new document.
Alright, we've got our
counts. Now we'll calculate
P(word|category) for each word and multiply each of these conditional probabilities
together to calculate the
P(category|set of words). To prevent computational errors, we're going to perform the operations in logspace. All
this means is we're going to use the log(probability) so we require fewer decimal places. More on the mystical properties of logs here and here.
Since we're slightly bending the rules of Bayes' Theorem, the results are not actual probabilities, but rather are "scores".
All you really need to know is which one is bigger.
So our suspicions are confirmed, the "Yeti.txt" file is being classified overwhelmingly in favor of
crypto (as we would hope).
You can find all the code and documents used in this post on GitHub.
Naive Bayes is great because it's fairly easy to see what's going on under the hood. It's a great way to start any text analysis and it can easily scale out of core to work in a distributed environment. There are some excellent implementations in the Python community you can use as well, so if you don't want to roll your own, have no fear! The scikit-learn and nltk versions are great places to start.