The Yhat Blog

machine learning, data science, engineering

Many of you are likely already familiar with Kaggle, but for those who aren't, it's a super interesting business that connects companies that have data science problems with data enthusiasts who compete to come up with the best solutions. Business problems get solved and solved well, participants learn new methods and earn bragging rights, and contest winners take home cash. Plus it's fun.

Competitions are diverse. Some are hobby-ish, like predicting passenger survival on the Titanic. Many others have real-world implications, both for the participants and for the companies (e.g. winning a job at Facebook).

This is a post about a Kaggle challenge that recently concluded and my experience tossing my hat into the ring.

Partly sunny with a chance of hashtags

Contestants in this competition were asked to identify the extent to which Twitter data could be used to forecast the weather. Right off the bat, the challenge reminded me of Google's Flu Trends, a project I find incredibly interesting, so I was fired-up to take a stab at it.

Google Flu Trends Screenshot

Just as predicting flu activity has utility in public health, the ability to accurately estimate the weather using a freely available data set such as Twitter seems to have compelling implications in both public and consumer domains (e.g. safety, fire, traffic and road conditions, etc.).

Data set

In conjunction with CrowdFlower, Kaggle compiled a dataset of weather-related tweets, which you can download here.

In [1]:
import pandas as pd
train_data = pd.read_csv('./data/train.csv')
train_data.head()
   id  tweet                                              state         location              s1  s2  s3     s4     s5     w1  w2     w3  w4  k1
0   1  Jazz for a Rainy Afternoon: {link}                 oklahoma      Oklahoma               0   0   1  0.000  0.000  0.800   0  0.200   0   0
1   2  RT: @mention: I love rainy days.                   florida       Miami-Ft. Lauderdale   0   0   0  1.000  0.000  0.196   0  0.804   0   0
2   3  Good Morning Chicago! Time to kick the Windy C...  idaho         NaN                    0   0   0  0.000  1.000  0.000   0  1.000   0   0
3   6  Preach lol! :) RT @mention: #alliwantis this t...  minnesota     Minneapolis-St. Paul   0   0   0  1.000  0.000  1.000   0  0.000   0   0
4   9  @mention good morning sunshine                     rhode island  Purgatory              0   0   0  0.403  0.597  1.000   0  0.000   0   0

The data consists of raw tweets, geography (state and location), plus a series of numeric columns (not so helpfully) named s1, s2, s3, etc. Kaggle and CrowdFlower employed an army of human raters to tag each tweet ahead of time.

Human raters assigned tags to each tweet across 3 categories:

  • sentiment (s) - the general mood of the tweet (e.g. "This rain sucks" would be negative)
  • when (w) - a temporal interpretation of when the weather did or will occur (e.g. "It's raining now" vs. "It looks like it's going to rain later")
  • kind (k) - the type of weather referenced in the tweet (e.g. "The snowflakes are huge!" would be encoded as snow)

The data is arranged with each category / factor level in its own column. Since multiple human raters tagged each tweet, the columns represent an aggregate of all labels submitted by the crowd.
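
To make the aggregation concrete, here is a minimal sketch that assumes each column is simply the unweighted share of raters who picked that label. That is an assumption: values such as 0.196 in the table above suggest the real aggregation weighted raters differently. The ratings and the `vote_fractions` helper are hypothetical.

```python
from collections import Counter

def vote_fractions(labels, levels):
    """Return the share of raters who chose each level (unweighted)."""
    counts = Counter(labels)
    total = len(labels)
    return {level: counts[level] / total for level in levels}

# Hypothetical ratings for one tweet in the "when" category
when_levels = ["w1", "w2", "w3", "w4"]
ratings = ["w1", "w1", "w1", "w1", "w3"]   # 4 of 5 raters picked w1

fractions = vote_fractions(ratings, when_levels)
print(fractions)
```

Under that assumption the four "when" columns for this tweet always sum to 1, which is also what the sample rows above show.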

In [2]:
data_dict = pd.read_csv("./data/dictionary.csv")
    column  definition for humans
 0  s1      I can't tell
 1  s2      Negative
 2  s3      Neutral / author is just sharing information
 3  s4      Positive
 4  s5      Tweet not related to weather condition
 5  w1      current (same day) weather
 6  w2      future (forecast)
 7  w3      I can't tell
 8  w4      past weather
 9  k1      clouds
10  k2      cold
11  k3      dry
12  k4      hot
13  k5      humid
14  k6      hurricane
15  k7      I can't tell
16  k8      ice
17  k9      other
18  k10     rain
19  k11     snow
20  k12     storms
21  k13     sun
22  k14     tornado
23  k15     wind

So as an example, let's find people who are currently in a hurricane. We'll create a basic filter using pandas that will grab everyone with w1 (current weather) and k6 (hurricane) greater than 0.7. Let's take a look at the first tweet we get:

In [6]:
pd.options.display.max_colwidth = 125
In [7]:
mask = (train_data.w1 > 0.7) & (train_data.k6 > 0.7)
train_data[mask].tweet.head(1)
864    Extra LARGE Marble size hail is currently falling here in Orlando, Florida during this rain storm.  Hurricane season star...
Name: tweet, dtype: object

Looking at individual tweets is cool, but let's see how these encoded variables look on the whole.

In []:
df_long = pd.melt(train_data[train_data.columns[4:]])

Let's grab the human-friendly variable names from the data dictionary file to make our plots more readable.

In [9]:
df_long = pd.merge(data_dict, df_long, left_on="column", right_on="variable")

Since the data set contains so many columns, I found it handy to use a facet/trellis plot to plot different subsets of the data all at once.

In [8]:
from IPython.core.display import Image
from ggplot import *
In [10]:
p = ggplot(aes(x='value'), data=df_long) + \
    geom_histogram(binwidth=0.1) + \
    facet_wrap("definition for humans")
# workaround for issue w/ ggplot! (
ggsave(p, "./encoded_variable_hists.png")
facet_wrap scales not yet implemented!
Saving 11.0 x 8.0 in image.


A number of variables are mostly zeros, which intuitively makes a lot of sense. After all, it's not every day that there's a hurricane or an ice storm outside. This is likely why there's less chatter about those types of weather than there is about, say, sunny, stormy or hot weather. Likewise, people seem to talk about the current weather and the future weather more than they do about weather in the past.

Applying Machine Learning to Text

There are a few machine learning techniques which operate solely on text (e.g. Bayesian spam filters); however, most methods rely on numerical features. In order to evaluate different learning methods for the prediction task at hand, I started by transforming text-based tweets into a numerical feature matrix.

That is to say, a 2D array where columns represent features and rows represent observations. Each cell, therefore, holds the value of that feature for that observation. If we can accomplish this with our Twitter data, then we can start using a diverse range of algorithms. Or as Conway and White put it:

"As long as we assume we're working with rectangular arrays, we can use powerful mathematical techniques without having to think very carefully about the actual mathematical operation being performed."

Drew Conway, John Myles White; Machine Learning for Hackers

Great! Not thinking very carefully is what I do best.

A common way to transform text into the classic machine learning rectangle is to count up the words using a bag of words approach. Let's consider a simplified version of this problem, where our entire corpus is these two tweets:

(1) "The cat in the hat"
(2) "The dog ate the cat"

Now we just count the words:

id 'the' 'cat' 'in' 'hat' 'dog' 'ate' 
(1)  2     1     1    1     0     0 
(2)  2     1     0    0     1     1 

...and we have our rectangle! Wow, that sure was easy, but we're not quite done yet.
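
The counting step can be sketched in a few lines of plain Python. This is only an illustration on the two-tweet corpus, using a deliberately naive tokenizer (lowercase, then split on whitespace); it is not what we'll use on the real data.

```python
from collections import Counter

corpus = ["The cat in the hat", "The dog ate the cat"]

# Lowercase and split on whitespace: a deliberately naive tokenizer
tokenized = [doc.lower().split() for doc in corpus]

# Build the vocabulary in order of first appearance
vocabulary = []
for tokens in tokenized:
    for token in tokens:
        if token not in vocabulary:
            vocabulary.append(token)

# One row per tweet, one column per vocabulary word
matrix = []
for tokens in tokenized:
    counts = Counter(tokens)
    matrix.append([counts[word] for word in vocabulary])

print(vocabulary)
for row in matrix:
    print(row)
```

The two printed rows reproduce the counts in the table above.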

Consider words like 'the' and 'in'. In natural language processing these are called "stop words". Stop words are incredibly prevalent in human language and intuitively hold little intrinsic information. Consequently, these words are often removed for unigram analysis (which is what we're going to do here).
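
As a rough sketch, removing stop words is just a filter over the tokens. The `STOP_WORDS` set here is a tiny hypothetical list that covers only our toy corpus; real stop word lists are much longer.

```python
# Tiny illustrative stop word list (an assumption; real lists are much longer)
STOP_WORDS = {"the", "in"}

def remove_stop_words(tokens):
    """Keep only the tokens that are not stop words."""
    return [token for token in tokens if token not in STOP_WORDS]

docs = ["The cat in the hat", "The dog ate the cat"]
filtered = [remove_stop_words(doc.lower().split()) for doc in docs]
print(filtered)
```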

If you're unfamiliar with the jargon, "unigrams" are single words devoid of context. Likewise, "bigrams" consist of two words, like 'the dog', 'dog ate', 'ate the', 'the cat', and "trigrams" are three words. You get the idea. You can actually build pretty robust language models just by counting unigrams, bigrams and trigrams, so check out the resources at the bottom if you're interested in reading more about the various methods.
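
Here is a minimal sketch of how n-grams can be pulled out of a token list with a sliding window. The `ngrams` helper is a hypothetical name written for this illustration, not a library function.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the dog ate the cat".split()

unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(bigrams)
print(trigrams)
```

A five-word tweet yields five unigrams, four bigrams and three trigrams, so the feature space grows quickly as n goes up.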

The universe of all unique words found in all of our tweets combined (aka our "vocabulary") amounts to approximately 55,000 unique words (your tokenizer's results may vary slightly). Each word becomes a column in our matrix, but since tweets only allow 140 characters, each row in the matrix ends up with a dozen or so nonzero entries. This means that more than 99.9% of the cells in our matrix are zeros. If we use all 2^3 bytes of a 64-bit integer to store each 0, suddenly our 77,946 x 55,000 feature matrix takes up roughly 2^35 bytes, or 32 gigabytes, of active memory. That seems... unnecessary.
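
The back-of-the-envelope arithmetic can be checked directly. The dense figures come straight from the numbers above; the sparse estimate is only a sketch that assumes about 12 nonzero counts per tweet and a compressed layout with 8-byte values, 4-byte column indices and 4-byte row pointers.

```python
n_rows, n_cols = 77946, 55000      # tweets x vocabulary size, from the text
bytes_per_cell = 8                 # one 64-bit integer per cell

dense_bytes = n_rows * n_cols * bytes_per_cell
print(dense_bytes)                 # 34296240000
print(dense_bytes / 2**30)         # size in GiB

# Assumption: about 12 nonzero counts per tweet
nonzeros = n_rows * 12
zero_fraction = 1 - nonzeros / (n_rows * n_cols)
print(zero_fraction)

# Assumption: 8-byte values, 4-byte column indices, 4-byte row pointers
sparse_bytes = nonzeros * (8 + 4) + (n_rows + 1) * 4
print(sparse_bytes / 2**20)        # size in MiB
```

Under those assumptions, storing only the nonzero cells takes megabytes rather than gigabytes.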

Finally, you'll notice I made some assumptions about what constitutes a unique word by grouping 'The' and 'the' together. Worry not, it'll get even more complicated.

scikit-learn FTW

Fortunately for us, scikit-learn's CountVectorizer class does what we want in one shot. And I mean everything.

This routine takes a list of raw text documents, converts characters to lower case, optionally removes stop words, and turns each tweet into a list of words (or bigrams, trigrams, or n-grams if we wanted it to). Ultimately, it produces a matrix which looks a lot like the one above. The result is a scipy.sparse matrix, which is far more efficient for storing big, sparse data in memory. Hell yeah!
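
To see why a sparse format saves memory, here is a pure-Python sketch of the idea behind compressed sparse row (CSR) storage, one of the layouts scipy.sparse provides. The `to_csr` helper is hypothetical and written only for illustration; scipy's real implementation uses NumPy arrays.

```python
def to_csr(dense_rows):
    """Convert a list of dense rows into the three CSR arrays."""
    data, indices, indptr = [], [], [0]
    for row in dense_rows:
        for col, value in enumerate(row):
            if value != 0:
                data.append(value)      # the nonzero value itself
                indices.append(col)     # the column it lives in
        indptr.append(len(data))        # where the next row starts in data
    return data, indices, indptr

# The two-tweet count matrix from the bag of words example
dense = [[2, 1, 1, 1, 0, 0],
         [2, 1, 0, 0, 1, 1]]

data, indices, indptr = to_csr(dense)
print(data)
print(indices)
print(indptr)
```

Only the nonzero counts and their positions are kept, so the zeros that dominate our matrix cost nothing to store.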

A particularly useful feature of CountVectorizer is that it lets you specify your own strategy for breaking text documents into sentences, words, or characters. If you want to provide your own tokenizer, simply pass it as an argument when you initialize your CountVectorizer; otherwise a default tokenizer will be used. As an example, let's use the nltk.word_tokenize function that Austin used in the Named Entities blog post.

NOTE: I'm only going to consider the top 3000 most common words at least to start off.

In [3]:
from sklearn.feature_extraction.text import CountVectorizer
import nltk

raw_tweets = train_data['tweet'].tolist()

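vectorizer = CountVectorizer(tokenizer=nltk.word_tokenize,
                             stop_words='english',   # assumed, per the stop word discussion above
                             max_features=3000)      # keep only the 3000 most common words

# Train the vectorizer on our vocabulary
vectorizer.fit(raw_tweets)

# Make a rectangle
x_train = vectorizer.transform(raw_tweets)
x_train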

<77946x3000 sparse matrix of type '<type 'numpy.int64'>'
	with 769165 stored elements in Compressed Sparse Row format>

Now that we have a feature matrix, we can use any of the great models included in scikit-learn. Unfortunately, because our feature matrix is so huge and because we need to train 24 different classifiers (one for each of our columns), using cross validation for testing is going to take a while. This is where we have to put on our lab coats and earn the "scientist stripes" that come w/ the name data scientist. And, as my friends who've worked in research labs tell me, sometimes being a scientist is really boring.

Fire up a couple EC2 instances, line-up a bunch of different algorithms to try, and go grab lunch. When you find a good algorithm, upload it to yhat and keep testing other ones. Don't worry, we'll keep it safe and sound.

Note: I'm giving each model a unique name according to the target variable it's designed to predict. That'll be important later.

In []:
from sklearn.linear_model import ElasticNet
from yhat import Yhat, BaseModel

yh = Yhat("USERNAME","APIKEY")

# Define a yhat classifier
class TweetClassifier(BaseModel):
    def transform(self, raw_tweets):
        import nltk
        x = self.vectorizer.transform(raw_tweets)
        return x
    def predict(self,x):
        pred = self.clf.predict(x)
        return {"scores" : pred}

# Get variable list: ['s1','s2','s3','s4','s5','w1',...]
variables = train_data.columns[4:].tolist()

for variable in variables:
    clf = ElasticNet(alpha=1e-5)
    y_train = train_data[variable].tolist()
    clf.fit(x_train, y_train)
    tweet_clf = TweetClassifier(clf=clf,vectorizer=vectorizer)
    model_name = "TweetClassifier_%s" % (variable,)
    upload_status = yh.upload(model_name,tweet_clf)
    model_version = upload_status['version']
    print('Model "%s" -- Version "%s": uploaded to yhat!' %
          (model_name, model_version))

Improving the Tokenizer ;)

All the powerful mathematics in the world won't help you if your feature matrix doesn't contain the information you need.

"Thus, feature selection may be viewed as selecting those words with the strongest signal-to-noise ratio. Pragmatically, the goal is to select whatever subset of features yields a highly accurate classifier."

George Forman, Feature Selection for Text Classification, 2-3

Weak features limit the performance of strong models and, while we can always just import another sklearn classifier, encapsulating informative signals from noisy data takes a bit more creativity.

Our feature space is composed of the unique words produced by the tokenizer. Let's take a look at a few tweets from our dataset and see how nltk.word_tokenize handles them.

In [4]:
for row in [7568, 7826, 7866]:
    raw_tweet = train_data['tweet'].get(row)
    print raw_tweet
    print nltk.word_tokenize(raw_tweet.lower()), '\n'
#WEATHER:  4:52 am : 61.0F. Feels 60F. 29.53% Humidity. 6.9MPH Southeast Wind.
['#', 'weather', ':', '4:52', 'am', ':', '61.0f.', 'feels', '60f.', '29.53', '%', 'humidity.', '6.9mph', 'southeast', 'wind', '.'] 

This weather is simply wonderful :) #fb
['this', 'weather', 'is', 'simply', 'wonderful', ':', ')', '#', 'fb'] 

103 degrees?????!?! RT @mention: This heat is something else
['103', 'degrees', '?', '?', '?', '?', '?', '!', '?', '!', 'rt', '@', 'mention', ':', 'this', 'heat', 'is', 'something', 'else'] 

Since our features only consider unique words, our model can't tell that numbers like "42" and "41.4" actually hold a lot of similar information, or that "98.0F" holds the same meaning as "98 degrees". If we want our classifier to consider numbers, times, or speeds, then we have to map these kinds of values onto common strings.

A good example of this comes from the paper Twitter Sentiment Classification using Distant Supervision by Stanford students Go, Bhayani, and Huang. They found gains in accuracy by binning usernames and URLs into similar meta strings. But my favorite part of this paper is undoubtedly the emoticon analysis. Notice in the second tweet above how ":)" is split into ':' and ')'. To strengthen the information our features convey, we want to retain the fact that "there's a smiley face here".

Let's try using regular expressions to improve the tokenization process.

In [5]:
import re

pos_emoticons = [':)',':-)',': )',':d','=)']          # ':d' is lowercase because tweets are lowercased first
neg_emoticons = [':(',':-(',': (','=(']
re_time = re.compile(r'\d{1,2}:\d\d[ ]?(am|pm)?')       # Time
re_temp = re.compile(r'(\d+\.?\d*) ?(fahrenheit|celsius|celcius|f|c|degrees|degree)(\W|$)')  # Temperature
re_velo = re.compile(r'\d+\.?\d* ?mph')                 # Velocity
re_perc = re.compile(r'\d+\.?\d* ?(%|percent)')         # Percent
re_nume = re.compile(r'(\s|^)-?\d+\.?\d*(\s|$)')        # Numeric

# Define a new tokenizer
def tweet_tokenize(tweet):
    tweet = tweet.lower()
    for emoticon in pos_emoticons:
        tweet = tweet.replace(emoticon,' SMILY ')
    for emoticon in neg_emoticons:
        tweet = tweet.replace(emoticon,' FROWNY ')
    tweet = re_time.sub(' TIME ',tweet)
    tweet = re_temp.sub(r' \1 TEMP ',tweet)
    tweet = re_velo.sub(' SPEED ',tweet)
    tweet = re_perc.sub(' PERC ',tweet)
    tweet = re_nume.sub(' NUM ',tweet)
    tokens = nltk.word_tokenize(tweet)
    return tokens

for row in [7568, 7826, 7866]:
    raw_tweet = train_data['tweet'].get(row)
    print raw_tweet
    print tweet_tokenize(raw_tweet.lower()), '\n'
#WEATHER:  4:52 am : 61.0F. Feels 60F. 29.53% Humidity. 6.9MPH Southeast Wind.
['#', 'weather', ':', 'TIME', ':', 'NUM', 'TEMP', 'feels', 'NUM', 'TEMP', 'PERC', 'humidity.', 'SPEED', 'southeast', 'wind', '.'] 

This weather is simply wonderful :) #fb
['this', 'weather', 'is', 'simply', 'wonderful', 'SMILY', '#', 'fb'] 

103 degrees?????!?! RT @mention: This heat is something else
['NUM', 'TEMP', '?', '?', '?', '?', '!', '?', '!', 'rt', '@', 'mention', ':', 'this', 'heat', 'is', 'something', 'else'] 

That's a bit better. As you can see, by mapping rare tokens like numbers and smileys onto common strings, we convey to our learning algorithm that they share a common meaning. There are still a lot of similar improvements we can make, but this is a good place to start.
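
To see the effect on the feature space, here is a small sketch that reuses the numeric pattern from the tokenizer above on three made-up tweets and compares the vocabulary before and after the mapping. It splits on whitespace instead of calling nltk so that it runs with the standard library alone.

```python
import re

# Same numeric pattern as re_nume in the tokenizer above
re_nume = re.compile(r'(\s|^)-?\d+\.?\d*(\s|$)')

# Made-up tweets that differ only in the number they mention
tweets = ["high of 42 today", "high of 41.4 today", "high of 39 today"]

raw_vocab = set()
norm_vocab = set()
for tweet in tweets:
    raw_vocab.update(tweet.split())
    norm_vocab.update(re_nume.sub(' NUM ', tweet).split())

print(sorted(raw_vocab))
print(sorted(norm_vocab))
```

Three separate, rarely seen number tokens collapse into a single NUM column that shows up in every one of these tweets.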

Submitting to Kaggle: yhat to the Rescue!

What adds complexity to this competition is that it immediately requires 24 different classifiers, one for each variable. What's worse, if we really want to optimize our performance, we have to consider the fact that each one of those variables may require different model types, parameters, parsing strategies, etc. There are an enormous number of changes that our machine learning pipeline may go through, any of which may have a positive or negative impact on the accuracy for a given variable.

Fear not.

Because Yhat stores the entire routine, I never have to keep track of any of the internals. I can pass unprocessed tweets to any model and it will give me predictions. The only data I retain is which model version produced the best results for each variable.
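
That bookkeeping can be as small as one dictionary. The sketch below assumes you have recorded a validation error for every (variable, version) pair; the numbers are made up purely for illustration and `validation_error` is a hypothetical name.

```python
# Hypothetical validation errors keyed by (variable, model version); lower is better
validation_error = {
    ("s1", 1): 0.21, ("s1", 2): 0.19,
    ("w1", 1): 0.30, ("w1", 2): 0.33,
}

best_model = {}   # variable -> version with the lowest error so far
best_error = {}   # variable -> that lowest error
for (variable, version), error in validation_error.items():
    if variable not in best_error or error < best_error[variable]:
        best_error[variable] = error
        best_model[variable] = version

print(best_model)
```

The resulting dictionary maps each variable to the version worth keeping, which is all the scoring script needs to know.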

This means that when I'm writing the program to score and submit to Kaggle, I don't have to dig through old git commits to remember which tokenizers or algorithms I have to reimplement. Suddenly that whole process gets condensed down to this:

In []:
__filename__ = ""

from yhat import Yhat
import pandas as pd
import numpy as np

best_model = {
 's1':1, 's2':1, 's3':1, 's4':1, 's5':1,
 'w1':1, 'w2':1, 'w3':1, 'w4':1,
 'k1':1, 'k2':1, 'k3':1, 'k4':1, 'k5':1,
 'k6':1, 'k7':1, 'k8':1, 'k9':1, 'k10':1,
 'k11':1, 'k12':1, 'k13':1, 'k14':1, 'k15':1
}

# Read data
test_data = pd.read_csv('./data/test.csv')
sub_data = pd.read_csv('./data/sampleSubmission.csv')

raw_tweets = test_data['tweet'].tolist()

yh = Yhat("USERNAME", "APIKEY")

# Get variable list: ['s1','s2','s3','s4','s5','w1',...]
variables = sub_data.columns[1:]
for variable in variables:
    model_version = best_model[variable]
    model_name = "TweetClassifier_%s" % (variable,)
    # Get scores from yhat
    # (argument order assumed: model name, version, then the raw tweets)
    results_from_server = yh.raw_predict(model_name,
                                         model_version,
                                         raw_tweets)
    pred = results_from_server['prediction']['scores']
    sub_data[variable] = pred

Final Thoughts

The full version of these code snippets is on GitHub.
