

Using Rodeo to Analyze Presidential Candidate Tweets

by Stephen Hsu and Michael Luong


Introduction

It's election season, and anyone who has been following it can confirm that it's been the craziest, most social-media-covered, drama-filled election yet. With the final ballot set and less than three months before the election, we thought it would be intriguing to see how the candidates compare by analyzing their speech and interactions with the public.

It seemed fitting to analyze the most talked-about election in the media through the candidates' own Twitter accounts. By reading and summarizing tweets, our goal was to figure out what kind of jargon the candidates used to capture the attention of the masses.

Let's begin!

First and foremost, we used Rodeo as our IDE of choice, alongside Twitter's API to pull the tweets and Tableau for the visualizations. The first thing to do, then, is to download Rodeo at https://www.yhat.com/products/rodeo. Next, gain access to Twitter's API by registering an app at https://dev.twitter.com/ to obtain the token keys. Finally, download Tableau (free with a student account) at www.tableau.com/Tableau-Download.

Import tools

import csv
import re  # needed for the regex cleaning below
from collections import Counter

import matplotlib.pyplot as plt
import pandas as pd
from twython import Twython

Bonus: You can also use the improved package tab in Rodeo to find and install packages.

Set up Twitter developer access

APP_KEY = "ENTER YOUR KEY HERE"
APP_SECRET = "ENTER YOUR SECRET HERE"
twitter = Twython(APP_KEY, APP_SECRET)
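
If the timeline calls below come back with an authorization error, Twython also supports Twitter's application-only OAuth 2 flow, which is sufficient for reading public timelines; a minimal sketch, following Twython's documented usage:

twitter = Twython(APP_KEY, APP_SECRET, oauth_version=2)
ACCESS_TOKEN = twitter.obtain_access_token()  # fetch an app-only bearer token
twitter = Twython(APP_KEY, access_token=ACCESS_TOKEN)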

Data mining

After the tools have been imported and the access keys created, the first step is to mine the tweets for a given username and present them in a data-friendly way. The API call below returns the tweet data in JSON format; to get the data into a format we can use, we loop over the results and pull the favorites, retweets, date, and text into distinct lists.

text = []
favorites_count = []
retweets_count = []
date = []

# get_user_timeline returns at most 200 tweets per call, so we page
# backwards through the timeline with max_id to collect more
max_id = None
for i in range(0, 17):
    user_timeline = twitter.get_user_timeline(screen_name="realDonaldTrump",
                                              count=200, include_rts=False,
                                              max_id=max_id)
    if not user_timeline:
        break
    for tweets in user_timeline:
        favorites_count.append(tweets['favorite_count'])
        retweets_count.append(tweets['retweet_count'])
        date.append(tweets['created_at'])
        text.append(tweets['text'])  # keep as str so the cleaning step works in Python 3
    max_id = user_timeline[-1]['id'] - 1  # resume just below the oldest tweet fetched

After all the lists have been formed, we can merge them into a clean, concise data frame.

Trumpdf = pd.DataFrame({"Favorites": favorites_count,
                        "Retweets": retweets_count,
                        "Date": date,
                        "Text": text})

Trumpdf.head()

The additional columns would let us track speech changes over the course of the election; for our specific goal, however, we are interested in the overall frequency of the words the candidates use, so we'll focus on the "text" list.
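
As a quick illustration of what those extra columns enable (a side sketch, not part of the original analysis), the Date column can be parsed with pandas to chart tweet volume over time:

Trumpdf['Date'] = pd.to_datetime(Trumpdf['Date'])  # parse Twitter's created_at strings
tweets_per_week = Trumpdf.set_index('Date').resample('W').size()  # tweets per calendar week
print(tweets_per_week.head())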

Data Cleaning

That brings us to the data cleaning part of the project: the "text" list contains special characters like "@" and ":", as well as capitalized and non-capitalized versions of the same words. Left as-is, these would cause errors and miscounts in any further analysis.

Our goal is to first turn the list into a single string, and then get rid of the leftover tokens that are not special characters but URL fragments and other formatting words, such as "www" and "http."

Trumpstring = ' '.join(str(e) for e in text)  # join with spaces so tweet boundaries don't fuse words
Trumpstring = re.sub(r"<a\S+", "", Trumpstring)
cleanTrumpstring = re.sub(r'\W+', ',', Trumpstring)  # collapse every non-word run into a comma
keys = cleanTrumpstring.split(",")
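
To see what that substitution does, here is a made-up tweet run through the same pattern (the sample text is ours, purely illustrative):

sample = "Thank you @FoxNews! https://t.co/abc123 #MAGA"
print(re.sub(r'\W+', ',', sample))
# -> Thank,you,FoxNews,https,t,co,abc123,MAGA

Note how URL fragments like "https", "t", and "co" survive as standalone words, which is why they show up in the stop-word list below.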

Data Wrangling

Now that the tweet sentences have been broken into individual tweet words, we can begin collecting and counting them. Our first task is to initialize an empty dictionary, then either add a new word to it or increment an existing word's count.

word_counts = {}
for word in keys:
    if word in word_counts:
        word_counts[word] += 1
    else:
        word_counts[word] = 1
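
Since Counter is already imported, the same tally can be written in one line; Counter is a dict subclass, so the filtering step below works on it unchanged:

word_counts = Counter(keys)  # equivalent to the loop above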

The dictionary now has a count for every unique word, but it is filled with words that add little meaning to our analysis. To combat this, we first keep only words mentioned more than an arbitrary threshold (50 in this case) and then delete the entries we have deemed to be stop words.

filtertrump = {k: v for k, v in word_counts.items() if v > 50}

Stopwords = ['A','AND','An','And','As','Be','C','this','they',
'Can','D','Do','Don','For','Go','He','IS','Is','It','K','M', 'I',
'O','On','P','Q','R','S','So','T','THE','That','The','Their','There',
'They','This','U','we','you','words','w','ve','u','the','that','than',
'her','he','had','for','e','d','co','a','V','TV','was','to','so','she',
't','such','some','s','re','my','m','ll','is','it','of','as','at','am',
't', 'or','our','in','do','be','them','they','their','this','were','when',
'who','with','what','amp','an','and','are','000','your','she','him','his',
'get','but','would','https','on','realDonaldTrump','have','will','all','has',
'just', 'MakeAmericaGreatAgain','now','out','about','from','by']

# iterate over a copy of the keys so entries can be deleted safely (Python 3)
for k in list(filtertrump):
    if k in Stopwords:
        del filtertrump[k]

Now let's see what filtertrump looks like.
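
Since filtertrump is an ordinary dictionary, a sorted print from the Rodeo console does the trick (the exact words and counts will vary with when the tweets were pulled):

for word, count in sorted(filtertrump.items(), key=lambda kv: -kv[1]):
    print(word, count)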

Data Visualization

Now that the mining and cleaning are finally complete, it is time for the visualizations. Our first graph is a bar chart of the top twenty most-used words on Trump's Twitter. Looking at only the top twenty words also keeps any remaining special characters and purposeless words from sneaking into the graph.

toptwenty = dict(Counter(filtertrump).most_common(20))
plt.bar(range(len(toptwenty)), list(toptwenty.values()), align='edge')
plt.xticks(range(len(toptwenty)), list(toptwenty.keys()), rotation=90)
plt.title('Trump Twenty Most Frequent Words')
plt.xlabel('Words')
plt.ylabel('Counts')
plt.show()

Similarly, we can find Hillary Clinton's top twenty words by setting the screen name to "HillaryClinton" in the mining loop earlier.
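
To avoid copy-pasting that loop per candidate, the mining step can be wrapped in a small helper; this is a sketch of our own, and fetch_tweets is not a name from the original post:

def fetch_tweets(screen_name, pages=17):
    """Pull up to pages * 200 recent tweets for one account."""
    collected = []
    max_id = None
    for _ in range(pages):
        batch = twitter.get_user_timeline(screen_name=screen_name,
                                          count=200, include_rts=False,
                                          max_id=max_id)
        if not batch:
            break
        collected.extend(batch)
        max_id = batch[-1]['id'] - 1  # page backwards past the oldest tweet seen
    return collected

clinton_tweets = fetch_tweets("HillaryClinton")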

Conclusion

Now that the first graphs are complete, we can analyze the data and look for speech tendencies and patterns between the candidates.

Trump's most-used words: names, such as "Hillary Clinton", "Trump", and references to Mike Pence.

Clinton's most-used words: "Trump", overwhelmingly, followed by mentions of groups such as "Americans"/"American", "women", and "people".

Furthermore, we can see that Hillary's most-used words relate to Donald Trump, and that they appear three times as often as her next most common word. Trump's most frequent words do include references to Clinton, but not as disproportionately. We can also see a difference in focus: Trump talks more about himself, with words such as 'me' and 'ImWithYou', while Clinton is more inclusive, with words such as 'us' and 'people'.

Additional: Using Tableau for tag clouds

Additionally, the dictionaries we built can be saved and visualized further in Tableau. Using the following code, we can write a copy of the dictionary to a CSV file on disk.

with open('Trump.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for row in filtertrump.items():
        writer.writerow(row)
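
Equivalently, since pandas is already imported, the same file can be written in one call; header=False keeps the headerless layout that gives Tableau the auto-generated F1/F2 field names used below:

pd.DataFrame(list(filtertrump.items())).to_csv('Trump.csv', index=False, header=False)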

Once this has run, the CSV file can be opened in Tableau. Under "Marks", drag and drop F1 (the text) onto Color and Text, and F2 (the counts) onto Size, to create the following tag clouds.

Congratulations! You’ve now mined data from tweets using an API, cleaned the text into a usable format, visualized the data in two different formats, and analyzed the candidates' speech.


