

Making Sense of Everything with words2map

by Lance Legel


We are now at a point in history when algorithms can learn, like people, about pretty much anything. Facebook is pursuing “computer services that have better perception than people”, while researchers at Google aim to “solve intelligence”. At overlap.ai we’re building artificial intelligence to unite people through their overlapping passions, and here we introduce a framework we call words2map for mapping what our users love, like these personal passions of ours:

Let’s explain the code that made this pretty picture from raw text alone, since it serves as the basis for our recommender system. When we set out to automatically recommend groups and events that people may wish to join, we didn’t have any data to start with, and yet the unreasonable effectiveness of data was not lost on us. So we decided to build from a pre-trained machine learning model that has basically already seen all there is to see: we set up a word2vec model from Google trained on 100 billion words — just a few orders of magnitude bigger than Wikipedia. And indeed we found amazing insights embedded in the vectors, like:

  • human + robot ≈ cyborg
  • electricity + silicon ≈ solar cells
  • virtual reality + reality ≈ augmented reality

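Queries like these can be reproduced directly with gensim, as a quick way to poke at the model (a minimal sketch; the filename is the standard pre-trained GoogleNews download, and the words are illustrative):

from gensim.models import KeyedVectors

# Load Google's pre-trained vectors: 3 million words and phrases, 300 dimensions
model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

# "human + robot": gensim averages the normalized vectors, then ranks neighbors
print(model.most_similar(positive=['human', 'robot'], topn=3))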
Futurism fetish aside, the math in these examples is fairly simple. We call this “adding vectors”, but below you’ll see that what we are really doing is averaging each element across all the vectors. Let’s dissect the guts, since deft applications of this vector algebra will take us far in terms of hacking intelligence into and out of the system.
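Here is a reconstruction of that averaging, element by element (the function names are ours, not necessarily the repo’s; model is the gensim object loaded above):

import numpy as np

def add_vectors(words, model):
    # "Adding" is really element-wise averaging across all the 300-d word vectors
    vectors = np.array([model[word] for word in words])
    return np.mean(vectors, axis=0)

def nearest_words(vector, model, k=5):
    # Rank the whole vocabulary by cosine similarity to the composed vector
    return model.similar_by_vector(vector, topn=k)

cyborg_ish = add_vectors(['human', 'robot'], model)
print(nearest_words(cyborg_ish, model))  # the inputs rank first; 'cyborg' follows close behind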

The key point above is that new vectors are simply nearby existing vectors: king - man + woman ≈ queen, but only approximately, never exactly. From this, it is clear that any new word can be introduced to the model simply by adding the vectors of existing words. This idea is similar to how humans add words to their own vocabulary with a dictionary: consider some familiar words and combine their meanings to figure out the new one. Our A.I. responds to unknown words by searching the web for them, extracting keywords from relevant websites, and adding the vectors for those keywords it knows, to produce a new vector representing the new thing. This feels like it shouldn’t work, like becoming a scientist just by reading Wikipedia instead of doing real science experiments, but surprisingly its performance is robust. Here is another cool map from words2map (which you can easily download and play with yourself) that demonstrates this.
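In code, deriving a vector for an unknown word can look like the sketch below. search_web and extract_keywords are hypothetical helpers standing in for whatever search API and keyword extractor you prefer; the averaging at the end is the part that matters:

import numpy as np

def derive_vector(unknown_word, model):
    # search_web: hypothetical helper returning text from the top search results
    # extract_keywords: hypothetical helper pulling out salient terms (e.g. by TF-IDF)
    documents = search_web(unknown_word)
    keywords = extract_keywords(documents)
    known_vectors = [model[word] for word in keywords if word in model]
    # Average the vectors of the keywords the model already knows
    return np.mean(known_vectors, axis=0)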

Once each word vector is derived in 300 dimensions (from the word2vec model trained on 100 billion words), we apply t-SNE to embed the vectors in 2D, as x and y coordinates for data visualization, as seen above. This is nice for visualization, and it’s also very helpful in our pipeline because it removes noise in the derived vectors by forcing a new mapping based purely on relative similarity. For this reason we use the low-dimensional coordinates of each word in our recommender system.
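With scikit-learn, that reduction takes a few lines (the cosine metric is our assumption, as a natural distance for word vectors, not necessarily the repo’s exact setting):

from sklearn.manifold import TSNE

# vectors: an (n_words, 300) array of derived word vectors
# perplexity must stay below the number of points being embedded
tsne = TSNE(n_components=2, metric='cosine', perplexity=25)
coordinates = tsne.fit_transform(vectors)  # (n_words, 2) array of x, y positions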

The above pipeline is strong enough for many applications, but in our case we wanted to go further and uncover clusters of the kinds of activities our users really enjoy (and, just as quickly, infer the kinds of things they don’t). That way, we can recommend more or less of these things to you, and do so with high precision purely through A.I. At this point in our research we tried ball trees as a way to identify clusters.
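One way to probe cluster structure with a ball tree is to group points whose neighbors fall within a fixed radius; a rough sketch with scikit-learn (the radius is illustrative):

from sklearn.neighbors import BallTree

# coordinates: the (n_words, 2) t-SNE output from above
tree = BallTree(coordinates)
# For each point, collect every point within radius r of it on the 2D map
neighbor_indices = tree.query_radius(coordinates, r=5.0)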

Upon close examination, these clusters are not bad, but we felt they weren’t as responsive to the topology of the underlying data as we’d like. So we continued researching and found a clustering algorithm that works really well for our complex distributions: HDBSCAN, i.e. “hierarchical density-based spatial clustering of applications with noise”. It sounds fancy, but it’s very simple to work with, given that you only need to feed in the 2D coordinates.
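With the hdbscan package the clustering step really is this short (min_cluster_size is the main knob; the value here is illustrative):

import hdbscan

clusterer = hdbscan.HDBSCAN(min_cluster_size=5)
cluster_labels = clusterer.fit_predict(coordinates)  # label -1 marks noise points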

The full pipeline for all of this word vector hackery comes together in just a few steps.
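Here is a minimal end-to-end sketch that reuses the pieces above (the naming is ours, not necessarily the repo’s):

import numpy as np
import hdbscan
from gensim.models import KeyedVectors
from sklearn.manifold import TSNE

model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

words = ['snowboarding', 'espresso', 'jazz', 'robotics', 'yoga', 'meditation']
# Fall back to a web-derived vector for anything outside the vocabulary
vectors = np.array([model[w] if w in model else derive_vector(w, model) for w in words])

# Embed to 2D (perplexity must stay below the number of words)
coordinates = TSNE(n_components=2, perplexity=5).fit_transform(vectors)

# Cluster the 2D map; -1 means noise, i.e. no cluster
labels = hdbscan.HDBSCAN(min_cluster_size=2).fit_predict(coordinates)

for word, (x, y), label in zip(words, coordinates, labels):
    print(word, x, y, label)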

We’re very happy to make words2map available on GitHub, and have worked hard to make sure that almost anyone with a Mac or Linux terminal can quickly download and play with it by copying and pasting:

git clone https://github.com/overlap-ai/words2map.git
cd words2map
./install.sh

We welcome critical feedback, and we invite anyone interested in advancing the state of this art across machine learning + data visualization to join us in making it better for everyone.

As of today, words2map is mapping every group that our users launch through our iPhone app — which you can download now. We’ve tested it, and know it reliably derives reasonable vectors in an online way for any topic that’s learnable on the web — i.e. just about every topic. Readers are invited to try out words2map for any type of natural language processing, and to share their maps with the #words2map hashtag.

Readers are also invited to overlap with us over A.I. and data science, in New York City and beyond. This summer we’re hosting coffee chats, data hackathons, and other fun stuff so we can connect, join forces, and hack cool stuff together. Just download the app here and soon you'll be nerding with us IRL.

Big thanks to the data hipsters at Yhat for helping us share something new on their blog, and for spearheading great open source data science tools like Rodeo and ggplot. And thanks to so many scientists, engineers, and leaders who have helped make all of this possible. We love you.


