The Yhat Blog


machine learning, data science, engineering


Building a Client-side Blog Search Algorithm

by Greg


The Blog gets too big

A few months ago we noticed that our blog was getting really hard to navigate. One of the most frequent requests we got was for the ability to search for posts. It sounds obvious, but we were all a bit surprised. Searching for posts?!? How many posts can we possibly have? Do we really have so many that people need to search them?

Turns out we do. This post will be the 62nd one we've released in the past 2 years. The blog is roughly 16,261 lines, 100,961 words, or 951,130 characters depending on how you're counting. That was all it took for us to realize that we needed a better way for users to find our content!

Static Sites

We're big fans of static sites. They're cheap, easy to use, and don't break. Even when you're on the front page of Hacker News. Our blog uses Cactus, which I cannot recommend highly enough.

However, when we decided we wanted to add search, having a static site became problematic. A static site meant we had to find a purely client-side search solution. After looking at a few search-as-a-service companies, we failed to find one that met our requirements: cheap or free, client-side, and capable of being implemented in less than 20 minutes.

That said, we also really didn't want to roll our own search engine. That's when it hit us: "Why don't we just use ScienceOps for this?"

No server, no mercy.

ScienceOps to the Rescue

For those of you that don't know, ScienceOps is our predictive model deployment product. It lets you take analytical routines you've written in Python or R and turn them into APIs. It's easy to use, quick to get going, AND (most importantly here) it can be called directly from the browser.

I tossed together a short script that would read in all of our blog post titles and then store them in a Python dictionary. I then wrapped that in a ScienceOps model and added a query parameter, q. The model takes q and does a quick search through all of the titles and then returns the best results.

It all looks something like this:

from yhat import Yhat, YhatModel
import sh

# map each page's path to its title
docs = {}
# "-type f" limits find to regular files, so open() is never handed a directory
for page in sh.find("pages", "-type", "f").strip().split("\n"):
    # the title is the first line of each post
    with open(page) as f:
        docs[page] = f.read().split("\n")[0]

class Search(YhatModel):
    def execute(self, data):
        q = data.get("q", "")
        # collect every page whose title contains the query as a substring
        results = []
        for path, title in docs.items():
            if q in title:
                results.append(path)
        return results

yh = Yhat("greg", "apikeygoeshere", "http://cloud.yhathq.com/")
print(yh.deploy("BlogSearch", Search, globals(), True))

Bam! After deploying our model to ScienceOps we immediately have ourselves a simple, easy-to-use search engine for our blog and we've only taken up 5 minutes of our allotted 20 minutes.

curl -X POST -H "Content-Type: application/json" \
    --user greg:apikeygoeshere \
    --data '{"q":"Python"}' \
    http://cloud.yhathq.com/greg/models/BlogSearch/
{
  "yhat_model": "BlogSearch",
  "yhat_id": "8186aa5d-878a-42f6-9d95-c67a82517441",
  "result": [
    "pages/posts/image-classification-in-Python.html",
    "pages/posts/11-python-libraries-you-might-not-know.html",
    "pages/posts/digit-recognition-with-node-and-python.html",
    "pages/posts/sparse-random-projections.html",
    "pages/posts/logistic-regression-and-python.html",
    "pages/posts/comparing-random-forests-in-python-and-r.html",
    "pages/posts/setting-up-scientific-python.html",
    "pages/posts/naive-bayes-in-python.html",
    "pages/posts/classification-using-knn-and-python.html",
    "pages/posts/random-forests-in-python.html",
    "pages/posts/the-beer-bandit.html",
    "pages/posts/data-science-in-python-tutorial.html"
  ],
  "version": "ea14ce5"
}
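
For anyone who would rather make the same call from Python than from curl, here is a minimal sketch that builds an equivalent request using only the Python 3 standard library. It assumes the endpoint accepts HTTP basic auth and a JSON body exactly as the curl command above suggests. The helper name build_search_request is hypothetical, and the request is only constructed here, not sent.

```python
import base64
import json
import urllib.request

def build_search_request(username, apikey, query,
                         base_url="http://cloud.yhathq.com/"):
    """Build (but do not send) a POST request mirroring the curl call."""
    # same URL layout as in the curl example: <base>/<username>/models/BlogSearch/
    url = "{}{}/models/BlogSearch/".format(base_url, username)
    # curl's --user flag sends HTTP basic auth: base64("username:apikey")
    token = base64.b64encode("{}:{}".format(username, apikey).encode("utf-8"))
    headers = {
        "Content-Type": "application/json",
        "Authorization": "Basic " + token.decode("ascii"),
    }
    body = json.dumps({"q": query}).encode("utf-8")
    return urllib.request.Request(url, data=body, headers=headers, method="POST")

req = build_search_request("greg", "apikeygoeshere", "Python")
print(req.get_method(), req.full_url)
```

Passing the request to urllib.request.urlopen would actually send it, and the JSON response could then be parsed with json.loads.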

Improving our model: The Points Method

As you can imagine, just checking whether or not a string exists in a title doesn't make for great search results. I decided I needed to improve my results just a little bit.
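
To make the weakness concrete, here is a small standalone sketch of plain substring matching. The paths and titles are made up for illustration, and substring_search is a hypothetical helper, not part of the deployed model.

```python
# Hypothetical paths and titles, used only for illustration.
titles = {
    "pages/posts/a.html": "Random Forests in Python",
    "pages/posts/b.html": "Comparing Random Forests in Python and R",
}

def substring_search(q, titles):
    """Return paths whose title contains q as a raw, case-sensitive substring."""
    return sorted(path for path, title in titles.items() if q in title)

# lowercase "python" never appears, because the titles spell it "Python"
print(substring_search("python", titles))  # → []

# "R" matches both titles, since it also occurs inside "Random"
print(substring_search("R", titles))
```

The match is case-sensitive, it fires on fragments of words, and nothing in it ranks one hit above another.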

I decided to go with the "points approach". It's super simple: just assign a score to each page based on how many times the search query appears in the title and in the content of the post. In the event that a query is more than 1 word, we just treat each word as its own query and add up each post's points.

It's crude, simple, and doesn't scale, but we don't need that right now! We just need something that works for 62 pages.

import operator
from yhat import Yhat, YhatModel
import sh

# map each page's path to a (title, content) tuple
docs = {}
for page in sh.find("pages", "-type", "f").strip().split("\n"):
    with open(page) as f:
        content = f.read()
    # the title is the first line of each post
    docs[page] = (content.split("\n")[0], content)

class Search(YhatModel):
    def execute(self, data):
        q = data.get("q", "").lower()
        results = {}
        # score every page against each word in the query
        for subquery in q.split():
            for path, (title, content) in docs.items():
                # title hits are weighted 3x; every hit in the full text adds 1
                points = 3 * title.lower().split().count(subquery)
                points += content.lower().split().count(subquery)
                if points > 0:
                    results[path] = results.get(path, 0) + points
        # best-scoring pages first
        results = sorted(results.items(), key=operator.itemgetter(1), reverse=True)
        return [item[0] for item in results]
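
To see how the points add up without deploying anything, here is a minimal standalone sketch of the same scoring idea. The posts are invented for illustration, score is a hypothetical helper, and unlike the model above the title is kept separate from the content, so a title word is not counted a second time in the body.

```python
import operator

# Hypothetical posts: path -> (title, content). Not taken from the real blog.
docs = {
    "pages/posts/knn.html": ("Classification using KNN and Python",
                             "KNN is a simple classifier and we use Python here"),
    "pages/posts/beer.html": ("The Beer Bandit",
                              "We recommend beer with a bandit written in Python"),
    "pages/posts/r.html": ("Plotting in R",
                           "ggplot2 makes plots easy"),
}

def score(query, docs, title_weight=3):
    """Return (path, points) pairs, best first, using whole-word counts."""
    results = {}
    # each word of the query is scored on its own and the points are summed
    for subquery in query.lower().split():
        for path, (title, content) in docs.items():
            points = title_weight * title.lower().split().count(subquery)
            points += content.lower().split().count(subquery)
            if points > 0:
                results[path] = results.get(path, 0) + points
    return sorted(results.items(), key=operator.itemgetter(1), reverse=True)

print(score("python knn", docs))
# → [('pages/posts/knn.html', 8), ('pages/posts/beer.html', 1)]
```

The KNN post earns 3 + 1 points for "python" and another 3 + 1 for "knn", while the beer post only mentions "python" once in its body. Because words are split on whitespace alone, a token like "Python." with trailing punctuation would not count as a match for "python".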

Stick it on the front-end

Luckily for us, we've already built a Yhat JavaScript client. That means that to actually use my model, all I have to do is add a tiny bit of JavaScript to execute the search and then filter the displayed pages accordingly.

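$("#search").submit(function(e) {
  // stop the form from reloading the page
  e.preventDefault();
  var data = { q: $("#query").val() };
  yhat.predict("greg", "myapikey", "BlogSearch", data, function(err, d) {
    // leave the page untouched if the call failed or returned nothing usable
    if (err || !d || !d.result) {
      return;
    }
    // hide every post, then reveal only the ones the model returned
    $(".blog-post").addClass("hide");
    d.result.forEach(function(item) {
      $('*[data-path="' + item + '"]').removeClass("hide");
    });
  });
  return false;
});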

Searching for the post about searching the Yhat Blog using the search term "Yhat Blog". Meta, I know.

How Yhat uses Yhat to power the Yhat Blog...

Well there you have it. Now you know the story behind how that little search bar at the top of the page got there.

To learn more about ScienceOps or our other products, go here.


