The Yhat Blog


machine learning, data science, engineering


Introducing Gobenchdb

by Eric Cox


A while back we wrote a post about how we started using Go at Yhat. It has now been slightly over a year since that post and we definitely consider ourselves a "Go shop", with the majority of our codebase now written in Go. As part of our adoption of Go, we've been building out benchmarking and scaling tools for our core products. These tools allow us to assess the performance and scalability of ScienceOps and ScienceBox, and to produce benchmarks for code commits over time. I imagine that Groucho Marx sitting on a bench would be proud... (bench-Marx)

bench-Marx....

Introducing Gobenchdb

The Go standard library provides an easy way to write benchmarks, and the Go command line tool has a test command with an option to run them (go test -bench). We also wanted a way to organize the data generated by go test -bench and write it to a database so we could track system performance over time.

For this reason, we created the command line tool, gobenchdb.

Visit the repo: github.com/yhat/gobenchdb

Benchmark Tests in Go

In brief, benchmark tests are used to assess the performance characteristics of a program or of hardware. One well-known example is the LINPACK Benchmark, which is used to rank the top 500 supercomputers in the world. Benchmarks can also be used to describe language performance improvements. For example, the latest State of Go talk has a nice plot that compares the performance of Go 1.0 to Go 1.4 using benchmark data. In this plot we see how the performance of the language has changed over time as well as how specific code changes have impacted performance.

Go Performance Benchmarks: Reference state-of-go-may

Writing Your Own Benchmarks

To benchmark our own products, we wanted data like the plot above so we could drill into how the code we were writing affected performance. The gobenchdb command line tool helps us do just that. Before discussing how gobenchdb works, let's see how to write a benchmark test in Go, run go test -bench, and look at the data it produces. The example we'll use throughout this post is a trusty sorting function.

Here is the function borrowed from the Go standard library sort package.

func MySort(data sort.Interface, a, b int) {
    // Insertion sort borrowed from the std library.
    for i := a + 1; i < b; i++ {
        for j := i; j > a && data.Less(j, j-1); j-- {
            data.Swap(j, j-1)
        }
    }
}

Below is a simple benchmark test that allocates an integer slice of length 1000 and uses the MySort function to sort it.

func BenchmarkMySort1K(b *testing.B) {
    // Benchmarks MySort on a random int slice of size 1K.
    // The timer is stopped while the input is built, so only the sort is timed.
    b.StopTimer()
    for i := 0; i < b.N; i++ {
        data := make([]int, 1000)
        for j := range data {
            data[j] = rand.Int()
        }
        b.StartTimer()
        MySort(sort.IntSlice(data), 0, len(data))
        b.StopTimer()
    }
}

The benchmark body runs b.N times. What is interesting is that the testing package adjusts b.N during execution, increasing it until the benchmark runs long enough to be timed reliably. Now you can use go test -bench to run the benchmark. Just cd to the directory where your benchmark lives and run go test -bench with a regex that selects which benchmark test(s) to run. The pattern . matches every benchmark defined, and the -benchmem flag prints memory allocation statistics.

$ go test -bench="." -benchmem
testing: warning: no tests to run
PASS
BenchmarkMySort1K   10000    298578 ns/op      32 B/op       1 allocs/op
ok  github.com/yhat/gobenchdb/benchdb    3.974s

The output tells us that BenchmarkMySort1K ran with b.N = 10000, took 298578 nanoseconds per operation, and allocated 32 bytes in 1 allocation per operation.

Using gobenchdb

If we run gobenchdb in the same directory that we just ran go test -bench="." -benchmem in, here is what happens:

$ gobenchdb -conn="postgres://yhat:foopass@localhost:5432/benchmarks" -table="mysort"
testing: warning: no tests to run
PASS
BenchmarkMySort1K    5000    418017 ns/op      32 B/op       1 allocs/op
ok  github.com/yhat/gobenchdb/benchdb    2.821s

Notice that the usual go test -bench output is still written to stdout (the numbers differ slightly because this is a separate run), and the data is also written to the mysort table of a Postgres database defined by the connection string postgres://yhat:foopass@localhost:5432/benchmarks. The output written to the database from one gobenchdb run looks like this:

benchmarks=> select * from mysort where name='MySort1K' limit 1;
 id | batch_id | latest_sha |          datetime          |   name   |  n   |  ns_op  | allocated_bytes_op | allocs_op
----+----------+------------+----------------------------+----------+------+---------+--------------------+----------
 67 | 5e8be3e  | daea568    | 2015-07-01 20:35:34.245573 | MySort1K |  200 | 8346359 |                 32 |         1

The syntax for gobenchdb is straightforward.

Usage: gobenchdb [options...]

Options:
  -conn        sql database connection string
  -table       sql table name
  -test.bench  run only those benchmarks matching the regular expression

All that you need to do is create a simple schema.

-- postgres
-- Name the table to match the -table flag (the examples in this post use mysort).
CREATE TABLE IF NOT EXISTS benchmarks (
    id                    serial primary key,
    batch_id              varchar(50),
    latest_sha            varchar(50),
    datetime              timestamp without time zone,
    name                  varchar(50),
    n                     integer,
    ns_op                 double precision,
    allocated_bytes_op    integer,
    allocs_op             integer
);

After that, you will be able to track your benchmarks over time. You can install gobenchdb using:

go get github.com/yhat/gobenchdb
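To make the link between the schema and the stored rows concrete, here is a small sketch of how a parameterized INSERT for those columns could be built. This is an illustration only and not gobenchdb's actual code; the function name insertStatement and the table-name check are assumptions made for this example.

```go
package main

import (
	"fmt"
	"regexp"
)

// identRe accepts plain SQL identifiers. Table names cannot be passed as
// query parameters, so the name is validated before being interpolated.
var identRe = regexp.MustCompile(`^[A-Za-z_][A-Za-z0-9_]*$`)

// insertStatement (hypothetical helper) returns a Postgres-style INSERT
// for the columns in the schema above, with $1..$8 placeholders.
func insertStatement(table string) (string, error) {
	if !identRe.MatchString(table) {
		return "", fmt.Errorf("invalid table name %q", table)
	}
	return fmt.Sprintf(
		"INSERT INTO %s (batch_id, latest_sha, datetime, name, n, ns_op, allocated_bytes_op, allocs_op) "+
			"VALUES ($1, $2, $3, $4, $5, $6, $7, $8)", table), nil
}

func main() {
	q, err := insertStatement("mysort")
	if err != nil {
		panic(err)
	}
	fmt.Println(q)
}
```

With database/sql and a Postgres driver of your choice, a statement like this would be passed to db.Exec along with the eight values for one benchmark row.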

Tracking Benchmark Data Over Time

The main use case for gobenchdb is to organize go test -bench data for a large Go project over time. If your project uses git and has a git history, gobenchdb will add the latest commit sha to your benchmark data. If your project does not use git, the latest sha will be ignored. Recall that a row in the database looks like this:

benchmarks=> select * from mysort where name='MySort1K' limit 1;
 id | batch_id | latest_sha |          datetime          |   name   |  n   |  ns_op  | allocated_bytes_op | allocs_op
----+----------+------------+----------------------------+----------+------+---------+--------------------+----------
 67 | 5e8be3e  | daea568    | 2015-07-01 20:35:34.245573 | MySort1K |  200 | 8346359 |                 32 |         1

As you can see, gobenchdb parses each line of go test -bench output using the Go parse package, then adds a batch_id, the latest_sha (the git sha of HEAD), and a timestamp. The batch_id identifies separate runs of gobenchdb, and the latest_sha lets us map changes in source code to benchmark times.
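The steps just described can be sketched with the standard library alone. This is a hand-rolled illustration, not gobenchdb's implementation (which uses the parse package mentioned above); the Record type and the helpers parseBenchLine, newBatchID, and latestSHA are hypothetical names, and the 7-character batch id format is only a guess based on the rows shown in this post.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"os/exec"
	"strconv"
	"strings"
	"time"
)

// Record is a hypothetical row layout matching the columns shown above.
type Record struct {
	BatchID, LatestSHA, Name string
	Datetime                 time.Time
	N                        int
	NsOp                     float64
	AllocatedBytesOp         int
	AllocsOp                 int
}

// parseBenchLine parses one line of `go test -bench -benchmem` output.
// It returns false for lines that are not benchmark results (PASS, ok, ...).
func parseBenchLine(line string) (Record, bool) {
	f := strings.Fields(line)
	if len(f) < 4 || !strings.HasPrefix(f[0], "Benchmark") {
		return Record{}, false
	}
	n, err := strconv.Atoi(f[1])
	if err != nil {
		return Record{}, false
	}
	// Drop the "Benchmark" prefix and any numeric "-8" style suffix.
	name := strings.TrimPrefix(f[0], "Benchmark")
	if i := strings.LastIndex(name, "-"); i > 0 {
		if _, err := strconv.Atoi(name[i+1:]); err == nil {
			name = name[:i]
		}
	}
	rec := Record{Name: name, N: n}
	// The remaining fields come in "value unit" pairs.
	for i := 2; i+1 < len(f); i += 2 {
		v, err := strconv.ParseFloat(f[i], 64)
		if err != nil {
			return Record{}, false
		}
		switch f[i+1] {
		case "ns/op":
			rec.NsOp = v
		case "B/op":
			rec.AllocatedBytesOp = int(v)
		case "allocs/op":
			rec.AllocsOp = int(v)
		}
	}
	return rec, true
}

// newBatchID returns 7 random hex characters to tag one run (assumed format).
func newBatchID() (string, error) {
	buf := make([]byte, 4)
	if _, err := rand.Read(buf); err != nil {
		return "", err
	}
	return hex.EncodeToString(buf)[:7], nil
}

// latestSHA returns the short sha of HEAD, or "" when git is unavailable
// or the current directory is not a git repository.
func latestSHA() string {
	out, err := exec.Command("git", "rev-parse", "--short", "HEAD").Output()
	if err != nil {
		return ""
	}
	return strings.TrimSpace(string(out))
}

func main() {
	line := "BenchmarkMySort1K   10000    298578 ns/op      32 B/op       1 allocs/op"
	rec, ok := parseBenchLine(line)
	if !ok {
		fmt.Println("not a benchmark line")
		return
	}
	id, err := newBatchID()
	if err != nil {
		panic(err)
	}
	rec.BatchID, rec.LatestSHA, rec.Datetime = id, latestSHA(), time.Now().UTC()
	fmt.Printf("%+v\n", rec)
}
```

Every benchmark line from a single run would share the same batch id, which is what lets separate runs be told apart later.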

Here is how this works with the MySort example. We added the MySort example to the benchdb package tests so we can have a git history. If you navigate to GitHub, you will see the initial commit with a sha of daea568. Suppose that gobenchdb had been running for a week and then a code change was made to use the sort package from the Go standard library. The new code simply calls sort.Sort from package sort.

func MySort(data sort.Interface, a, b int) {
    sort.Sort(data)
}

The head of the repo will now have a different git sha, 010fa05. Assuming we ran gobenchdb every day for a week, we could then query our database.

benchmarks=> select * from mysort where name='MySort1K';
 id | batch_id | latest_sha |          datetime          |   name   |  n   |  ns_op  | allocated_bytes_op | allocs_op
----+----------+------------+----------------------------+----------+------+---------+--------------------+----------
 67 | 5e8be3e  | daea568    | 2015-07-01 20:35:34.245573 | MySort1K |  200 | 8346359 |                 32 |         1
 68 | df475e9  | daea568    | 2015-07-01 20:35:38.477753 | MySort1K |  200 | 8336226 |                 32 |         1
 69 | 8e6a1a8  | daea568    | 2015-07-01 20:35:43.729183 | MySort1K |  200 | 8396928 |                 32 |         1
 70 | b74cd68  | daea568    | 2015-07-01 20:35:48.836333 | MySort1K |  200 | 7358012 |                 32 |         1
 71 | 8781ef4  | daea568    | 2015-07-01 20:35:59.240162 | MySort1K |  300 | 8376768 |                 32 |         1
 72 | ee1e8c3  | daea568    | 2015-07-01 20:36:04.354633 | MySort1K |  200 | 8326351 |                 32 |         1
 73 | c90da13  | daea568    | 2015-07-01 20:36:10.165424 | MySort1K |  200 | 8434493 |                 32 |         1
 74 | b15f2b3  | 010fa05    | 2015-07-01 20:38:06.957301 | MySort1K | 3000 |  551273 |                 32 |         1
 75 | 9a63dd8  | 010fa05    | 2015-07-01 20:38:11.788717 | MySort1K | 3000 |  548965 |                 32 |         1
 76 | e7b2197  | 010fa05    | 2015-07-01 20:38:17.000288 | MySort1K | 3000 |  547492 |                 32 |         1
 77 | 9ccc321  | 010fa05    | 2015-07-01 20:38:29.453575 | MySort1K | 3000 |  537147 |                 32 |         1
 78 | eb117c3  | 010fa05    | 2015-07-01 20:38:35.805571 | MySort1K | 3000 |  551933 |                 32 |         1
 79 | 64f59e6  | 010fa05    | 2015-07-01 20:38:42.095991 | MySort1K | 5000 |  536412 |                 32 |         1
 80 | ea13e78  | 010fa05    | 2015-07-01 20:38:49.778406 | MySort1K | 3000 |  548242 |                 32 |         1

We can now observe the change in performance of MySort by plotting our data.

MySort Performance Benchmarks
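The same comparison can be made numerically by averaging ns_op per commit. The sketch below hard-codes the rows from the query above instead of reading them from Postgres; the row type and the meanBySHA helper are hypothetical names for this example.

```go
package main

import "fmt"

// row holds the two columns needed for the comparison.
type row struct {
	sha  string
	nsOp float64
}

// meanBySHA returns the mean ns/op for each sha, plus the shas in the
// order they first appear so the output follows the commit history.
func meanBySHA(rows []row) (map[string]float64, []string) {
	sum := map[string]float64{}
	count := map[string]int{}
	var order []string
	for _, r := range rows {
		if count[r.sha] == 0 {
			order = append(order, r.sha)
		}
		sum[r.sha] += r.nsOp
		count[r.sha]++
	}
	mean := make(map[string]float64, len(sum))
	for sha, s := range sum {
		mean[sha] = s / float64(count[sha])
	}
	return mean, order
}

func main() {
	// ns_op values copied from the mysort table shown above.
	rows := []row{
		{"daea568", 8346359}, {"daea568", 8336226}, {"daea568", 8396928},
		{"daea568", 7358012}, {"daea568", 8376768}, {"daea568", 8326351},
		{"daea568", 8434493},
		{"010fa05", 551273}, {"010fa05", 548965}, {"010fa05", 547492},
		{"010fa05", 537147}, {"010fa05", 551933}, {"010fa05", 536412},
		{"010fa05", 548242},
	}
	mean, order := meanBySHA(rows)
	for _, sha := range order {
		fmt.Printf("%s  mean ns/op = %.0f\n", sha, mean[sha])
	}
	fmt.Printf("speedup: %.1fx\n", mean["daea568"]/mean["010fa05"])
}
```

For the rows in this post, the switch to sort.Sort works out to roughly a 15x reduction in time per operation.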

There you have it! We can now map code changes to performance benchmarks and track the data over time. A good way to automate this process is by using Jenkins or executing a script on a cron job.

BONUS: Running Benchmarks on Jenkins

We recently started using Jenkins to build ScienceOps and ScienceBox Linux distributions on a nightly basis and also rigged up our Jenkins server to kick off benchmark tests using gobenchdb. Doing this allows us to keep tabs on our benchmarks on a daily basis and to track how code changes affect performance. Getting Jenkins to run your benchmarks using gobenchdb is very straightforward: all that's needed is to write a build script. Here is an example bash script that should help you get up and running.

#!/bin/bash

# Make temp dir for your goroot and gopath
GOTMP=$(mktemp -d)
finish() {
    rm -rf "$GOTMP"
}
trap finish EXIT

# Get a Go distribution (1.4.2 here)
curl -s https://storage.googleapis.com/golang/go1.4.2.linux-amd64.tar.gz | tar -C "$GOTMP" -xz

# Setup your gopath
export GOROOT=$GOTMP/go
export PATH=$GOROOT/bin:$PATH
mkdir -p "$GOTMP/gopath"
export GOPATH=$GOTMP/gopath
export PATH=$GOPATH/bin:$PATH

# Go get gobenchdb
cd "$GOPATH"
go get github.com/yhat/gobenchdb

# Clone your favorite repo into its import path under GOPATH
mkdir -p "$GOPATH/src/github.com/docker"
cd "$GOPATH/src/github.com/docker"
git clone git@github.com:docker/docker.git

# Run gobenchdb from inside the cloned repo
cd docker
gobenchdb -conn="postgres://yhat:foopass@localhost:5432/dbname" -table="docker_benchmarks"

Final Thoughts

There are docs for gobenchdb on GoDoc, and the source code is on GitHub. We've implemented a benchdb package with a BenchDB interface that will allow other databases to be used in the near future. At the time of this post, an implementation of BenchDB for Postgres works, and sqlite3 is coming soon. Feel free to contribute and add your favorite database!
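To give a feel for why an interface makes new databases easy to add, here is a purely illustrative sketch. It does not reproduce the real BenchDB interface, whose methods are documented in the benchdb package; the Store interface, the Result type, and the in-memory backend below are all hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// Result is a hypothetical, trimmed-down benchmark row.
type Result struct {
	Name string
	N    int
	NsOp float64
}

// Store is a hypothetical stand-in for a pluggable backend. The actual
// BenchDB interface in the benchdb package may look quite different.
type Store interface {
	// Write saves results to the named table and reports how many were stored.
	Write(table string, results []Result) (int, error)
}

// memStore keeps results in memory. A Postgres or sqlite3 backend would
// satisfy the same interface with real SQL behind it.
type memStore struct {
	tables map[string][]Result
}

func newMemStore() *memStore {
	return &memStore{tables: make(map[string][]Result)}
}

func (m *memStore) Write(table string, results []Result) (int, error) {
	if table == "" {
		return 0, errors.New("empty table name")
	}
	m.tables[table] = append(m.tables[table], results...)
	return len(results), nil
}

func main() {
	// Callers depend only on Store, so backends can be swapped freely.
	var s Store = newMemStore()
	n, err := s.Write("mysort", []Result{{Name: "MySort1K", N: 10000, NsOp: 298578}})
	if err != nil {
		panic(err)
	}
	fmt.Printf("stored %d result(s)\n", n)
}
```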

That's all we'll cover today but be sure to stay tuned for more engineering posts and the latest results for our new ScienceOps and ScienceBox benchmarks!



