The Yhat Blog


machine learning, data science, engineering


A new ggplot is here

by Greg


Departure from original plan

A brief note from the author...

After launching ggplot in September of 2014 to great fanfare, there was a bit of a "development hiatus". It took a while, but I finally freed up enough of my weekends to give ggplot some TLC (oh wait not that TLC, this TLC). It was long overdue but I think you'll be really happy with the results.

0.10.0 also marks a shift in project goals for python-ggplot. While the initial intention was to mimic the R-ggplot API, it's now clear that this isn't 100% necessary (and can make for some really strange looking code). In the future, you can expect there to be significant feature overlap, but not necessarily mimicry (after all, R is a little weird).

I still strongly believe that there's a place for ggplot in the python data ecosystem. matplotlib is still an absolute bastard, bokeh is focused more on web-viz, and seaborn is serving an entirely different audience. This actually comes as a surprise to me. I thought seaborn was really going to take off. But having tried using it myself, it still lacks a lot of the simple yet powerful vocabulary that the ggplot syntax gives you.

This latest release went two steps backwards and three steps forward. I deleted considerably more code than I added. I started with 26,857 lines of code and ended with 5,851 lines. A lot of sub-optimal things were corrected, things that "half-worked" got cut, and implementations got simpler. The hope is that 0.10.0 is a much better foundation for future work and will be much easier to build on top of (not to mention use)!

Now without further ado, let's talk about what's new!

Theming

ggplot 0.10.0 introduces the ability to customize your own themes. Using the theme function you can adjust the look and feel of your plot on the fly. We're also introducing support for element_text, which means you can now customize the text and fonts used within your plots!

Let's say I wanted to adjust the font and the angle at which my x-axis labels were displayed:

In addition to axis_text_x, theme currently supports the following customizations:

  • title
  • plot_title
  • plot_margin
  • axis_title
  • axis_title_x
  • axis_title_y
  • axis_text
  • axis_text_x
  • axis_text_y

It's obviously not everything you might possibly want to customize, but it's a start!

Richer expression support

ggplot has always supported basic python expressions from within aesthetics (e.g. aes(x='size**2'), aes(x='n + 1')). It was handy for one-off calculations like basic arithmetic adjustments to your data. That was all well and good, but the latest release of ggplot has taken this a (few) step(s) forward.

ggplot now supports arbitrary python expressions from within aesthetics. For some of you this might not be a big deal, but for others it might come as a liberation. Take this example...

Let's say you wanted to plot a histogram of the log value of the difference between A and B. Instead of having to make a new column in your data frame to do this, you can make it happen from within your aesthetics:

aes(x='np.log(B - A)')
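To make the idea concrete, here is a minimal sketch of how a string expression can be resolved against column names plus whatever modules are in scope. It is an illustration under assumptions, not ggplot's actual implementation: evaluate_aesthetic is a hypothetical helper, the rows are made-up values, and the standard library's math module stands in for numpy so the snippet runs without extra dependencies.

```python
import math


def evaluate_aesthetic(expression, rows, modules=None):
    """Evaluate a string expression once per row (hypothetical helper).

    Column names are bound to that row's values; `modules` plays the role
    of "whatever is imported in your session".
    """
    env = {"math": math}
    env.update(modules or {})
    # Compile once, then evaluate against each row's columns.
    code = compile(expression, "<aes>", "eval")
    return [eval(code, env, dict(row)) for row in rows]


# Made-up toy data with columns A and B.
rows = [
    {"A": 1.0, "B": 2.0},
    {"A": 2.0, "B": 12.0},
    {"A": 3.0, "B": 103.0},
]

# The log of the difference between B and A, with no extra column needed.
log_diff = evaluate_aesthetic("math.log10(B - A)", rows)
print(log_diff)
```

Because the module namespace is passed in explicitly, anything placed there becomes available to the expression. Keep in mind that eval runs arbitrary code, so a sketch like this should only ever see strings you trust.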

Note that I used a numpy function from within my expression! And it's not just numpy. Anything that you've got imported into your python session is going to work. Take this pandas example:

New Datasets

In addition to features/bug-fixes, ggplot got a data facelift. We've added the following example datasets to the package:

  • chopsticks: chopstick effectiveness experiment results
  • mpg: breakdown of fuel efficiency for different make/models of cars
  • pigeons: pigeon race results
  • salmon: catch quotas/rates for different species of salmon

For those of you who are ggplot2 users, you might recognize mpg. It's been a part of the ggplot2 package for the past couple of years and is featured prominently in the documentation.

   food_pinching_efficiency  individual  chopstick_length
0                     19.55           1               180
1                     27.24           2               180
2                     28.76           3               180
3                     31.19           4               180
4                     21.91           5               180

0.10.0 welcomes the chopsticks dataset

Others might recognize the chopsticks and pigeons datasets from a blog post we did about a year ago that showcased some less popular (but very fun) datasets. The post was really well received (it appears the data community has a particular interest in novelty datasets) and so we thought we'd incorporate some of the datasets into ggplot! Hope you enjoy them!

Get ready for pigeon plots!

Major Bugfixes / New back-end

While I'm not typically one to celebrate non-user-facing changes, there were a few dramatic improvements to the ggplot "back-end" that fix a swarm of bugs and generally make it easier to use and much more reliable.

The faceting system has been rebuilt from the ground up. In the initial implementation, facets were an afterthought. This turned out to be a huge pain because subplots really dictate how a plot gets constructed.

Intertwined with the faceting system is ggplot's data splitting/aggregation manager. In the past this didn't work very well with faceting. So we got rid of that too! It's been replaced by a much shorter, simpler set of functions/code.
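To see why subplots dictate how a plot gets constructed, consider a rough sketch of facet-aware data splitting: the data has to be partitioned by the facet variable, and the subplot grid has to be known, before anything can be drawn. This is not the code that ships in ggplot; split_into_facets, the column names, and the values are all hypothetical.

```python
import math
from collections import defaultdict


def split_into_facets(rows, facet_key, ncol=2):
    """Group rows by facet value and assign each group a grid cell.

    Hypothetical helper: returns the groups, a {value: (row, col)} layout,
    and the overall (nrow, ncol) shape of the subplot grid.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[row[facet_key]].append(row)

    # Fill the grid left to right, top to bottom, in sorted facet order.
    layout = {value: divmod(i, ncol) for i, value in enumerate(sorted(groups))}
    nrow = math.ceil(len(groups) / ncol)
    return dict(groups), layout, (nrow, ncol)


# Made-up toy rows, faceted on a "species" column.
rows = [
    {"species": "coho", "catch": 10},
    {"species": "chinook", "catch": 7},
    {"species": "sockeye", "catch": 12},
    {"species": "coho", "catch": 4},
]

groups, layout, shape = split_into_facets(rows, "species")
print(layout, shape)
```

Each layer would then draw its own slice of the data into the cell its facet value maps to, which is why the split has to happen before any drawing rather than being bolted on afterwards.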

In addition to faceting, the new legend system is significantly better. ggplot no longer relies on matplotlib's built-in legend function. The matplotlib legend function doesn't really translate to making layered plots. We bit the bullet and rebuilt our own--turned out it wasn't that hard :).

New Docs!

One thing we realized with previous releases of ggplot was that the documentation wasn't right. We added docs for every geom, tried making function docstrings more descriptive, and even made the docs prettier. Unfortunately none of this really worked. What we realized was that while we were adding more documentation, it was all API driven (i.e. describing how individual functions worked), when what people really needed help with was how to do particular tasks using ggplot (e.g. "How do I make a histogram?").

So with the new release we're taking a more task-oriented approach to our docs. That means lots of "How To's", loads of examples, and very visual docs. The new docs use jupyter notebooks, which makes it really easy to contribute examples and update them for new releases.

You can check out the docs here. If you're interested in contributing, take a look at the gh-pages on GitHub.

Better Geoms

ggplot 0.10.0 introduces support for geom_boxplot, geom_errorbar, geom_violin, and geom_rect.

geom_boxplot

geom_errorbar

geom_violin

geom_rect

New colors and scales

Another more subtle thing you'll notice in the new ggplot is the sharper, more "ggplot2ish" colors. The new color palettes are exactly the same as ggplot2--previous versions had slightly different hue/saturation values. The result is easy-to-distinguish, better-looking plots.

In addition to better default colors, there are some new color palette / scale options as well. scale_color_funfetti / scale_fill_funfetti give you some delicious options for adding brightly colored elements to your plots. There's also a scale_color_yhat / scale_fill_yhat for all of you Yhat lovers out there that want to use our company colors for your plots (we actually added this for some of our Python tutorials but thought they looked good enough to keep in the public ggplot library!).


