The Yhat Blog

R and pandas and what I've learned about each

by yhat

In step with our recent article about essential R packages, this post explores tools for data analysis in Python.

what is pandas?

pandas is the utility belt for data analysts using Python. The package centers on the pandas DataFrame, a two-dimensional data structure with indexable rows and columns. It has effectively taken the best parts of base R and of R packages like plyr and reshape2 and consolidated them into a single library. It has lots of features (see library highlights). pandas gets its name from panel data, an econometrics term for multidimensional structured datasets (McKinney 5., 2013).

pandas has a lot in common with R (pandas comparison with R), and as someone who's familiar with R and Python (but not specifically pandas), I've found pandas to be extremely easy to use. This is a post about R and pandas and about what I've learned about each.

Munging and Plotting in Python

  • plyr-esque features in Python

    Few tools hold a candle to pandas when it comes to Split-Apply-Combine operations. pandas groupby enables transformations, aggregations, and easy-access plotting functions. Virtually anything you can do with R's plyr package has a pandas equivalent.

    One thing I like better about groupby than, say, ddply, is the ability to perform an operation in multiple steps. pandas lets you perform the group part on one line followed by the apply part on the next. This allows you to inspect the combined results on a third line, giving you visibility into what's going on under the hood.

    Additionally, pandas is faster than plyr. In some instances I found equivalent operations to be more than 4x faster using pandas' groupby than plyr's ddply.
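The multi-step groupby workflow described above can be sketched roughly as follows. The DataFrame, its column names (`team`, `points`), and its values are invented purely for illustration; only `groupby` and `mean` come from pandas itself.

```python
import pandas as pd

# Hypothetical data, made up for this example.
df = pd.DataFrame({
    "team": ["a", "a", "b", "b", "b"],
    "points": [10, 20, 5, 15, 40],
})

# Step 1 (split): group the rows by team. Nothing is computed yet,
# but the GroupBy object can be inspected (e.g. grouped.groups).
grouped = df.groupby("team")

# Step 2 (apply): aggregate one column within each group.
mean_points = grouped["points"].mean()

# Step 3 (combine): the result is a Series indexed by the group key.
print(mean_points)
```

Keeping `grouped` around as its own variable is what makes the intermediate state inspectable; the same operation can also be chained on one line if you prefer.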

  • applying functions element-wise

    If you use R, you know that most of the time you can get by with plyr. But every once in a while you need to bust out lapply or sapply. In pandas, on the other hand, you can use apply on both DataFrames and Series.

    When you use apply on a DataFrame, you choose the axis your function works along: axis=0 (the default) passes each column to your function, and axis=1 passes each row. When you use apply on a Series, your function is called on each value of that Series.
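As a small sketch of apply on both a DataFrame and a Series, assuming a toy DataFrame with made-up columns `x` and `y`:

```python
import pandas as pd

# Hypothetical data, made up for this example.
df = pd.DataFrame({"x": [1, 2, 3], "y": [10, 20, 30]})

# axis=0: the function receives each column as a Series.
col_ranges = df.apply(lambda col: col.max() - col.min(), axis=0)

# axis=1: the function receives each row as a Series.
row_sums = df.apply(lambda row: row["x"] + row["y"], axis=1)

# On a Series, the function receives one value at a time.
doubled = df["x"].apply(lambda v: v * 2)

print(col_ranges)
print(row_sums)
print(doubled)
```

For simple arithmetic like this, vectorized expressions such as `df["x"] + df["y"]` are usually the better choice; apply is most useful when the function has no vectorized equivalent.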

  • wide to long and back again

    R's reshape2 makes it extremely easy to switch your data between wide and long formats. pandas has its own set of functions that provide this functionality. pandas also has a concept called stacking and unstacking, which lets you move levels between the row index and the columns of a pandas DataFrame.
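One way to sketch the wide-to-long round trip is with `melt` and `pivot`, followed by `stack` and `unstack`. The data and the names `id`, `height`, `weight`, `measure`, and `value` are assumptions chosen for the example.

```python
import pandas as pd

# Hypothetical wide-format data: one row per id, one column per measure.
wide = pd.DataFrame({
    "id": [1, 2],
    "height": [150, 160],
    "weight": [50, 60],
})

# Wide to long: one row per (id, measure) pair, similar to reshape2's melt.
long_df = pd.melt(wide, id_vars="id", var_name="measure", value_name="value")

# Long back to wide: one column per distinct value of "measure".
wide_again = long_df.pivot(index="id", columns="measure", values="value")

# stack moves the column labels into an inner level of the row index,
# producing a Series with a two-level index (id, column name).
stacked = wide.set_index("id").stack()

# unstack moves that inner index level back out to the columns.
unstacked = stacked.unstack()

print(long_df)
print(stacked)
```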

  • plot

    One of my favorite parts about R is that you can call plot on just about anything and R will render the graphic you'd expect.

    pandas measures up with its own out-of-the-box plotting powered by matplotlib.

    DataFrames and Series can both be plotted using the plot method along with standard hist and boxplot.

    matplotlib is an excellent plotting library, but I have to say I still prefer the look and feel of ggplot2 graphics. I always end up getting more props when I circulate ggplots. rplot is a module found in this pandas fork that provides ggplot2-like interfaces for pandas, though I'm not sure whether the fork is actively being developed at this time.
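The built-in plot, hist, and boxplot methods mentioned above might be used like this. The data is invented, and the `Agg` backend is selected only so the sketch runs without a display; in an interactive session you would normally skip that line and call `plt.show()` instead.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless environment, no GUI window
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data, made up for this example.
df = pd.DataFrame({"a": [1, 3, 2, 5, 4], "b": [2, 2, 4, 3, 6]})

# Line plot: one line per numeric column, drawn on a new figure.
line_ax = df.plot()

# Histograms: one panel per column, returned as an array of Axes.
hist_axes = df.hist()

# Box plot drawn on an explicitly created Axes.
fig, box_ax = plt.subplots()
df.boxplot(ax=box_ax)

# Release the figures once we are done with them.
plt.close("all")
```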

  • data.frame

    While the implementation might be different, pandas data frames and R data frames have a lot in common. Most of the core functionality is the same in both: they're tabular, they allow column-wise operations on your data, etc. The biggest difference I've found is the way in which you operate on the data frames themselves.

    R has a much more functional feel to it. Instead of calling a particular method on an R data frame, you invoke a function on an R data frame. pandas has a much more OOP feel to it. DataFrame methods are called on the DataFrame itself. One feature that I haven't seen in R but that comes in handy in pandas is multi-level indexing. pandas allows you to create indices based not only on row number, but also on dates, numbers, and even categorical variables.

This just scratches the surface of pandas' functionality. Another topic that isn't covered in this post is pandas' excellent time series capabilities (similar to zoo in R). They're extensive enough to merit their own post. In the meantime you can check out some of Wes McKinney's great tutorials.
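To make the multi-level indexing idea concrete, here is a minimal sketch that indexes a DataFrame by a date and a categorical column together. The columns `date`, `city`, and `sales` and all values are hypothetical.

```python
import pandas as pd

# Hypothetical data, made up for this example.
df = pd.DataFrame({
    "date": pd.to_datetime(
        ["2013-01-01", "2013-01-01", "2013-01-02", "2013-01-02"]
    ),
    "city": ["NYC", "SF", "NYC", "SF"],
    "sales": [100, 80, 120, 90],
})

# Methods are called on the DataFrame itself; here two columns
# become a two-level row index (date, then city).
indexed = df.set_index(["date", "city"]).sort_index()

# Look up a single cell with a (date, city) tuple.
nyc_day_two = indexed.loc[(pd.Timestamp("2013-01-02"), "NYC"), "sales"]

# Take a cross-section: every city for one date.
day_one = indexed.xs(pd.Timestamp("2013-01-01"), level="date")

# Aggregate over one level of the index.
by_city = indexed.groupby(level="city")["sales"].sum()

print(indexed)
print(by_city)
```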
