The Yhat Blog

machine learning, data science, engineering

SQL for pandas DataFrames

by yhat |

We recently wrote a post about 10 of our favorite R packages, and the response has been wonderful:

We've had the opportunity to chat with several of you about your experiences learning and using R and Python, and we're extremely appreciative for your questions, candor, and suggestions.

Many experienced R users graciously wrote to thank us for turning them on to lesser known R packages like qcc and forecast, and a lot of you blogged, emailed, and tweeted about discovering the sqldf package through our post which we really appreciate. This did not go unnoticed!

This is a post about pandasql, a library we're open-sourcing for Python which lets you use SQL on pandas dataframes.



pandasql is a Python package for running SQL statements on pandas DataFrames. It has a lot in common with the sqldf package in R.

Behind the scenes, pandasql uses the module to transfer data between DataFrames and SQLite databases. Operations are performed in SQL, the results returned, and the database is then torn down. The library makes heavy use of pandas write_frame and frame_query, two functions which let you read and write to/from pandas and (most) any SQL database.


You can use pip to install pandasql. Alternatively, you can clone the Github repo and install from source.

$ pip install -U pandasql

Built-in Datasets

pandasql has two built-in datasets which we'll use for the examples below.



Write some SQL and execute it against your pandas DataFrame by substituting DataFrames for tables.

pandasql creates a DB, schema and all, loads your data, and runs your SQL.


pandasql supports aggregation. You can use aliased column names or column numbers in your group by clause.

locals() vs. globals()

pandasql needs to have access to other variables in your session/environment. You can pass locals() to pandasql when executing a SQL statement, but if you're running a lot of queries that might be a pain. To avoid passing locals all the time, you can add this helper function to your script to set globals() like so:


You can join dataframes using normal SQL syntax.

WHERE conditions

Here's a WHERE clause.

It's just SQL

Since pandasql is powered by SQLite3, you can do most anything you can do in SQL. Here are some examples using common SQL features such as subqueries, order by, functions, and unions.

Final thoughts

pandas is an incredible tool for data analysis in large part, we think, because it is extremely digestible, succinct, and expressive. Ultimately, there's a ton of reasons to learn the nuances of merge, join, concatenate, melt and other native pandas features for slicing and dicing data. Check out the docs for some examples.

However, our hope is that pandasql will be a helpful learning tool for folks new to Python and pandas. In my own personal experience learning R, sqldf was a familiar interface helping me become highly productive with a new tool as quickly as possible. We hope you'll check out pandasql; if you do, please let us know what you think!


Twitter: @YhatHQ

Our Products

Rodeo: a native Python editor built for doing data science on your desktop.

Download it now!

ScienceOps: deploy predictive models in production applications without IT.

Learn More

Yhat (pronounced Y-hat) provides data science solutions that let data scientists deploy and integrate predictive models into applications without IT or custom coding.