The Yhat Blog


machine learning, data science, engineering


Introducing db.py

by Greg Lamp |


Data analysis libraries for Python keep getting better and better. pandas is now on 0.15, scikit-learn continues to pick up converts, and a number of visualization libraries have started to emerge: for example, our own baby, ggplot, and others like seaborn. But there's still one subject that's practically synonymous with data analysis where there aren't any new, killer libraries: databases.

There are some great Python database libraries (SQLAlchemy instantly comes to mind), but none are focused on using Python as a means of exploring and analyzing data.

Introducing db.py

So today we're introducing db.py, a library built for doing exactly that! db.py does handy stuff like:

  • Manages/remembers your db connection strings
  • Lets you easily search through your schema
  • Allows for tab-completion for table and column names
  • Integrates with IPython, pandas, and is built for interactive programming
  • Works with most relational databases

To get started, just run:

$ pip install db.py

Connecting to your Database

First things first: you need to connect to your database. This can be much tricker than it should be. Dealing with connection strings is never fun--how many times have you forgotten your password (or even your username) for a database. db.py makes it as straight forward as possible and will always revert to "reasonable defaults". For example, the default MySQL port is 3306. If you forget to specify a port, db.py will use 3306 as a default.

>>> from db import DB
>>> db = DB(username="kermit", password="rainbowconnection", hostname="db.themuppets.com",
            dbname="muppetdb", dbtype="mysql")
Refreshing schema. Please wait...done!
>>> db
DB[mysql][db.themuppets.com]:3306 > kermit@muppetdb

db.py will also save your credentials so you don't have to paste them into every script. Now you'll never have to remember any of that connection stuff again.

>>> db.save_credentials(profile="muppets")

Open a new session, load your profile, and voilà!

>>> from db import DB
>>> db = DB(profile="muppets")
Refreshing schema. Please wait...done!
>>> db
DB[mysql][db.themuppets.com]:3306 > kermit@muppetdb

And by database we mean dotfile.

Exploring Tables and Columns

Ever forget the name of a column? Or even forget the names of what table you're trying to query? Me too. db.py comes with a set of functions for navigating your database's schema.

NOTE: For this post we're going to use the Chinook Database. It ships with db.py and is commonly used as an example database.

>>> from db import DemoDB # this is the Chinook DB and is used in the db.py docs/examples
>>> db = DemoDB()
Indexing schema. This will take a second...finished!
>>> db.find_table("A*")
+--------+--------------------------+
| Table  | Columns                  |
+--------+--------------------------+
| Album  | AlbumId, Title, ArtistId |
| Artist | ArtistId, Name           |
+--------+--------------------------+

Ever forget how to join your company's prod_users, prod_trans_monthly, and prod_accounts tables together? (a better question might be: have you ever remembered?)

No worries, db.py has got you covered!

>>> db.tables.Artist
+----------------------------------------------------------+
|                          Artist                          |
+----------+---------------+--------------+----------------+
| Column   | Type          | Foreign Keys | Reference Keys |
+----------+---------------+--------------+----------------+
| ArtistId | INTEGER       |              | Album.ArtistId |
| Name     | NVARCHAR(120) |              |                |
+----------+---------------+--------------+----------------+
>>> db.find_column("*Id")
+---------------+---------------+---------+
| Table         |  Column Name  | Type    |
+---------------+---------------+---------+
| Album         |    AlbumId    | INTEGER |
| Album         |    ArtistId   | INTEGER |
| Artist        |    ArtistId   | INTEGER |
| Customer      |  SupportRepId | INTEGER |
| Customer      |   CustomerId  | INTEGER |
| Employee      |   EmployeeId  | INTEGER |
| Genre         |    GenreId    | INTEGER |
| Invoice       |   InvoiceId   | INTEGER |
| Invoice       |   CustomerId  | INTEGER |
| InvoiceLine   |   InvoiceId   | INTEGER |
| InvoiceLine   |    TrackId    | INTEGER |
| InvoiceLine   | InvoiceLineId | INTEGER |
| MediaType     |  MediaTypeId  | INTEGER |
| Playlist      |   PlaylistId  | INTEGER |
| PlaylistTrack |    TrackId    | INTEGER |
| PlaylistTrack |   PlaylistId  | INTEGER |
| Track         |  MediaTypeId  | INTEGER |
| Track         |    TrackId    | INTEGER |
| Track         |    AlbumId    | INTEGER |
| Track         |    GenreId    | INTEGER |
+---------------+---------------+---------+

Tab Completion

db.py also identifies and loads schema info about all of the tables in your database and makes them attributes of the DB(). It works with IPython's type-ahead functionality, so when you're mucking around in IPython Notebook or in Terminal, you'll be able to see all of the tables and columns in your database.

You can also take a look at all of your tables by accessing the tables variable.

>>> db.tables
+---------------+----------------------------------------------------------------------------------+
| Table         | Columns                                                                          |
+---------------+----------------------------------------------------------------------------------+
| Album         | AlbumId, Title, ArtistId                                                         |
| Artist        | ArtistId, Name                                                                   |
| Customer      | CustomerId, FirstName, LastName, Company, Address, City, State, Country, PostalC |
|               | ode, Phone, Fax, Email, SupportRepId                                             |
| Employee      | EmployeeId, LastName, FirstName, Title, ReportsTo, BirthDate, HireDate, Address, |
|               |  City, State, Country, PostalCode, Phone, Fax, Email                             |
| Genre         | GenreId, Name                                                                    |
| Invoice       | InvoiceId, CustomerId, InvoiceDate, BillingAddress, BillingCity, BillingState, B |
|               | illingCountry, BillingPostalCode, Total                                          |
| InvoiceLine   | InvoiceLineId, InvoiceId, TrackId, UnitPrice, Quantity                           |
| MediaType     | MediaTypeId, Name                                                                |
| Playlist      | PlaylistId, Name                                                                 |
| PlaylistTrack | PlaylistId, TrackId                                                              |
| Track         | TrackId, Name, AlbumId, MediaTypeId, GenreId, Composer, Milliseconds, Bytes, Uni |
|               | tPrice                                                                           |
+---------------+----------------------------------------------------------------------------------+

IPython and pandas

db.py is built with popular Scientific Python tools in mind--namely pandas and IPython. We've tried to make it as friendly as possible for those of you doing interactive computing: this means friendly repr's!

>>> db.tables.Artist

Column Type Foreign Keys Reference Keys
ArtistId INTEGER Album.ArtistId
Name NVARCHAR(120)

>>> db.tables.Track

Column Type Foreign Keys Reference Keys
TrackId INTEGER InvoiceLine.TrackId, PlaylistTrack.TrackId
Name NVARCHAR(200)
AlbumId INTEGER Album.AlbumId
MediaTypeId INTEGER MediaType.MediaTypeId
GenreId INTEGER Genre.GenreId
Composer NVARCHAR(220)
Milliseconds INTEGER
Bytes INTEGER
UnitPrice NUMERIC(10,2)

>>> df = db.tables.Artist.all()
>>> df.head()

ArtistId Name
0 1 AC/DC
1 2 Accept
2 3 Aerosmith
3 4 Alanis Morissette
4 5 Alice In Chains

Querying data

Executing queries in libraries like psycopg2 can be frustrating. Managing connections, cursors, and rollbacks isn't exactly intuitive. db.py operates on a much higher level, and makes executing queries simple.

>>> db.query("select * from Artist limit 10;")
   ArtistId                  Name
0         1                 AC/DC
1         2                Accept
2         3             Aerosmith
3         4     Alanis Morissette
4         5       Alice In Chains
5         6  Antônio Carlos Jobim
6         7          Apocalyptica
7         8            Audioslave
8         9              BackBeat
9        10          Billy Cobham
>>> with open("myscript.sql", "w") as f:
...     f.write("select * from Artist;")
>>> df = db.query_from_file("myscript.sql")

Managing Credentials

You know those really obscure database hostnames and passwords that your sys admin sets up? Well instead of copying and pasting them into all of your scripts, db.py lets you save credentials so you can use them later.

>>> db = DB(username="kermit", password="rainbowconnection", hostname="dw.themuppets.com", dbtype="mysql")
>>> db.save_credentials(profile="muppetdw")
# in a new session
>>> db = DB(profile="muppetdw") # voila!

Hope you like it

We're still working on support for MS SQL. If you have ideas for features or are interested in getting the latest updates, check out the db.py github page.

Special thanks to Eli Bressert from StichFix for giving us the idea for db.py!



Our Products


Rodeo: a native Python editor built for doing data science on your desktop.

Download it now!

ScienceOps: deploy predictive models in production applications without IT.

Learn More

Yhat (pronounced Y-hat) provides data science solutions that let data scientists deploy and integrate predictive models into applications without IT or custom coding.