The Yhat Blog

machine learning, data science, engineering

7 handy SQL features for data scientists

by yhat

No matter where you work, odds are your company uses some sort of relational database, and nearly all of them use SQL for data retrieval. So it goes without saying that if you want to play with data, you need to learn a little SQL. You might argue that SQL does a lot of the same things you can do in R, Python, or (heaven forbid) Excel. This is true to an extent, but writing the SQL yourself has a few advantages:

  • Performing operations in the database is way more efficient
  • Easily reproducible scripts
  • Closer to the data

Demand for data scientists is booming, but even highly quantitative analysts and PhDs may have limited experience with databases. In making the transition from academia to industry, a little SQL can go a long way. Sometimes just knowing what features are out there is all it takes.

This is a post about awesome things you can do with SQL.

Why Postgres?

There are lots of database flavors: MySQL, PostgreSQL, Oracle, Microsoft SQL Server. Even with all of the variants, the SQL syntax remains relatively constant across all of them. We're using PostgreSQL because it's widely used, has handy analytical features, and it's popular in both data warehousing and production.

  1. Generating queries from a query

    I often need to use data in Postgres to fetch related records from another system such as a CRM, MongoDB, or a website. Basic string concatenation makes it really easy to generate these queries en masse. I've found it to be really useful for spidering and web scraping.

    You can use psycopg2 to query Postgres from Python and then use requests to fetch data from Yelp.
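    For instance, here's a sketch of generating Yelp URLs with plain string concatenation (the businesses table and yelp_id column are stand-in names for your own schema):

    ```sql
    -- Each result row is a URL you can hand off to a scraping script.
    SELECT 'https://www.yelp.com/biz/' || yelp_id AS url
    FROM businesses
    WHERE city = 'Columbus';
    ```

    The same trick works for generating SQL itself: concatenate query text around an id column and the result set is a list of ready-to-run statements.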

  2. Basic date operations

    Dates are always tricky. Despite best efforts, dates never seem to be in the format that you need them in.

    Fear not, Postgres has fantastic date functions to deal with this stuff.

    date_trunc, to_char, and to_timestamp can take you a pretty long way in terms of formatting and type conversion. For extracting particular values from a date (say, the day of the week) you can use date_part.
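    A quick illustration of all four, against an assumed events table with a created_at timestamp column:

    ```sql
    SELECT
      date_trunc('month', created_at)   AS month_start,  -- truncate to the first of the month
      to_char(created_at, 'YYYY-MM-DD') AS pretty_date,  -- timestamp -> formatted text
      date_part('dow', created_at)      AS day_of_week   -- 0 = Sunday ... 6 = Saturday
    FROM events;

    -- to_timestamp goes the other way, parsing text into a timestamp:
    SELECT to_timestamp('2015-03-31 14:00', 'YYYY-MM-DD HH24:MI');
    ```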

  3. Text mining

    When I'm doing text mining I'm often tempted to immediately reach for a scripting language. While a language like Python, Ruby, or R definitely should make a showing in any text mining project, I advocate going as far as you can with SQL. Postgres has great built-in string functions that'll run much faster than any scripting language.

    My favorite Postgres string function is regexp_split_to_table which takes a piece of text, splits it by a pattern, and returns tokens as rows.
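    A minimal sketch of it in action (the ferris_bueller table and dialogue column below are placeholder names):

    ```sql
    -- One row per token:
    SELECT regexp_split_to_table(lower('Bueller... Bueller... Bueller?'), '\s+') AS token;

    -- Combined with GROUP BY, that's a quick word-frequency count:
    SELECT token, count(*) AS freq
    FROM (
      SELECT regexp_split_to_table(lower(dialogue), '\s+') AS token
      FROM ferris_bueller
    ) t
    GROUP BY token
    ORDER BY freq DESC;
    ```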

    See #5 below for loading the Ferris Bueller dataset.

    More string functions

  4. Median aggregate function

    Median is sort of the lost aggregate function in Postgres. It's one of those functions everyone assumes comes standard, but it actually doesn't, for reasons beyond the scope of this post. For many DBAs and DW engineers, it's the first custom function they write on a new Postgres instance. The Postgres wiki has a page with code snippets for a median function; the pure SQL implementation there is the one I use.
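    Once the wiki's median() aggregate is installed, the call reads like any built-in aggregate. And if you're on a newer Postgres (9.4+), the percentile_cont ordered-set aggregate computes a median with no custom code at all. Table and column names here are hypothetical:

    ```sql
    -- Custom aggregate from the Postgres wiki:
    SELECT median(price) FROM listings;

    -- Built-in alternative on Postgres 9.4 and later:
    SELECT percentile_cont(0.5) WITHIN GROUP (ORDER BY price) AS median_price
    FROM listings;
    ```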

  5. \COPY to load data into your database

    Sooner or later you're going to want to load some of that great data you've scraped from Yelp or some Census CSV file into your database. Once you have your flat file ready, cd to the directory and connect to your Postgres instance.

    We're going to load some dialogue from Ferris Bueller's Day Off into our database. You can get the file here. Open up Postgres in the terminal:

    $ psql -d {your database} -U {your username}

    Now execute the \COPY command, specifying the table name, column names (optional), filename, and if you're handling a CSV, then include the CSV argument. This file also has a header row so add an additional argument HEADER.
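    Assuming a pre-created table and a headered CSV in the current directory (table and column names are stand-ins), the command looks like:

    ```sql
    \COPY ferris_bueller (speaker, dialogue) FROM 'ferris_bueller.csv' CSV HEADER
    ```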

    And there you have it, you can now query Ferris Bueller's Day Off!

    Sometimes you'll get just one row with a strange comma or hyphen that breaks your \COPY statement. To skirt around this, create a staging table where every column is varchar. Once the data is in your database, write a query that casts columns to the appropriate data types and deal with the isolated issues at that point.
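    A sketch of that staging pattern, with made-up table and column names:

    ```sql
    -- Land everything as varchar first:
    CREATE TABLE sales_staging (sold_on varchar, amount varchar);
    -- \COPY sales_staging FROM 'sales.csv' CSV HEADER

    -- Then cast into a clean table, filtering rows that won't cast:
    CREATE TABLE sales AS
    SELECT sold_on::date, amount::numeric
    FROM sales_staging
    WHERE amount ~ '^[0-9]+(\.[0-9]+)?$';
    ```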

  6. Generating sequences

    generate_series is a great built-in Postgres function for generating sequences of numbers. You can use it for creating ranges of dates and times, handling time series, building funnels, etc. It also provides an easy way to enumerate over tables and saves you from writing for loops in your SQL code (not fun).
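    For example:

    ```sql
    -- Integers 1 through 5, one per row:
    SELECT generate_series(1, 5);

    -- One row per day in March 2015 (start, stop, step):
    SELECT generate_series('2015-03-01'::date, '2015-03-31'::date, '1 day'::interval) AS day;
    ```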

  7. Assorted things you should know

    Some of these are pretty standard, but I couldn't chance leaving any of these out. I use these features every day.

      • order by random() is extremely helpful when you're just trying to get a feel for a table or database. It does pretty much what it sounds like it does: it returns rows in random order.
      • The double colon :: is shorthand for casting columns to different data types. It works well with dates (and does a pretty good job of guessing what format you're using), and just as well with floats, ints, and other numerics.
      • select into is by far the easiest way I've found to create tables on the fly. It's ruined me for pre-defining tables.
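    A few quick examples (users is a stand-in table name):

    ```sql
    -- Peek at 10 random rows:
    SELECT * FROM users ORDER BY random() LIMIT 10;

    -- :: casts between types:
    SELECT '2015-03-31'::date, '3.14'::float, 10::text;

    -- SELECT INTO creates a new table from a query result:
    SELECT * INTO power_users FROM users WHERE logins > 100;
    ```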


Yhat (pronounced Y-hat) provides data science solutions that let data scientists deploy and integrate predictive models into applications without IT or custom coding.