The Yhat Blog


machine learning, data science, engineering


Basic Interactive Geospatial Analysis in Python

by Piero Ferrante |


About Piero: Piero Ferrante is the Director of Data Science at C2FO. He and his team are focused on optimizing C2FO’s capital markets through applied machine learning and developing contemporary quantitative risk management systems. Piero also enjoys teaching, rowing, and hacking on open data. You can find him on Twitter and LinkedIn.

Intro

Geospatial analysis is a massive field with a rich history. Python has some pretty slick packages for working with geospatial data such as, but not limited to, Shapeley, Fiona, and Descartes.

GeoPandas sits on top of these packages and exposes a familiar Pandas-like API that makes a series of element-wise and aggregation methods (from the base packages) easy to apply to dataframes containing geometry data. From the GeoPandas repo:

"GeoPandas is an open source project to make working with geospatial data in python easier. GeoPandas extends the datatypes used by pandas to allow spatial operations on geometric types. Geometric operations are performed by shapely. Geopandas further depends on fiona for file access and descartes and matplotlib for plotting.

GeoPandas geometry operations are cartesian. The coordinate reference system (crs) can be stored as an attribute on an object, and is automatically set when loading from a file. Objects may be transformed to new coordinate systems with the to_crs() method. There is currently no enforcement of like coordinates for operations, but that may change in the future."

Note The image above has no association with the GeoPandas project, although it would be pretty cool if it did. It's called the 'Geometric Hipster Panda' and was found here.

We'll run through a basic example analysis detailing the following:

  • How to create a GeoDataFrame from a shape file
  • Plotting GeoDataFrames
  • A couple helpful operations for performing faster point-in-polygon calculations
  • Simulating spatial, temporal user data
  • And a few nifty Jupyter widgets, magic functions, and add-ins along the way

Setup

# basic stuff
import os
import pandas as pd
import numpy as np
from random import randint, uniform
from datetime import datetime
from urllib import urlretrieve
from zipfile import ZipFile

# geo stuff
import geopandas as gpd
from shapely.geometry import Point
# from ipyleaflet import (Map,
#     Marker,
#     TileLayer, ImageOverlay,
#     Polyline, Polygon, Rectangle, Circle, CircleMarker,
#     GeoJSON,
#     DrawControl
# )

# plotting stuff
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('bmh')
plt.rcParams['figure.figsize'] = (10.0, 10.0)

# widget stuff
from ipywidgets import interact, HTML, FloatSlider
from IPython.display import clear_output, display

# progress stuff
from tqdm import tqdm_notebook, tqdm_pandas
%load_ext autotime

# turn warnings off
import warnings
warnings.filterwarnings('ignore')

Helper function for retrieving NYC borough shape file

def get_nyc_shape_file(url, filename):

    # download file
    zipped = filename + '.zip'
    urlretrieve('https://data.cityofnewyork.us/api/geospatial/tqmj-j8zm?method=export&format=Shapefile', zipped)
    zipped = os.getcwd() + '/' + zipped

    # unzip file
    to_unzip = ZipFile(zipped, 'r')
    unzipped = os.getcwd() + '/' + filename + '_unzipped'
    to_unzip.extractall(unzipped)
    to_unzip.close()

    # get shape file
    for file in os.listdir(unzipped):
        if file.endswith(".shp"):
            shape_file =  unzipped + '/' + file

    # return full file path
    return shape_file


# get shape file path
shape_file_url = 'https://data.cityofnewyork.us/api/geospatial/tqmj-j8zm?method=export&format=Shapefile'
shape_file_dir = 'nyc_boroughs'
file_path = get_nyc_shape_file(shape_file_url,shape_file_dir)

Creating GeoDataFrames is simple

# read and view GeoDataFrame
gdf = gpd.GeoDataFrame.from_file(file_path)
gdf.head()
boro_code boro_name geometry shape_area shape_leng
0 5.0 Staten Island (POLYGON ((-74.05050806403247 40.5664220341608... 1.623820e+09 330470.010332
1 4.0 Queens (POLYGON ((-73.83668274106707 40.5949466970158... 3.045213e+09 896344.047763
2 3.0 Brooklyn (POLYGON ((-73.86706149472118 40.5820879767934... 1.937479e+09 741080.523166
3 1.0 Manhattan (POLYGON ((-74.01092841268031 40.6844914725429... 6.364715e+08 359299.096471
4 2.0 Bronx (POLYGON ((-73.89680883223774 40.7958084451597... 1.186925e+09 464392.991824

Plotting too!

# plot GeoDataFrame
gdf.plot()

time: 1.16 s

Now we'll use some of GeoPandas powerful methods to simplify our geometry (I'll explain why we're doing this later)

# create convex hulls
hulls = gdf['geometry'].convex_hull

# plot overlay
hulls.plot(ax=gdf.plot())

time: 1.35 s

# create envelopes
envelope = gpd.GeoSeries(hulls.envelope)

# plot overlay
envelope.plot(ax=gdf.plot())

time: 1.2 s

Cool! Now let's simulate some user data...

We're also using tqdm's integrated Jupyter functionality to update a progress var.

Note: you may need to enable the Jupyter Javascript widget, which you can do by running: jupyter nbextension enable --py --sys-prefix widgetsnbextension

def sim_users(n, p, f):

    # create datetime range
    today = datetime.today().strftime("%m/%d/%Y")
    rng = pd.date_range(today, periods=p, freq=f)

    # get min/max coorindates
    min_x, min_y = gdf['geometry'].bounds.ix[:,:2].min()
    max_x, max_y = gdf['geometry'].bounds.ix[:,2:].max()

    # iterate over time datetime range and create user list
    sim_user_list = []
    for ts in tqdm_notebook(rng, desc='Progress', leave=True):
        for j in xrange(n):
            x = uniform(min_x, max_x)
            y = uniform(min_y, max_y)
            point = Point(x, y)
            gender = randint(0, 1)
            sim_user_list.append([ts, x, y, point, gender])

    # return dataframe
    sim_user_df = pd.DataFrame(sim_user_list, columns=['datetime', 'x', 'y', 'point', 'gender'])
    return sim_user_df

# simulate user data
sim_data = sim_users(100, 24, 'H')
sim_data.head()
datetime x y point gender
0 2016-06-30 -73.880440 40.872977 POINT (-73.88044042881104 40.8729770935262) 0
1 2016-06-30 -73.970945 40.896804 POINT (-73.97094522264351 40.89680428203247) 1
2 2016-06-30 -74.163396 40.517064 POINT (-74.16339558536102 40.51706409984719) 1
3 2016-06-30 -73.984582 40.731043 POINT (-73.98458153634843 40.73104254398787) 1
4 2016-06-30 -73.700708 40.613141 POINT (-73.70070797457826 40.61314107591482) 0

time: 178 ms

And plot...

# plot simulated data points
gdf.plot()
plt.scatter(x=sim_data['x'], y=sim_data['y'], alpha=0.5, c='r')

time: 1.13 s

Now suppose we wanted to see which of those user points fell in different NYC boroughs. A simple solution would be to use a point-in-polygon (PIP) calculation to determine if a given point is contained within a given geometry (like Queens). This can become pretty computationally expensive if the geometry in question is complex, which most are. A reasonable solution, if absolute precision is not a requirement, would be to simplify the aforementioned geometry and in turn simplify the PIP calculation.

Let's compare the runtime for calculating a PIP boolean value for any of the 5 boroughs using the actual geometry, convex hulls, and envelopes with the user data that we just simulated.

Note: I should point out that point-in-polygon indexing will not scale well. This is where geohashes and quadtrees are of tremendous use, but are also outside the scope of this blog post.

# check if point(s) fall within known geometry - actual
sim_data['contains_1'] = sim_data['point'].map(lambda x: True if gdf.contains(x).any()==True else False)

time: 53.8 s

# check if point(s) fall within known geometry - convex hulls
sim_data['contains_2'] = sim_data['point'].map(lambda x: True if hulls.contains(x).any()==True else False)

time: 551 ms

# check if point(s) fall within known geometry - envelopes
sim_data['contains_3'] = sim_data['point'].map(lambda x: True if envelope.contains(x).any()==True else False)

time: 326 ms

That's about a 90x and 160x speed up respectively! It's clear that greater precision comes at a significantly higher computational cost.

Create a widget to visualize users moving over time

def make_plot(hour=1):

    # filter dataframe
    temp = sim_data[sim_data['contains_2']==True]
    temp=temp[temp['datetime'].dt.hour==hour]

    # plot
    hulls.plot(ax=gdf.plot())
    plt.scatter(x=temp['x'], y=temp['y'], s=30)

# create widget
interact(make_plot, hour=(1, 23, 1))

That's all folks! Hope this was helpful intro to doing basic geospatial analysis in Python.

Other helpful links:

  • PyData NYC conference video from GeoPandas creator Kelsey Jordahl
  • ipyleaflet - A Jupyter / Leaflet bridge enabling interactive maps in the Jupyter notebook


Our Products


Rodeo: a native Python editor built for doing data science on your desktop.

Download it now!

ScienceOps: deploy predictive models in production applications without IT.

Learn More

Yhat (pronounced Y-hat) provides data science solutions that let data scientists deploy and integrate predictive models into applications without IT or custom coding.