About Piero: Piero Ferrante is the Director of Data Science at C2FO. He and his team are focused on optimizing C2FO’s capital markets through applied machine learning and developing contemporary quantitative risk management systems. Piero also enjoys teaching, rowing, and hacking on open data. You can find him on Twitter and LinkedIn.
Geospatial analysis is a massive field with a rich history. Python has some pretty slick packages for working with geospatial data, such as Shapely, Fiona, and Descartes (among others).
GeoPandas sits on top of these packages and exposes a familiar Pandas-like API that makes a series of element-wise and aggregation methods (from the base packages) easy to apply to dataframes containing geometry data. From the GeoPandas repo:
"GeoPandas is an open source project to make working with geospatial data in python easier. GeoPandas extends the datatypes used by pandas to allow spatial operations on geometric types. Geometric operations are performed by shapely. Geopandas further depends on fiona for file access and descartes and matplotlib for plotting.
GeoPandas geometry operations are cartesian. The coordinate reference system (crs) can be stored as an attribute on an object, and is automatically set when loading from a file. Objects may be transformed to new coordinate systems with the to_crs() method. There is currently no enforcement of like coordinates for operations, but that may change in the future."
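That "cartesian" caveat matters in practice: if your coordinates are longitude/latitude, distances computed directly on them come out in degrees, not meters. A pure-Python sketch of the difference, comparing a naive cartesian distance against the haversine great-circle distance (the function names and the sample coordinates are mine, for illustration only):

```python
import math

def cartesian_dist(p1, p2):
    # straight-line distance in coordinate units (degrees for lon/lat)
    return math.hypot(p2[0] - p1[0], p2[1] - p1[1])

def haversine_km(p1, p2):
    # great-circle distance in kilometers between (lon, lat) pairs
    lon1, lat1, lon2, lat2 = map(math.radians, (*p1, *p2))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * math.asin(math.sqrt(a))

manhattan = (-73.9712, 40.7831)
queens = (-73.7949, 40.7282)

print(cartesian_dist(manhattan, queens))  # in degrees -- not directly meaningful
print(haversine_km(manhattan, queens))    # roughly 16 km
```

For a city-scale analysis like the one below, operating in degrees is fine since we only care about containment, not distance; for distance math you would reproject with `to_crs()` first.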
We'll run through a basic example analysis detailing the following:
- How to create a GeoDataFrame from a shape file
- Plotting GeoDataFrames
- A couple helpful operations for performing faster point-in-polygon calculations
- Simulating spatial, temporal user data
- And a few nifty Jupyter widgets, magic functions, and add-ins along the way
```python
# basic stuff
import os
import pandas as pd
import numpy as np
from random import randint, uniform
from datetime import datetime
from urllib.request import urlretrieve
from zipfile import ZipFile

# geo stuff
import geopandas as gpd
from shapely.geometry import Point
# from ipyleaflet import (Map,
#                         Marker,
#                         TileLayer, ImageOverlay,
#                         Polyline, Polygon, Rectangle, Circle, CircleMarker,
#                         GeoJSON,
#                         DrawControl
#                         )

# plotting stuff
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('bmh')
plt.rcParams['figure.figsize'] = (10.0, 10.0)

# widget stuff
from ipywidgets import interact, HTML, FloatSlider
from IPython.display import clear_output, display

# progress stuff
from tqdm import tqdm_notebook
%load_ext autotime

# turn warnings off
import warnings
warnings.filterwarnings('ignore')
```
Helper function for retrieving NYC borough shape file
```python
def get_nyc_shape_file(url, filename):
    # download file
    zipped = filename + '.zip'
    urlretrieve(url, zipped)
    zipped = os.getcwd() + '/' + zipped

    # unzip file
    to_unzip = ZipFile(zipped, 'r')
    unzipped = os.getcwd() + '/' + filename + '_unzipped'
    to_unzip.extractall(unzipped)
    to_unzip.close()

    # get shape file
    for file in os.listdir(unzipped):
        if file.endswith('.shp'):
            shape_file = unzipped + '/' + file

    # return full file path
    return shape_file
```
```python
# get shape file path
shape_file_url = 'https://data.cityofnewyork.us/api/geospatial/tqmj-j8zm?method=export&format=Shapefile'
shape_file_dir = 'nyc_boroughs'
file_path = get_nyc_shape_file(shape_file_url, shape_file_dir)
```
Creating GeoDataFrames is simple
```python
# read and view GeoDataFrame
gdf = gpd.GeoDataFrame.from_file(file_path)
gdf.head()
```
| | boro_code | boro_name | geometry | shape_area | shape_leng |
|---|---|---|---|---|---|
| 0 | 5.0 | Staten Island | (POLYGON ((-74.05050806403247 40.5664220341608... | 1.623820e+09 | 330470.010332 |
| 1 | 4.0 | Queens | (POLYGON ((-73.83668274106707 40.5949466970158... | 3.045213e+09 | 896344.047763 |
| 2 | 3.0 | Brooklyn | (POLYGON ((-73.86706149472118 40.5820879767934... | 1.937479e+09 | 741080.523166 |
| 3 | 1.0 | Manhattan | (POLYGON ((-74.01092841268031 40.6844914725429... | 6.364715e+08 | 359299.096471 |
| 4 | 2.0 | Bronx | (POLYGON ((-73.89680883223774 40.7958084451597... | 1.186925e+09 | 464392.991824 |
```python
# plot GeoDataFrame
gdf.plot()
```
time: 1.16 s
Now we'll use some of GeoPandas' powerful methods to simplify our geometry (I'll explain why we're doing this later).
```python
# create convex hulls
hulls = gdf['geometry'].convex_hull

# plot overlay
hulls.plot(ax=gdf.plot())
```
time: 1.35 s
```python
# create envelopes
envelope = gpd.GeoSeries(hulls.envelope)

# plot overlay
envelope.plot(ax=gdf.plot())
```
time: 1.2 s
Cool! Now let's simulate some user data...
We're also using tqdm's integrated Jupyter functionality to update a progress bar. If it doesn't render, make sure the widgets extension is enabled:
```
jupyter nbextension enable --py --sys-prefix widgetsnbextension
```
```python
def sim_users(n, p, f):
    # create datetime range
    today = datetime.today().strftime('%m/%d/%Y')
    rng = pd.date_range(today, periods=p, freq=f)

    # get min/max coordinates
    min_x, min_y = gdf['geometry'].bounds.iloc[:, :2].min()
    max_x, max_y = gdf['geometry'].bounds.iloc[:, 2:].max()

    # iterate over datetime range and create user list
    sim_user_list = []
    for ts in tqdm_notebook(rng, desc='Progress', leave=True):
        for j in range(n):
            x = uniform(min_x, max_x)
            y = uniform(min_y, max_y)
            point = Point(x, y)
            gender = randint(0, 1)
            sim_user_list.append([ts, x, y, point, gender])

    # return dataframe
    sim_user_df = pd.DataFrame(sim_user_list,
                               columns=['datetime', 'x', 'y', 'point', 'gender'])
    return sim_user_df

# simulate user data
sim_data = sim_users(100, 24, 'H')
sim_data.head()
```
| | datetime | x | y | point | gender |
|---|---|---|---|---|---|
| 0 | 2016-06-30 | -73.880440 | 40.872977 | POINT (-73.88044042881104 40.8729770935262) | 0 |
| 1 | 2016-06-30 | -73.970945 | 40.896804 | POINT (-73.97094522264351 40.89680428203247) | 1 |
| 2 | 2016-06-30 | -74.163396 | 40.517064 | POINT (-74.16339558536102 40.51706409984719) | 1 |
| 3 | 2016-06-30 | -73.984582 | 40.731043 | POINT (-73.98458153634843 40.73104254398787) | 1 |
| 4 | 2016-06-30 | -73.700708 | 40.613141 | POINT (-73.70070797457826 40.61314107591482) | 0 |
time: 178 ms
```python
# plot simulated data points
gdf.plot()
plt.scatter(x=sim_data['x'], y=sim_data['y'], alpha=0.5, c='r')
```
time: 1.13 s
Now suppose we wanted to see which of those user points fell in different NYC boroughs. A simple solution would be to use a point-in-polygon (PIP) calculation to determine if a given point is contained within a given geometry (like Queens). This can become pretty computationally expensive if the geometry in question is complex, which most are. A reasonable solution, if absolute precision is not a requirement, would be to simplify the aforementioned geometry and in turn simplify the PIP calculation.
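To make "point-in-polygon" concrete, here is a minimal ray-casting implementation in pure Python. This is purely illustrative; GeoPandas' `contains()` delegates to Shapely's optimized GEOS routines rather than anything like this. Note that the cost grows with the number of vertices, which is exactly why simplifying the geometry helps:

```python
def point_in_polygon(x, y, polygon):
    """Ray casting: cast a horizontal ray from (x, y) to the right and
    count edge crossings; an odd count means the point is inside.

    polygon is a list of (x, y) vertex tuples.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # does this edge straddle the ray's y, to the right of x?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x_cross > x:
                inside = not inside
    return inside

square = [(0, 0), (4, 0), (4, 4), (0, 4)]
print(point_in_polygon(2, 2, square))   # True
print(point_in_polygon(5, 2, square))   # False
```

Every point tested has to walk every edge, so a borough outline with thousands of vertices costs thousands of times more per point than its four-vertex envelope.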
Let's compare the runtime for calculating a PIP boolean value for any of the 5 boroughs using the actual geometry, convex hulls, and envelopes with the user data that we just simulated.
Note: I should point out that point-in-polygon indexing will not scale well. This is where geohashes and quadtrees are of tremendous use, but they are outside the scope of this blog post.
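The core idea behind those indexing structures can still be sketched in a few lines: bucket geometries by coarse grid cell so each point only gets tested against the handful of candidates in its own cell, instead of against everything. The helper names and toy bounding boxes below are mine, for illustration only:

```python
from collections import defaultdict

def cell(x, y, size=1.0):
    # coarse grid cell id -- analogous in spirit to a short geohash prefix
    return (int(x // size), int(y // size))

def build_index(boxes, size=1.0):
    """Map each grid cell to the bounding boxes that touch it."""
    index = defaultdict(list)
    for name, (minx, miny, maxx, maxy) in boxes.items():
        for i in range(int(minx // size), int(maxx // size) + 1):
            for j in range(int(miny // size), int(maxy // size) + 1):
                index[(i, j)].append(name)
    return index

boxes = {'A': (0, 0, 2, 2), 'B': (5, 5, 7, 7)}
index = build_index(boxes)

# a point at (1.5, 0.5) only needs an exact PIP test against 'A'
print(index[cell(1.5, 0.5)])  # ['A']
```

With an index like this, lookup cost per point stays roughly constant as the number of geometries grows, which is what the brute-force `.map()` approach below lacks.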
```python
# check if point(s) fall within known geometry - actual
sim_data['contains_1'] = sim_data['point'].map(lambda x: gdf.contains(x).any())
```
time: 53.8 s
```python
# check if point(s) fall within known geometry - convex hulls
sim_data['contains_2'] = sim_data['point'].map(lambda x: hulls.contains(x).any())
```
time: 551 ms
```python
# check if point(s) fall within known geometry - envelopes
sim_data['contains_3'] = sim_data['point'].map(lambda x: envelope.contains(x).any())
```
time: 326 ms
That's roughly a 98x and 165x speedup, respectively! It's clear that greater precision comes at a significantly higher computational cost.
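The intuition behind the envelope speedup is easy to see in pure Python: a bounding-box containment test is just four comparisons, regardless of how complex the underlying shape is. A common pattern is to chain the two, using the cheap box test to reject most points and running the exact test only on survivors. The helper names and the rough NYC extent below are mine, for illustration:

```python
def in_bbox(point, bbox):
    # four comparisons -- constant cost, regardless of polygon complexity
    x, y = point
    minx, miny, maxx, maxy = bbox
    return minx <= x <= maxx and miny <= y <= maxy

def pip_candidates(points, bbox):
    """Cheap first pass: only points inside the bounding box need the
    expensive exact point-in-polygon test."""
    return [p for p in points if in_bbox(p, bbox)]

nyc_bbox = (-74.26, 40.49, -73.70, 40.92)  # approximate lon/lat extent of NYC
points = [(-73.97, 40.78), (-72.00, 41.00)]

print(pip_candidates(points, nyc_bbox))  # [(-73.97, 40.78)]
```

Since our simulated points are drawn uniformly over the city's bounding box, many of them fall outside every borough, which is exactly where this cheap rejection pays off.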
Create a widget to visualize users moving over time
```python
def make_plot(hour=1):
    # filter dataframe
    temp = sim_data[sim_data['contains_2'] == True]
    temp = temp[temp['datetime'].dt.hour == hour]

    # plot
    hulls.plot(ax=gdf.plot())
    plt.scatter(x=temp['x'], y=temp['y'], s=30)

# create widget
interact(make_plot, hour=(1, 23, 1))
```
That's all, folks! I hope this was a helpful intro to doing basic geospatial analysis in Python.
Other helpful links: