The Yhat Blog


machine learning, data science, engineering


Sharks, Landsharks, Geoplotting, and KDTrees!

by Colin Ristig |


It's been somewhat of a sharky summer these past 3 months: there were a smattering of attacks up and down the east coast of the US, professional surfer Mick Fanning, had a close call at a competition in South Africa, and several beaches were closed in California after a Great White took a bite out of a surfboard! So now with the end of summer officially here (at least in the northern hemisphere), we thought it would be interesting to dig into some shark attack data. In this post, we'll look through the Global Shark Attack File, checkout some of the characteristics of shark attacks and then dive in to some geo-plotting with Matplotlib Basemap.

The Github repo with relevant files for this post is here

Watch out for Landsharks!

The Data

Let's get a quick feel for our data. After downloading the Excel file from the ISAF, available here we can read in the file and start looking around. Some important things to note:

  • There are 5759 events in the dataset
  • Attacks date back to 1845
  • Attack Types and counts are:
    • Unprovoked - 4239
    • Provoked - 521
    • Invalid - 489
    • Boating - 291
    • Sea Disaster - 219
  • 1527 Fatal Attacks
  • The database is actively maintained by researchers at the ISAF
import pandas as pd
import numpy as np
import re

# read in xls as a dataframe
df = pd.read_excel('GSAF5.xls')

# clean our columns
df['Activity'] = df['Activity'].str.lower()
df['Activity'] = df['Activity'].str.replace('-','')
df['Species '] = df['Species '].str.lower()
df['Country'] = df['Country'].str.upper()

df.rename(columns={'Fatal (Y/N)':'Fatal','Species ':'Species','Sex ':'Sex'}, inplace=True)
df.drop(['Case Number.1', 'original order', 'Unnamed: 21', 'Unnamed: 22','pdf', 'href formula', 'href'],inplace=True, axis=1)

df.head()
Case Number Date Year Type Country Area Location Activity Name Sex Age Injury Fatal Time Species Investigator or Source
0 ND.0001 1845-1853 0 Unprovoked CEYLON (SRI LANKA) Eastern Province Below the English fort, Trincomalee swimming male M 15 FATAL. "Shark bit him in half... Y NaN NaN S.W. Baker
1 ND.0002 1883-1889 0 Unprovoked PANAMA NaN Panama Bay 8ºN, 79ºW NaN Jules Patterson M NaN FATAL Y NaN NaN The Sun, 10/20/1938
2 ND.0003 1900-1905 0 Unprovoked USA North Carolina Ocracoke Inlet swimming Coast Guard personnel M NaN FATAL Y NaN NaN F. Schwartz, p.23; C. Creswell, GSAF
3 ND.0004 Before 1903 0 Unprovoked AUSTRALIA Western Australia NaN pearl diving Ahmun M NaN FATAL Y NaN NaN H. Taunton; N. Bartlett, pp. 233-234
4 ND.0005 Before 1903 0 Unprovoked AUSTRALIA Western Australia Roebuck Bay diving male M NaN FATAL Y NaN NaN H. Taunton; N. Bartlett, p. 234

Before we start getting into our analyses, we'll need to do a bit of cleaning...

# clean our Fatal column up
def clean_Fatal(x):
    if x == 'Y':
        return True
    elif x == 'UNKNOWN':
        return ''
    else:
        return False

df.Fatal = df.Fatal.map(clean_Fatal)

# these functions are included in the github repo
df['year'] = df['Date'].map(year)
df['month'] = df['Date'].map(month)
df['day'] = df['Date'].map(day)

Attacks by Activity

The activity column in this dataset is both useful but difficult. Below is a smattering of attack descriptions:

    df.Activity.value_counts()

    surfing              854
    swimming             779
    fishing              383
    spearfishing         303
    bathing              152
    wading               134
    diving               120
    standing              96
    snorkeling            74
    scuba diving          74
    body boarding         54
    body surfing          47
    swimming              47
    boogie boarding       42
    treading water        32
    pearl diving          32
    ....
    spearfishing           7
    kayak fishing          7
    kite surfing           6
    fishing for mackerel   6
    spearfishing on scuba  6
    murder                 6

Tragically, six people have been killed by sharks by being thrown or forced overboard:

df[(df.Activity=='murder')][['Date','Country','Activity','Injury']]
Date Country Activity Injury
146 Reported 1776 GUINEA murder FATAL
546 Reported 20-Oct-1893 INDIA murder FATAL
3254 17-Mar-1984 SOMALIA murder Forced at gunpoint to jump overboard. Presume...
4720 Oct-2006 GULF OF ADEN murder FATAL, beaten & thrown overboard by smugglers...
4752 22-Mar-2007 YEMEN murder FATAL, beaten & thrown overboard by smugglers...
4780 Jul-2007 SENEGAL murder NaN

While the granularity of the Activity column is good to have, it makes it fairly difficult to group attacks by type. For example, let's say we decided that "diving" was a distinct activity type. Below we see that there are 194 different "Activities" that contain the word diving! From here, we'd need to decide whether the various types of diving are different, i.e. is "diving for Bêche-de-mer" vs "diving for trochus" different, or somehow related to the frequency in which sharks attack divers, and should therefore be considered a separate "activity".

Trochus vs. some nice, dried Bêche-de-mer (or Sea Cucumbers for our non-French readers)

    len(df[df['Activity'].str.contains('diving')==True]['Activity'].value_counts())

    194

    diving                                                                 120
    scuba diving                                                            74
    pearl diving                                                            32
    free diving                                                             26
    freediving                                                              12
    ...
    diving for trochus                                                       9
    free diving / modeling                                                   1
    pearl diving from lugger whyalla                                         1
    diving from the lugger san, operated by the protector of the aborigines  1
    diving for bechedemer                                                    1

To better group the more common activity types, I've created a function to handle for specific cases: its messy and long, so I've included it in the repo.

But now that we have usable activity types, we can make some plots!

# create a list of top 10 Activities where the attack type is 'Unprovoked'
top_activities = df[df.Type=='Unprovoked'].Activity.value_counts().index.tolist()[:10]
df_top_activities = df[df.Activity.isin(top_activities) & (df.year >1950) & (df.Type=='Unprovoked')].dropna(axis=0,subset=['year'])
df_top_activities.groupby(['Activity','year']).count().reset_index()

# because we have years without attacks, we need to fill them with 0's
years = range(1950, 2016)
activities = df_top_activities.Activity.unique()

import itertools

# http://stackoverflow.com/questions/12130883/r-expand-grid-function-in-python
def expandgrid(*itrs):
   product = list(itertools.product(*itrs))
   return {'Var{}'.format(i+1):[x[i] for x in product] for i in range(len(itrs))}

all_combos = expandgrid(activities, years)
all_combos = pd.DataFrame(all_combos)
all_combos.columns = ["Activity", "year"]
all_years = pd.merge(all_combos, df_top_activities, on=["year", "Activity"], how='left')
all_years = all_years.groupby(['Activity','year']).count().reset_index()

from ggplot import *

ggplot(aes(x='year', y='Case Number', color='Activity'),all_years) + geom_line() + ylab("Number of Attacks") +\
    xlab("Year") + ggtitle("Shark Attacks by Activity Type")

The most significant trend we can see here is the increase in the number of attacks on surfers over the past 65 years. This is likely due to the increase in the popularity in surfing in places such as Australia, South Africa, and the USA, where sharks such as Great Whites and Bull Sharks are fairly common.

We can also filter this data to look just at fatal attacks. Based on the two charts, we see that despite rate of attacks on surfers and swimmers increasing over the past 65 years, fatal attacks have not increased at the same rate.

Where do shark attacks occur?

The ocean...duh brah!...but let's get a bit more specific. As you'll note from the table above, we have columns indicating the Country, Area and Location of the attacks, but no numerical or geo-spacial data to plot attacks with. Thankfully, the python package GeoPy has some great features for accessing API's such as the Google Maps API to pull in coordinates of points. We'll use the text descriptions of the attack location in combination with the API to get the longitude and latitude.

import time
from geopy.geocoders import GoogleV3
from geopy.geocoders import Nominatim
geolocator = GoogleV3()
geolocator2 = Nominatim()

As a quick example, I'll query the coordinates of the Yhat office:

geolocator.geocode('66 w 39th street new york city')

Location(66 West 39th Street, New York, NY 10018, USA, (40.7526707, -73.98517, 0.0))

Bam! Fast, super easy and accurate! Scaling this concept out, I've created a function to take the text descriptions of the attack locations, query the API for the coordinates and then write the results back to a latitude and longitude column. I've used two geocoders here in case of a timeout or some sort of error. This way, we should get a coordinates of each event.

# plotting unprovoked attacks in the USA post 1900
df2 = df[(df.Country=='USA') & (df.year>1900) & (df.Type=='Unprovoked')]

def get_location(row):
    time.sleep(.02)
    loc = str(row.Location) + ', ' + str(row.Area) + ', ' + str(row.Country)
    for _ in range(1):
        try:
            # try our first geocoder
            location = geolocator.geocode(loc)
            print(location)
            print(row.index)
            if not location==None:
                lat = location[1][0]
                long = location[1][1]
                return lat, long
            else:
                try:
                    # try our second geocoder if first one fails
                    location = geolocator.geocode2(loc)
                    print(location)
                    print(row.index)
                    if not location==None:
                        lat = location[1][0]
                        long = location[1][1]
                        return lat, long
                    else:
                        return None, None
                except:
                    return None, None
                    continue
        except:
            return None, None
            continue
        return row.latitude, row.longitude

# get our coordinates
df2['latitude'], df2['longitude'] = zip(*df2.apply(get_location,axis=1))

# write to a csv for future use b/c the API lookups take a longggg time
df2.to_csv('./sharks_coords.csv',index=False)

Taking a look at our results:

Area Location Country latitude longitude
0 North Carolina Somewhere between Hatteras and Beaufort USA 34.718217 -76.663819
1 Florida Gadsden Point, Tampa Bay USA 27.822248 -82.473150
2 Florida Palm Beach, Palm Beach County USA 26.705621 -80.036430
3 California Capistrano, Orange County USA 33.723150 -117.776661
4 Florida Mosquito Inlet (Ponce Inlet), Volusia County USA 29.096373 -80.936998

Now for some geo-plotting

To simplify things, we're only going to look at attacks in the USA. For plotting, we'll use the Matplotlib Basemap library. Matplotlib Basemap isn't as pretty as something like D3, (something to potentially revisit in a future post), but it's really easy to use and quickly lets us start visualizing our data.

# lets plot our attacks in the USA
from mpl_toolkits.basemap import Basemap
import matplotlib.pyplot as plt
%matplotlib inline

# create a new df with only locations we have coords for 
dfc = df2.dropna(axis=0, subset=['latitude'])

# USA Plot
USA_map = Basemap(projection='mill', resolution = 'l', llcrnrlon=-180, llcrnrlat=2, urcrnrlon=-62, urcrnrlat=60)

plt.figure(figsize=(18,18))

x1,y1 = USA_map(dfc['longitude'].tolist(), dfc['latitude'].tolist())
USA_map.plot(x1,y1, 'bo', markersize=7)

USA_map.drawlsmask(lakes=False)
USA_map.drawcoastlines(color='gray')
USA_map.fillcontinents(color = 'lightgrey')
USA_map.drawmapboundary()
USA_map.drawstates()
USA_map.bluemarble(alpha=.2)
USA_map.drawcountries()

Nice!!! Shark attacks seem to be pretty coastal....

Dealing with Landsharks!

Looking at our map above a bit more closely, we might notice some strange behavior....LANDSHARKS!

Landsharks, as seen in the 1975 SNL sketch

As comical as the idea of landsharks may be, they actually do exist!

Well kinda.

A specific species of shark, the Bull Shark is one of the only species of sharks that can swim in brackish water. They're commonly found in bays and canals around Sydney, Australia, have been found over 2,500 miles inland up the Amazon River, been held responsible for nibbles in the Ganges River in India, and even been seen in Alton, Illinois, 1,750 miles from the coast.

Frankly, its terrifying.

But it also complicates our analyses as we can no longer assume with complete confidence that all shark attacks occur on the coast.

Despite this possibility of a "landshark shark attack" occurring, we can reasonably assume that the vast majority of attacks occur in the ocean. To start detecting our geographic outliers, we'll use a clustering algorithm that builds on this assumption.

Note: There are tons of different ways to do this. Outlier detection is a science in itself and I encourage readers to test out other methods and models to better capture landshark outliers!

For outlier detection, we're going to use an algorithm called KDTree. Put simply, KDTree (k-dimensional tree), builds a k-dimensional binary search tree, that we can use to determine neighbors and distances between neighbors. More detail on the algorithm can be found here.

The plan is to use this algorithm to:

  1. Create a tree of N-nearest neighbors
  2. Create a vector with the distance to N-nearest neighbors for each attack
  3. Sum the distances and sort them to determine the points furthest from all others

The assumption is that the landshark points are furthest from all other points, so the sum of their nearest neighbors should be largest.

from sklearn import neighbors

coords = np.column_stack((dfc.latitude,dfc.longitude))

# create our tree
tree = neighbors.KDTree(coords,leaf_size=2)
dist, ind = tree.query(coords[7], k=6) # look at row 7    
print(ind[0])  # indices of 5 closest neighbors
print(sum(dist[0][1:]))  # distances to 5 closest neighbors, excluding the point itself

[  7 379 251 152 420]
0.130272705615

# to add the sum of distances, we'll loop through our list 
sums = []
for i in range(len(coords)):
    dist, ind = tree.query(coords[i], k=6)
    sums.append(sum(dist[0][1:]))

# add the sums back to our dataframe
dfc['dist_sums'] = pd.Series(sums, index=dfc.index)

#sort them so we can find the outlier-est outlier
outliers = dfc.sort(['dist_sums'], ascending=False)

Now that we've create our sorted outlier list, we can plot them to see how many we've nabbed.

# USA Plot
USA_map = Basemap(projection='mill', resolution = 'l', llcrnrlon=-180, llcrnrlat=2, urcrnrlon=-62, urcrnrlat=60)

plt.figure(figsize=(18,18))

# we'll plot the top 25 outliers in red and the rest in blue
x1,y1 = USA_map(outliers['longitude'][25:].tolist(), outliers['latitude'][25:].tolist())
x2,y2 = USA_map(outliers['longitude'][:25].tolist(), outliers['latitude'][:25].tolist())

USA_map.plot(x1,y1, 'bo', markersize=7, alpha=.2)
USA_map.plot(x2,y2, 'ro', markersize=7)

USA_map.drawlsmask(lakes=False)
USA_map.drawcoastlines(color='gray')
USA_map.fillcontinents(color = 'lightgrey')
USA_map.drawmapboundary()
USA_map.drawcountries()
USA_map.drawstates()
USA_map.bluemarble(alpha=.2)

Not bad. We've clearly more work to do but this definitely gets us well on our way.

We can also look specific areas in more detail simply by changing the longitude and latitude of the map corners.

CA_map = Basemap(projection='mill', resolution='l',
llcrnrlon=-128, llcrnrlat=33,
urcrnrlon=-113, urcrnrlat=44)

plt.figure(figsize=(18,10))

# coords for fatal attacks 
x1,y1 = CA_map(outliers[(outliers.Area=='California')]['longitude'][:10].tolist(),outliers[(outliers.Area=='California')]['latitude'][:10].tolist())
x2,y2 = CA_map(outliers[(outliers.Area=='California')]['longitude'][10:].tolist(),outliers[(outliers.Area=='California')]['latitude'][10:].tolist())


CA_map.plot(x2,y2, 'bo', markersize=5,alpha=.5)
CA_map.plot(x1,y1, 'ro', markersize=5)

CA_map.drawlsmask(lakes=False)
CA_map.drawcoastlines(color='gray')
CA_map.fillcontinents(color = 'lightgrey')
CA_map.drawmapboundary()
CA_map.drawstates()
CA_map.bluemarble(alpha=.2)

plt.show()

Again, we'll need to improve our outlier detection algorithm a bit, or perhaps use a different model altogether, but we've definitely made some solid progress.

Final Thoughts

Sharks are fascinating, incredible animals and I would be remiss to not mention in this post how dramatically disproportionate our fear of them is. Tens of millions of people go swimming, surfing, and diving every year and on average, only 7 people are killed by sharks every year in unprovoked attacks.

Geoplotting/mapping is a huge field and there are tons of fantastic resources out there to get started plotting things.

Python: Basemap tutorials

For more advanced plotting:

Python: So You’d Like To Make a Map Using Python by Sensitive Cities

D3: Let's Make a Map,from the legendary Mike Bostock

The Source Guide to Better Mapping from Open News

For all the scripts used to make this post, visit the github repo



Our Products


Rodeo: a native Python editor built for doing data science on your desktop.

Download it now!

ScienceOps: deploy predictive models in production applications without IT.

Learn More

Yhat (pronounced Y-hat) provides data science solutions that let data scientists deploy and integrate predictive models into applications without IT or custom coding.