The Yhat Blog


machine learning, data science, engineering


Exploring U.S. Traffic Fatality Data

by Ben Van Dyke |


About Ben: Ben is a Data Analyst at DataScience. When not digging into a dataset, you'll find him on a bicycle searching for funk records and the best tacos in LA.

Introduction and Inspiration

At a ChiPy event, Nick Bennett gave an excellent talk on traffic fatalities and how he attempts to visualize the publicly available data. The accompanying GitHub repo shows how he accessed and manipulated some of that data with Python tools and then used a couple of different web mapping services to visualize it. The talk prompted some informative comments from the audience and inspired me to further analyze the data myself.

The National Highway Traffic Safety Administration (NHTSA) maintains a fatality dataset called the Fatality Reporting System (FARS). It contains detailed information about every fatal traffic crash in the U.S., including the time of day, location with latitude and longitude, roadway type, and more. After tracking down the documentation and figuring out how to parse the dBase file format, I was ready to dig into the data. This notebook will hopefully provide some helpful Python analysis techniques as well as raise awareness of the very serious societal problem of roadway injury and death.

Additional Datasets Used

  • Federal Highway Administration - Vehicle Miles Traveled
  • Centers for Disease Control - Causes of Death in the U.S.
  • World Health Organization - Motor Vehicle Death Rates by Country

    import pandas as pd
    import numpy as np
    import matplotlib
    import matplotlib.pyplot as plt
    
    from __future__ import division, print_function
    
    matplotlib.style.use('fivethirtyeight')
    %matplotlib inline
    
    # Load the data into pandas DataFrame
    fatality_frame = pd.read_csv('data/accident.csv')
    

I adopted Nick's approach and used the dbfpy library to convert the FARS 2012 data (ftp) into csv form. The FARS data file is unfortunately titled ACCIDENT. This term implies that we have no control over these events, when in fact we do. Criminal acts such as reckless and drunken driving are not accidents and should not be excused as some inevitable trapping of modern life.

Motor Vehicles Are Second-Leading Cause of Death Due to Injury

# Number of traffic fatalities in the US in 2012 using pandas DataFrame sum function
total_traffic_fatalities = fatality_frame.FATALS.sum()
print("2012 Traffic Fatalities: ", total_traffic_fatalities)

2012 Traffic Fatalities: 33561

There were 33,561 traffic fatalities in the U.S. in 2012, or a little more than 11 for every 100,000 people. To put that in perspective, 39,260 women died from breast cancer and 29,720 men died from prostate cancer in 2013, according to the American Cancer Society. The fight against these cancers generates a lot of public awareness and fundraising. Fore example, in Chicago the lights on top of skyscrapers turn pink for a month every year. Contrast that with a general public apathy to the number of people dying in traffic crashes at rates comparable to the most-common forms of cancer.

In fact, traffic fatalities are the second-leading cause of death due to injury (non-health and disease related) in the U.S. The CDC has death statistics through the year 2010. Here's the bar plot showing fatality rates by injury:

# Get the rates
cdc_injury_frame = pd.read_csv('data/cdc_injuries_2010.txt',delimiter='\t')

# Calculate death rate to due to NaN and string values
cdc_injury_frame['Rate'] = cdc_injury_frame['Deaths']\
                                / (cdc_injury_frame['Population'] / 100000)
# Sort descending
cdc_sort = cdc_injury_frame.sort(columns='Rate', ascending=False)

cdc_rates = cdc_sort.set_index('Injury Mechanism & All Other Leading Causes')['Rate'].order()
# Plot the top 10
plt.figure(figsize=(12,6))
cdc_rates.iloc[-11:].plot(kind='barh',
               title='Motor Vehicles are Second-Leading Cause of Death Due to Injury')
plt.xlabel('Deaths per 100k people, 2010')
plt.ylabel('')
plt.show()

Causes of death due to injury in the US

Motor vehicle traffic is the second longest bar on the plot. Drug-related deaths make up the majority of poisoning deaths, and this number has increased substantially in recent years.

The U.S. Department of Defense provides historical and current data on wartime casualties. The major post-WWII conflicts (Korea, Vietnam, Persian Gulf, Iraq, Afghanistan) have resulted in over 100,000 total U.S. military servicemember deaths since 1950. In that same time span, more than 2.7 million people have been killed by motor vehicles in the U.S. according to data published by the NHTSA.

from matplotlib.ticker import FuncFormatter

# US casualties by conflict (all deaths - KIA and other)
us_war_casualties = {'Korea': 36574,
                    'Vietnam': 58220,
                    'Persian Gulf': 383,
                    'Iraq': 4424,
                    'Afghanistan': 2215}

us_traffic_deaths_1950_2015 = 2729062

trf_mil_deaths = pd.DataFrame.from_dict({'U.S. Military Wartime Casualties 1950-present': sum(us_war_casualties.values()),
                        'U.S. Traffic Deaths 1950-present': us_traffic_deaths_1950_2015}, orient='index')
trf_mil_deaths[0].order().plot(kind='barh')
plt.title('Motor Vehicles Have Claimed 25x More American Lives Than Wars Since 1950')
plt.xlabel('Total Deaths - (millions)')

def form_fun(x, p):

    return str((x / 1000000)) + ''

plt.gca().xaxis.set_major_formatter(FuncFormatter(form_fun))
plt.show()

Fatalities due to traffic v wartime casualities

Motor vehicles kill at a higher rate than firearms and at a comparable rate to drugs. Both of these other issues are discussed at length in the news media and by policymakers. We have a decades-long War on Drugs and recent renewed efforts on restricting assault weapons.

Why is there a lack of public awareness of the death toll caused by our driving culture?

That's a difficult question to answer. The automobile industry is a very important economic engine and source of national pride. The construction of the interstate system through urban areas and accompanying white flight to car-oriented suburbs likely had an impact as well. Since the 1950's, the majority of the built environment in this country has been designed specifically to increase the capacity for automobile travel, often at the expense of other modes. Perhaps we've become so dependent on our cars that we can't confront their deadly impact on our society at large. This is a question that can't be answered in this analysis, but it's important to consider at the same time.

U.S. Roads Much More Deadly Than International Counterparts

The 33,561 fatalities that resulted from motor vehicle collisions in 2012 in the U.S. is certainly a large number, but how does it compare to traffic-related fatalities in other countries in the U.S.'s peer group? The World Health Organization has that data. total

wvf = pd.read_csv('data/world_death_rates.csv', index_col=0)
plt.figure(figsize=(12,6))

ax = wvf['Death_Rate'].plot(kind='bar')
plt.ylabel("Traffic Deaths / 100,000 people")
plt.title("US Traffic Death Rates Nearly Double Those of Peer Group")
plt.xticks(rotation=0)
plt.xlabel('')

rects = ax.patches

def autolabel(rects):
    """Attach some labels."""

    for rect in rects:
        height = rect.get_height()
        plt.text(rect.get_x()+rect.get_width()/2., height - .3, '%0.1f'%height,
              ha='center', va='top', fontsize=14, color='w')

autolabel(rects)

plt.show()

US Traffic Deaths versus Wealthy Countries

The U.S. does not compare favorably at all against other wealthy countries with large populations. Even other countries with high automobile share, such as Australia and Canada, have nearly half the traffic death rate of the U.S.. The U.S. is wealthier by GDP per capita than the other nations in the chart, so why is our rate of traffic deaths so much higher?

It's not all bad news, though. Traffic fatality rates have actually been declining in the U.S. As recently as 2005, there were more than 40,000 fatalities.

# Load FARS fatality time series
ftf = pd.read_csv('data/fatal_trends.txt',delimiter='\t')
ftf = ftf.set_index('Year')

plt.figure(figsize=(12,6))
ftf['Fatality Rate per 100,000 Population'].plot()

plt.ylim(0)
plt.title('US Fatality Rate is Declining')
plt.ylabel('Deaths per 100k people')

plt.show()

US Fatality Rate over Time

The fatality rate has declined significantly since the early 1990's, with a sharp decrease in the second half of the 2000's.

f, axarr = plt.subplots(1,2,figsize=(12,4))
ftf['VMT (Trillions)'] = ftf['Vehicle Miles Traveled (Billions)'] / 1000
ftf['VMT (Trillions)'].plot(ax=axarr[0], title='Total VMT in the US is Leveling Off', color='black')
axarr[0].set_ylim(0)
axarr[0].set_xlabel('')
axarr[0].set_ylabel('Annual VMT (Trillions)')

ftf['Fatality Rate per 100 Million VMT'].plot(ax=axarr[1], title='Fatality Rate per VMT is Declining',
                                             )
axarr[1].set_xlabel('')
axarr[1].set_ylim(0)
axarr[1].set_ylabel('Deaths per 100M VMT')
plt.show()

png

The absolute number of fatalities has declined, but so has the fatality rate per vehicle miles traveled (VMT), which indicates that we are making progress towards safer roads. Since 1994, the fatality rate has dropped while VMT increased. In recent years, Americans are driving less, with several year-over-year decreases in CMTd since the mid-2000's. The continued decline in the fatality rate - even with a decreasing denominator - is an encouraging sign.

Drunk Driving

One of the first things that comes to mind when I think of traffic fatalities is drunk driving. From a young age, I recall being repeatedly warned about the dangers of drunk driving in school, on television, etc. Penalties are stiff, yet it does not seem to deter significant numbers of people from getting behind the wheel while intoxicated. The FARS data includes a drunk driver indicator, the value in the DRUNKEN_DR column indicates the number of drunk drivers involved in each fatal crash.

# Number of fatalities in crashes involving a drunken driver
drunk_driver_fatalities = fatality_frame.FATALS[fatality_frame.DRUNK_DR >= 1].sum()

print("Fatalities involving a drunk driver: ", drunk_driver_fatalities)
print("Percent of total traffic fatalities involving drunk driver: ",
      '{0:.1f}%'.format(drunk_driver_fatalities / total_traffic_fatalities * 100))

Fatalities involving a drunk driver: 10161 Percent of total traffic fatalities involving drunk driver: 30.3%

Nearly a third of all traffic fatalities involve a drunk driver. Despite all the education and public campaigns and increased enforcement, drunk driving is still taking a massive toll on human life every year.

What else can we learn about drunk driving from the data?

# pandas DataFrame pivot by hour that crash occurred and drunk driving
fatal_pivot = fatality_frame.pivot_table(index=['HOUR'], columns=['DRUNK_DR'],
                                         values='FATALS', aggfunc=np.sum)
# Sum the total number of drunk drivers involved
fatal_pivot['DRUNK_DR_SUM'] = fatal_pivot[[1,2,3,4]].sum(axis=1)

fp = fatal_pivot[[0,'DRUNK_DR_SUM']].iloc[:-1].copy()
fp.columns = ['No Drunk Driver', 'Drunk Driver']

plt.rcParams['figure.figsize'] = (12,6)
fp.plot()

plt.title('Drunk Driving Fatalities Peak in the Late Evening/Early Morning Hours')
plt.ylabel('Total Fatalities, 2012')
plt.xlabel('Hour')

plt.show()

Drunk Driving Fatalities by Time of Day

Clearly the late evening and early morning hours show high levels of drunken driving activity. Fatalities caused by drunk drivers are nearly double those caused by sober drivers between the hours of 2:00 and 4:00 a.m.

I have not been able to find VMT data by hour for the U.S., but this report from the Federal Highway Administration suggests that VMT in the late evening/early morning is a fraction of the peak volume during daytime commuting hours. On a per-VMT basis, the roads at night are more dangerous than the absolute numbers show, as the elevated fatality numbers are observed despite dramatically fewer people driving at those times.

# Now look at day of week
fatal_pivot = fatality_frame.pivot_table(index=['DAY_WEEK'],columns=['DRUNK_DR'],
                                         values='FATALS', aggfunc=np.sum)
# Sum the total number of drunk drivers involved
fatal_pivot['DRUNK_DR_SUM'] = fatal_pivot[[1,2,3,4]].sum(axis=1)
fp = fatal_pivot[[0,'DRUNK_DR_SUM']].copy()
fp.columns = ['No Drunk Driver', 'Drunk Driver']
# Days of week are indexed 1=Sunday, 2=Monday, ..., 6=Saturday
labels=['Sun','Mon','Tue','Wed','Thu','Fri','Sat']
fp.index = labels
fp.plot(kind='bar')

plt.xticks(rotation=0)
plt.ylabel('Total Fatalities, 2012')
plt.title('Drunk Driving Fatalities Peak on Weekends')

plt.show()

Drunk Driving Fatalities by Day

As you might expect, drunk driving fatalities peak substantially on the weekends, with non-drunk fatalities remaining relatively consistent across all days of week.

Weather Conditions

The FARS data contains natural environment features such as LGT_COND and WEATHER which encode information on light conditions (light, dusk, etc) and weather (rain, fog, etc), respectively. Intuitively, I expect more fatalities to occur in darker conditions or harsh weather.

weather_group = fatality_frame.groupby(['WEATHER']).sum()['FATALS']
labels = ['Clear', 'Rain', 'Sleet/Hail', 'Snow', 'Fog, Smog, Smoke',
          'Severe Crosswinds', 'Blowing Sand, Soil, Dirt', 'Other',
          'Cloudy', 'Blowing Snow', 'Not Reported', 'Unknown']
weather_group.index = labels

(weather_group.order() / weather_group.sum()).plot(kind='barh')

plt.title('Most Crashes Occur in Clear Weather Conditions')
plt.xlabel('Proportion of Total Crashes, 2012')

plt.show()

Car Crashes by Weather Condition

The majority of fatalities occur with no weather affecting visibility. Rain is the only precipitation form that shows up significantly. Perhaps people reduce driving during adverse conditions or drive more cautiously - leading to fewer deaths.

# pandas groupby on LGT_COND column
light_group = fatality_frame.groupby(['LGT_COND']).sum()['FATALS']
labels = ['Daylight','Dark - Not Lighted', 'Dark - Lighted',
          'Dawn', 'Dusk', 'Dark - Unknown Lighting', 'Other',
          'Not Reported', 'Unknown']
light_group.index = labels

(light_group.order() /  light_group.sum()).plot(kind='barh')

plt.title('Fatal Crashes are Evenly Split Between Daylight and Darkness')
plt.xlabel('Proportion of Total Crashes, 2012')

plt.show()

Car Crashes in Daylight versus Darkness

At first glance, it appears that more fatalities occur during daylight. However, adding the two dark categories together will almost equal the daylight number. Going back to the drunk driving analysis, I don't have VMT by hour, but it does look like VMT drops off at night. I expect that the rate per VMT is much higher in the dark than it is during the day, similar to what we saw for hour in the drunk driving component.

Conclusions

Performing some base-level analysis on the FARS data was an interesting exercise. I think there's additional avenues to pursue with the dataset itself, especially the geographic elements, as well as joining it with other datasets. Here's what I learned:

  • The U.S. has high rate of traffic fatalities relative to other wealthy countries
  • We have made real progress in reducing deaths in recent years
  • Drunk driving is still a huge social problem causing significant loss of life
  • Fatal drunk driving crashes are more likely to occur late at night, especially on the weekends
  • Weather is not a big factor, fatal crashes occur overwhelmingly in clear atmospheric conditions
  • Traffic deaths are more likely to occur in dark conditions than daylight.

About DataScience:

Based in Culver City, Calif., DataScience, Inc. combines human intellect with machine-powered analysis to extract information from data that drives real business results.

For more about our pro-bono work reducing traffic fatalities as well as our professional work for clients ranging from Tinder to Sonos, visit datascience.com.



Our Products


Rodeo: a native Python editor built for doing data science on your desktop.

Download it now!

ScienceOps: deploy predictive models in production applications without IT.

Learn More

Yhat (pronounced Y-hat) provides data science solutions that let data scientists deploy and integrate predictive models into applications without IT or custom coding.