The Yhat Blog


machine learning, data science, engineering


Data Wrangling 101: Using Python to Fetch, Manipulate & Visualize NBA Data

by Viraj Parekh


This is a basic tutorial that uses pandas and a few other packages to build a simple data pipeline for fetching NBA data. Even though the tutorial uses NBA data, you don't need to be an NBA fan to follow along; the same concepts and techniques can be applied to any project of your choosing.

This is meant to be used as a general tutorial for beginners with some experience in Python or R.

Step One: What data do we need?

The first step in any data project is getting an idea of what you want. We're going to focus on NBA data at the team level, on a game-by-game basis. In my experience, these team-level stats live in different places, which makes them hard to compare across games.

Our goal is to build team-level box scores so we can easily compare games against each other. Hopefully this will give some insight into how a team's play has changed over the course of the season, or make other kinds of analysis easier.

At a high level, this might look something like:

Game | Days Rest | Total Passes | Total Assists | Passes/Assist | EFG | Outcome

Next step: Where is the data coming from?

stats.nba.com has all the NBA data that's out there; the harder part is finding a quick way to fetch it and manipulate it into the form we need (which is what most of this tutorial will be about).

Analytics is fun, but everything around it can be tough.

We're going to use the nba_py package.

Huge shoutout to https://github.com/seemethere for putting this together.

This tutorial is going to focus on team stats, so let's play around a little bit to get a sense of what we're working with.

Start by importing the packages we'll need:

import pandas as pd
from nba_py import team

If you're using Jupyter notebooks, you can pip-install any packages you don't have straight from the notebook using:

%%bash
pip install nba_py

If you're using Yhat's Python IDE, Rodeo, you can install nba_py in the Packages tab.

Install packages in the Packages tab. No surprises here.

Referring to the docs, it looks like we'll need some sort of roster ID to get data for each team. This API hits an endpoint on the NBA's website, so the IDs are most likely in the URL:

(Unapologetic Knicks bias) Looking at the team page for the Knicks on stats.nba.com, here's the URL: http://stats.nba.com/team/#!/1610612752/
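If you ever want to pull that trailing ID out of a URL programmatically, a one-liner does it (`team_id_from_url` is a hypothetical helper, not part of nba_py):

```python
def team_id_from_url(url):
    """Grab the trailing numeric ID from a stats.nba.com team URL."""
    return int(url.rstrip('/').split('/')[-1])

print(team_id_from_url('http://stats.nba.com/team/#!/1610612752/'))  # 1610612752
```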

That number at the end looks like a team ID. Let's see how the passing data works:

class nba_py.team.TeamPassTracking(team_id, measure_type='Base', per_mode='PerGame', plus_minus='N', pace_adjust='N', rank='N', league_id='00', season='2016-17', season_type='Regular Season', po_round='0', outcome='', location='', month='0', season_segment='', date_from='', date_to='', opponent_team_id='0', vs_conference='', vs_division='', game_segment='', period='0', shot_clock_range='', last_n_games='0')

It exposes two methods: passes_made() and passes_recieved() (the misspelling is in the library itself).

knicks = team.TeamPassTracking(1610612752)

All the info is stored in the knicks object:

# DataFrame.head(N) returns the first N rows of a DataFrame
knicks.passes_made().head(10)
TEAM_ID TEAM_NAME PASS_TYPE G PASS_FROM PASS_TEAMMATE_PLAYER_ID FREQUENCY PASS AST FGM FGA FG_PCT FG2M FG2A FG2_PCT FG3M FG3A FG3_PCT
0 1610612752 New York Knicks made 64 Rose, Derrick 201565 0.144 56.73 4.42 6.30 14.02 0.449 4.34 8.64 0.503 1.95 5.38 0.363
1 1610612752 New York Knicks made 58 Jennings, Brandon 201943 0.111 48.22 4.93 7.09 15.47 0.458 5.31 10.50 0.506 1.78 4.97 0.358
2 1610612752 New York Knicks made 66 Porzingis, Kristaps 204001 0.106 40.61 1.47 3.29 7.65 0.430 2.56 5.50 0.466 0.73 2.15 0.338
3 1610612752 New York Knicks made 46 Noah, Joakim 201149 0.073 40.20 2.24 4.17 8.85 0.472 3.43 6.93 0.495 0.74 1.91 0.386
4 1610612752 New York Knicks made 72 Anthony, Carmelo 2546 0.102 35.83 2.88 4.18 9.65 0.433 3.13 6.99 0.447 1.06 2.67 0.396
5 1610612752 New York Knicks made 73 Lee, Courtney 201584 0.090 30.92 2.33 3.92 8.42 0.465 3.01 5.97 0.505 0.90 2.45 0.369
6 1610612752 New York Knicks made 68 Hernangomez, Willy 1626195 0.076 28.26 1.25 2.32 5.50 0.422 1.74 3.93 0.442 0.59 1.57 0.374
7 1610612752 New York Knicks made 46 Baker, Ron 1627758 0.045 24.93 1.87 2.61 5.72 0.456 1.93 3.80 0.509 0.67 1.91 0.352
8 1610612752 New York Knicks made 46 Thomas, Lance 202498 0.042 23.24 0.76 1.93 4.67 0.414 1.70 3.78 0.448 0.24 0.89 0.268
9 1610612752 New York Knicks made 75 O'Quinn, Kyle 203124 0.068 22.93 1.49 2.35 4.87 0.482 1.93 3.63 0.533 0.41 1.24 0.333

Did you know you can inspect, copy and save data frames in the History tab in Rodeo?

Referring back to the docs, these look like per-game passing averages. There's definitely a lot that can be done with this, but let's try to get the numbers for a specific game:

knicks_last_game = team.TeamPassTracking(1610612752, last_n_games =  1)
knicks_last_game.passes_made().head(10)
TEAM_ID TEAM_NAME PASS_TYPE G PASS_FROM PASS_TEAMMATE_PLAYER_ID FREQUENCY PASS AST FGM FGA FG_PCT FG2M FG2A FG2_PCT FG3M FG3A FG3_PCT
0 1610612752 New York Knicks made 1 Baker, Ron 1627758 0.212 72.0 6.0 7.0 15.0 0.467 7.0 11.0 0.636 0.0 4.0 0.000
1 1610612752 New York Knicks made 1 Ndour, Maurice 1626254 0.135 46.0 1.0 3.0 9.0 0.333 3.0 4.0 0.750 0.0 5.0 0.000
2 1610612752 New York Knicks made 1 Anthony, Carmelo 2546 0.126 43.0 2.0 5.0 16.0 0.313 4.0 13.0 0.308 1.0 3.0 0.333
3 1610612752 New York Knicks made 1 O'Quinn, Kyle 203124 0.118 40.0 5.0 5.0 6.0 0.833 4.0 4.0 1.000 1.0 2.0 0.500
4 1610612752 New York Knicks made 1 Lee, Courtney 201584 0.118 40.0 3.0 6.0 8.0 0.750 2.0 4.0 0.500 4.0 4.0 1.000
5 1610612752 New York Knicks made 1 Hernangomez, Willy 1626195 0.082 28.0 3.0 4.0 8.0 0.500 4.0 6.0 0.667 0.0 2.0 0.000
6 1610612752 New York Knicks made 1 Holiday, Justin 203200 0.071 24.0 3.0 4.0 7.0 0.571 4.0 6.0 0.667 0.0 1.0 0.000
7 1610612752 New York Knicks made 1 Kuzminskas, Mindaugas 1627851 0.059 20.0 2.0 2.0 6.0 0.333 2.0 5.0 0.400 0.0 1.0 0.000
8 1610612752 New York Knicks made 1 Randle, Chasson 1626184 0.044 15.0 0.0 0.0 1.0 0.000 0.0 1.0 0.000 0.0 0.0 NaN
9 1610612752 New York Knicks made 1 Vujacic, Sasha 2756 0.035 12.0 1.0 2.0 3.0 0.667 2.0 2.0 1.000 0.0 1.0 0.000

This looks clean enough to be wrangled into a form that can be worked with.

If we're trying to create a team-level box score, we're more than likely going to need to join tables together down the line; just something to keep in mind.
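Joins in pandas go through pd.merge. As a preview of the kind of left join we'll do later, here's a toy sketch (the table names and numbers are made up for illustration):

```python
import pandas as pd

# Hypothetical per-game tables keyed on GAME_DATE (values are made up)
passes = pd.DataFrame({'GAME_DATE': ['2017-03-31', '2017-04-02'],
                       'PASS': [282.0, 301.0]})
shots = pd.DataFrame({'GAME_DATE': ['2017-03-31'],
                      'OPEN_SHOTS': [28.0]})

# A left join keeps every row of `passes`; dates missing from `shots` get NaN
box = pd.merge(passes, shots, on='GAME_DATE', how='left')
print(box)
```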

Hitting the ShotTracking endpoint looks interesting:

knicks_id = 1610612752
knicks_shots = team.TeamShotTracking(knicks_id, last_n_games =  1)


knicks_shots.closest_defender_shooting()
TEAM_ID TEAM_NAME SORT_ORDER G CLOSE_DEF_DIST_RANGE FGA_FREQUENCY FGM FGA FG_PCT EFG_PCT FG2A_FREQUENCY FG2M FG2A FG2_PCT FG3A_FREQUENCY FG3M FG3A FG3_PCT
0 1610612752 New York Knicks 1 1 0-2 Feet - Very Tight 0.091 4.0 8.0 0.500 0.500 0.091 4.0 8.0 0.500 0.000 0.0 0.0 NaN
1 1610612752 New York Knicks 2 1 2-4 Feet - Tight 0.318 15.0 28.0 0.536 0.536 0.295 15.0 26.0 0.577 0.023 0.0 2.0 0.000
2 1610612752 New York Knicks 3 1 4-6 Feet - Open 0.409 16.0 36.0 0.444 0.500 0.250 12.0 22.0 0.545 0.159 4.0 14.0 0.286
3 1610612752 New York Knicks 4 1 6+ Feet - Wide Open 0.182 7.0 16.0 0.438 0.500 0.102 5.0 9.0 0.556 0.080 2.0 7.0 0.286

Following along in Rodeo? Your view should look something like this.

This looks interesting! We wanted EFG% (effective field goal percentage) in our original table, but it looks like we can get EFG% for open and covered shots. Let's group 'Open' and 'Wide Open' together, along with 'Tight' and 'Very Tight.'

Effective field goal percentage is a statistic that adjusts field goal percentage to account for the fact that three-point field goals are worth three points while two-point field goals are worth only two:

This might help answer questions like "Do teams hit more open shots when they win?"

df_grouped = knicks_shots.closest_defender_shooting()

df_grouped['OPEN'] = df_grouped['CLOSE_DEF_DIST_RANGE'].map(lambda x : True if 'Open' in x else False)
##This creates a new column  OPEN,  mapped from the 'CLOSE_DEF_DIST_RANGE' column.
##http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.map.html


df_grouped
TEAM_ID TEAM_NAME SORT_ORDER G CLOSE_DEF_DIST_RANGE FGA_FREQUENCY FGM FGA FG_PCT EFG_PCT FG2A_FREQUENCY FG2M FG2A FG2_PCT FG3A_FREQUENCY FG3M FG3A FG3_PCT OPEN
0 1610612752 New York Knicks 1 1 0-2 Feet - Very Tight 0.091 4.0 8.0 0.500 0.500 0.091 4.0 8.0 0.500 0.000 0.0 0.0 NaN False
1 1610612752 New York Knicks 2 1 2-4 Feet - Tight 0.318 15.0 28.0 0.536 0.536 0.295 15.0 26.0 0.577 0.023 0.0 2.0 0.000 False
2 1610612752 New York Knicks 3 1 4-6 Feet - Open 0.409 16.0 36.0 0.444 0.500 0.250 12.0 22.0 0.545 0.159 4.0 14.0 0.286 True
3 1610612752 New York Knicks 4 1 6+ Feet - Wide Open 0.182 7.0 16.0 0.438 0.500 0.102 5.0 9.0 0.556 0.080 2.0 7.0 0.286 True

The last column, 'OPEN', gives us the information we need; now we can aggregate on it. Let's get the total number of open shots.

total_open_shots  = df_grouped.loc[df_grouped['OPEN'] == True, 'FGA'].sum()
print(total_open_shots)

52.0

That looks like it worked. Similarly, we can get the total number of "covered" shots taken (it looks like it's a lot higher... nothing surprising there).

Keep in mind, this is a bit misleading, as layups and other shots near the basket are more likely to have a nearby defender.
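The covered count comes from the same boolean column. Here's a self-contained version with the table's FGA values hard-coded, so it runs without hitting the API:

```python
import pandas as pd

# Stand-in for knicks_shots.closest_defender_shooting(), using the values above
df_grouped = pd.DataFrame({
    'CLOSE_DEF_DIST_RANGE': ['0-2 Feet - Very Tight', '2-4 Feet - Tight',
                             '4-6 Feet - Open', '6+ Feet - Wide Open'],
    'FGA': [8.0, 28.0, 36.0, 16.0],
})
df_grouped['OPEN'] = df_grouped['CLOSE_DEF_DIST_RANGE'].map(lambda x: 'Open' in x)

# Covered shots are just the rows where OPEN is False
total_covered_shots = df_grouped.loc[~df_grouped['OPEN'], 'FGA'].sum()
print(total_covered_shots)  # 36.0
```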

Referring to the definition for EFG%:

$$EFG = \frac{(FGM + .5 * 3PM)}{FGA}$$

We definitely have all the information we need to compute this for open and covered shots:

#Applying the formula above to the open and covered rows:
open_efg = (df_grouped.loc[df_grouped['OPEN']== True, 'FGM'].sum() + (.5 * df_grouped.loc[df_grouped['OPEN']== True, 'FG3M'].sum()))/(df_grouped.loc[df_grouped['OPEN']== True, 'FGA'].sum())
covered_efg = (df_grouped.loc[df_grouped['OPEN']== False, 'FGM'].sum() + (.5 * df_grouped.loc[df_grouped['OPEN']== False, 'FG3M'].sum()))/(df_grouped.loc[df_grouped['OPEN']== False, 'FGA'].sum())

print(open_efg)
print(covered_efg)


0.5
0.527777777778

Interesting... shooting better with a defender nearby suggests there's more to the story. Then again, nothing about the Knicks ever seems to make sense.

Referring back to the original plan, it looks like we have most of the stats we set out to get. However, we still haven't addressed:

1) Who was the game against? Who won?
2) How many days rest did each team have?
3) How are we going to get all this data together?

From the looks of it, there isn't anything in the nba_py team modules we're using that can be directly used as an identifier.

However, it looks like we can get stats for date ranges. To test this, let's look at a single game the Knicks played on Sunday, January 29th:

date = '2017-01-29'
knicks_jan = team.TeamShotTracking(knicks_id, date_from = date, date_to =  date)

knicks_jan_shots =  knicks_jan.closest_defender_shooting()

knicks_jan_shots
TEAM_ID TEAM_NAME SORT_ORDER G CLOSE_DEF_DIST_RANGE FGA_FREQUENCY FGM FGA FG_PCT EFG_PCT FG2A_FREQUENCY FG2M FG2A FG2_PCT FG3A_FREQUENCY FG3M FG3A FG3_PCT
0 1610612752 New York Knicks 1 1 0-2 Feet - Very Tight 0.156 6.0 20.0 0.300 0.300 0.148 6.0 19.0 0.316 0.008 0.0 1.0 0.000
1 1610612752 New York Knicks 2 1 2-4 Feet - Tight 0.344 23.0 44.0 0.523 0.591 0.258 17.0 33.0 0.515 0.086 6.0 11.0 0.545
2 1610612752 New York Knicks 3 1 4-6 Feet - Open 0.320 13.0 41.0 0.317 0.390 0.156 7.0 20.0 0.350 0.164 6.0 21.0 0.286
3 1610612752 New York Knicks 4 1 6+ Feet - Wide Open 0.180 9.0 23.0 0.391 0.522 0.039 3.0 5.0 0.600 0.141 6.0 18.0 0.333

A quick check of the box score confirms that the Knicks took a total of 128 shots, so it looks like passing a date range will work. We'll just need to figure out which dates to pass in.

We still don't know what the outcome was, so let's jump back into the docs to see if another module will help out.

#Hitting another endpoint
knicks_log = team.TeamGameLogs(knicks_id)

knicks_log.info()
Team_ID Game_ID GAME_DATE MATCHUP WL W L W_PCT MIN FGM ... FT_PCT OREB DREB REB AST STL BLK TOV PF PTS
0 1610612752 0021601160 APR 04, 2017 NYK vs. CHI W 30 48 0.385 240 42 ... 0.625 16 37 53 26 5 7 15 22 100
1 1610612752 0021601145 APR 02, 2017 NYK vs. BOS L 29 48 0.377 240 33 ... 0.840 8 24 32 20 12 2 11 20 94
2 1610612752 0021601133 MAR 31, 2017 NYK @ MIA W 29 47 0.382 240 38 ... 0.941 8 31 39 25 9 5 14 18 98
3 1610612752 0021601115 MAR 29, 2017 NYK vs. MIA L 28 47 0.373 240 33 ... 0.810 17 35 52 19 2 6 14 16 88
4 1610612752 0021601098 MAR 27, 2017 NYK vs. DET W 28 46 0.378 240 45 ... 0.923 4 33 37 26 13 5 12 16 109
5 1610612752 0021601085 MAR 25, 2017 NYK @ SAS L 27 46 0.370 240 41 ... 0.867 12 33 45 24 6 5 16 16 98
6 1610612752 0021601071 MAR 23, 2017 NYK @ POR L 27 45 0.375 240 36 ... 0.900 9 31 40 23 5 9 11 20 95
7 1610612752 0021601066 MAR 22, 2017 NYK @ UTA L 27 44 0.380 240 38 ... 0.889 9 27 36 19 5 1 11 26 101
8 1610612752 0021601050 MAR 20, 2017 NYK @ LAC L 27 43 0.386 240 40 ... 0.792 14 34 48 24 6 1 12 19 105
9 1610612752 0021601016 MAR 16, 2017 NYK vs. BKN L 27 42 0.391 240 41 ... 0.962 5 29 34 20 6 4 7 26 110
10 1610612752 0021601001 MAR 14, 2017 NYK vs. IND W 27 41 0.397 240 35 ... 0.615 11 41 52 21 8 4 14 15 87
11 1610612752 0021600986 MAR 12, 2017 NYK @ BKN L 26 41 0.388 240 39 ... 0.813 11 32 43 22 5 8 9 20 112
12 1610612752 0021600975 MAR 11, 2017 NYK @ DET L 26 40 0.394 240 36 ... 0.636 8 36 44 26 4 7 18 18 92
13 1610612752 0021600952 MAR 08, 2017 NYK @ MIL L 26 39 0.400 240 39 ... 0.667 10 33 43 22 4 6 15 20 93
14 1610612752 0021600935 MAR 06, 2017 NYK @ ORL W 26 38 0.406 240 40 ... 0.964 12 33 45 26 6 1 9 23 113
15 1610612752 0021600928 MAR 05, 2017 NYK vs. GSW L 25 38 0.397 240 39 ... 0.800 12 35 47 18 5 6 15 20 105
16 1610612752 0021600909 MAR 03, 2017 NYK @ PHI L 25 37 0.403 240 33 ... 0.879 9 32 41 14 10 3 10 20 102
17 1610612752 0021600895 MAR 01, 2017 NYK @ ORL W 25 36 0.410 240 34 ... 0.806 13 37 50 21 9 3 11 16 101
18 1610612752 0021600882 FEB 27, 2017 NYK vs. TOR L 24 36 0.400 240 33 ... 0.842 8 32 40 17 10 6 17 19 91
19 1610612752 0021600868 FEB 25, 2017 NYK vs. PHI W 24 35 0.407 240 43 ... 0.783 10 34 44 21 6 7 11 22 110
20 1610612752 0021600853 FEB 23, 2017 NYK @ CLE L 23 35 0.397 240 42 ... 0.706 16 34 50 24 4 7 12 19 104
21 1610612752 0021600845 FEB 15, 2017 NYK @ OKC L 23 34 0.404 240 41 ... 0.857 6 33 39 19 8 12 15 21 105
22 1610612752 0021600817 FEB 12, 2017 NYK vs. SAS W 23 33 0.411 240 34 ... 0.810 5 39 44 18 5 8 19 19 94
23 1610612752 0021600800 FEB 10, 2017 NYK vs. DEN L 22 33 0.400 240 52 ... 0.600 10 23 33 36 10 5 10 14 123
24 1610612752 0021600791 FEB 08, 2017 NYK vs. LAC L 22 32 0.407 240 46 ... 0.833 12 29 41 25 9 5 11 22 115
25 1610612752 0021600768 FEB 06, 2017 NYK vs. LAL L 22 31 0.415 240 37 ... 0.788 6 34 40 16 4 4 16 24 107
26 1610612752 0021600759 FEB 04, 2017 NYK vs. CLE L 22 30 0.423 240 39 ... 0.500 13 29 42 23 9 7 10 20 104
27 1610612752 0021600733 FEB 01, 2017 NYK @ BKN W 22 29 0.431 240 35 ... 0.613 21 37 58 23 16 7 13 18 95
28 1610612752 0021600724 JAN 31, 2017 NYK @ WAS L 21 29 0.420 240 34 ... 0.800 22 29 51 18 8 2 12 17 101
29 1610612752 0021600711 JAN 29, 2017 NYK @ ATL L 21 28 0.429 340 51 ... 0.826 15 48 63 32 9 11 12 39 139
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
48 1610612752 0021600456 DEC 25, 2016 NYK vs. BOS L 16 14 0.533 240 41 ... 0.889 17 32 49 11 5 6 17 23 114
49 1610612752 0021600438 DEC 22, 2016 NYK vs. ORL W 16 13 0.552 240 41 ... 0.882 18 34 52 26 9 9 15 18 106
50 1610612752 0021600421 DEC 20, 2016 NYK vs. IND W 15 13 0.536 240 44 ... 0.810 4 41 45 24 6 8 14 18 118
51 1610612752 0021600404 DEC 17, 2016 NYK @ DEN L 14 13 0.519 240 35 ... 0.923 9 26 35 18 6 4 11 27 114
52 1610612752 0021600388 DEC 15, 2016 NYK @ GSW L 14 12 0.538 240 38 ... 0.474 14 35 49 19 10 5 11 10 90
53 1610612752 0021600372 DEC 13, 2016 NYK @ PHX L 14 11 0.560 265 38 ... 0.737 11 32 43 23 10 4 13 27 111
54 1610612752 0021600360 DEC 11, 2016 NYK @ LAL W 14 10 0.583 240 41 ... 0.839 8 36 44 21 9 11 10 15 118
55 1610612752 0021600345 DEC 09, 2016 NYK @ SAC W 13 10 0.565 240 36 ... 0.840 12 42 54 22 3 6 16 25 103
56 1610612752 0021600327 DEC 07, 2016 NYK vs. CLE L 12 10 0.545 240 35 ... 0.867 13 30 43 22 6 3 16 22 94
57 1610612752 0021600316 DEC 06, 2016 NYK @ MIA W 12 9 0.571 240 48 ... 0.688 18 35 53 22 6 6 10 18 114
58 1610612752 0021600302 DEC 04, 2016 NYK vs. SAC W 11 9 0.550 240 39 ... 0.708 14 44 58 20 5 10 18 25 106
59 1610612752 0021600285 DEC 02, 2016 NYK vs. MIN W 10 9 0.526 240 41 ... 0.800 11 32 43 26 9 6 15 18 118
60 1610612752 0021600271 NOV 30, 2016 NYK @ MIN W 9 9 0.500 240 41 ... 0.733 12 27 39 24 8 3 13 26 106
61 1610612752 0021600255 NOV 28, 2016 NYK vs. OKC L 8 9 0.471 240 36 ... 0.893 12 28 40 20 9 11 5 16 103
62 1610612752 0021600241 NOV 26, 2016 NYK @ CHA L 8 8 0.500 240 37 ... 0.800 11 36 47 26 9 6 8 27 102
63 1610612752 0021600228 NOV 25, 2016 NYK vs. CHA W 8 7 0.533 265 45 ... 0.923 13 42 55 26 9 8 16 21 113
64 1610612752 0021600208 NOV 22, 2016 NYK vs. POR W 7 7 0.500 240 45 ... 1.000 10 33 43 26 8 5 13 23 107
65 1610612752 0021600193 NOV 20, 2016 NYK vs. ATL W 6 7 0.462 240 42 ... 0.714 11 39 50 21 8 1 15 23 104
66 1610612752 0021600169 NOV 17, 2016 NYK @ WAS L 5 7 0.417 240 41 ... 0.900 10 26 36 23 9 1 13 20 112
67 1610612752 0021600162 NOV 16, 2016 NYK vs. DET W 5 6 0.455 240 42 ... 0.632 19 33 52 24 8 9 9 11 105
68 1610612752 0021600146 NOV 14, 2016 NYK vs. DAL W 4 6 0.400 240 34 ... 0.889 14 37 51 18 5 5 17 16 93
69 1610612752 0021600131 NOV 12, 2016 NYK @ TOR L 3 6 0.333 240 44 ... 0.750 17 32 49 19 3 2 16 23 107
70 1610612752 0021600125 NOV 11, 2016 NYK @ BOS L 3 5 0.375 240 33 ... 0.882 21 36 57 19 7 11 25 26 87
71 1610612752 0021600106 NOV 09, 2016 NYK vs. BKN W 3 4 0.429 240 44 ... 0.706 9 41 50 25 11 5 14 21 110
72 1610612752 0021600087 NOV 06, 2016 NYK vs. UTA L 2 4 0.333 240 42 ... 0.895 10 29 39 18 8 5 12 26 109
73 1610612752 0021600073 NOV 04, 2016 NYK @ CHI W 2 3 0.400 240 46 ... 0.762 11 29 40 32 7 2 5 23 117
74 1610612752 0021600058 NOV 02, 2016 NYK vs. HOU L 1 3 0.250 240 37 ... 0.680 7 27 34 18 10 6 16 22 99
75 1610612752 0021600050 NOV 01, 2016 NYK @ DET L 1 2 0.333 240 35 ... 0.800 8 35 43 18 6 9 11 20 89
76 1610612752 0021600028 OCT 29, 2016 NYK vs. MEM W 1 1 0.500 240 40 ... 0.641 6 35 41 24 4 4 12 25 111
77 1610612752 0021600001 OCT 25, 2016 NYK @ CLE L 0 1 0.000 240 32 ... 0.750 13 29 42 17 6 6 18 22 88

78 rows × 27 columns

Looks like this can be manipulated to get rest days:

df_game_log = knicks_log.info()

df_game_log['GAME_DATE'] = pd.to_datetime(df_game_log['GAME_DATE'])  ## convert the column to a datetime dtype

df_game_log['DAYS_REST'] = df_game_log['GAME_DATE'] - df_game_log['GAME_DATE'].shift(-1)  ## rows are newest-first, so shift(-1) pairs each game with the previous one

df_game_log.head()
Team_ID Game_ID GAME_DATE MATCHUP WL W L W_PCT MIN FGM ... OREB DREB REB AST STL BLK TOV PF PTS DAYS_REST
0 1610612752 0021601160 2017-04-04 NYK vs. CHI W 30 48 0.385 240 42 ... 16 37 53 26 5 7 15 22 100 2 days
1 1610612752 0021601145 2017-04-02 NYK vs. BOS L 29 48 0.377 240 33 ... 8 24 32 20 12 2 11 20 94 2 days
2 1610612752 0021601133 2017-03-31 NYK @ MIA W 29 47 0.382 240 38 ... 8 31 39 25 9 5 14 18 98 2 days
3 1610612752 0021601115 2017-03-29 NYK vs. MIA L 28 47 0.373 240 33 ... 17 35 52 19 2 6 14 16 88 2 days
4 1610612752 0021601098 2017-03-27 NYK vs. DET W 28 46 0.378 240 45 ... 4 33 37 26 13 5 12 16 109 2 days

5 rows × 28 columns

df_game_log.dtypes




Team_ID                int64
Game_ID               object
GAME_DATE     datetime64[ns]
MATCHUP               object
WL                    object
W                      int64
L                      int64
W_PCT                float64
MIN                    int64
FGM                    int64
FGA                    int64
FG_PCT               float64
FG3M                   int64
FG3A                   int64
FG3_PCT              float64
FTM                    int64
FTA                    int64
FT_PCT               float64
OREB                   int64
DREB                   int64
REB                    int64
AST                    int64
STL                    int64
BLK                    int64
TOV                    int64
PF                     int64
PTS                    int64
DAYS_REST    timedelta64[ns]
dtype: object


#We have the information we need, but it's stored as a timedelta. To convert it to a plain number of days:
df_game_log['DAYS_REST'] =  df_game_log['DAYS_REST'].astype('timedelta64[D]')
df_game_log.dtypes


Team_ID               int64
Game_ID              object
GAME_DATE    datetime64[ns]
MATCHUP              object
WL                   object
W                     int64
L                     int64
W_PCT               float64
MIN                   int64
FGM                   int64
FGA                   int64
FG_PCT              float64
FG3M                  int64
FG3A                  int64
FG3_PCT             float64
FTM                   int64
FTA                   int64
FT_PCT              float64
OREB                  int64
DREB                  int64
REB                   int64
AST                   int64
STL                   int64
BLK                   int64
TOV                   int64
PF                    int64
PTS                   int64
DAYS_REST           float64
dtype: object
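Note that newer pandas releases reject `astype('timedelta64[D]')` (only s/ms/us/ns resolutions are supported there); `.dt.days` gets the same number either way. A toy sketch with made-up dates:

```python
import pandas as pd

# Made-up game log in reverse-chronological order, like TeamGameLogs().info()
log = pd.DataFrame({'GAME_DATE': pd.to_datetime(
    ['2017-04-04', '2017-04-02', '2017-03-31', '2017-03-29'])})

# Rows are newest-first, so shift(-1) lines each game up with the one before it
rest = log['GAME_DATE'] - log['GAME_DATE'].shift(-1)
log['DAYS_REST'] = rest.dt.days  # plain float days; NaN for the oldest row
print(log['DAYS_REST'].tolist())  # [2.0, 2.0, 2.0, nan]
```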

It looks like this will get us the info for every game. We'll start by appending the information for a single game and then try to do it for all dates:

df_game_log.head()
Team_ID Game_ID GAME_DATE MATCHUP WL W L W_PCT MIN FGM ... OREB DREB REB AST STL BLK TOV PF PTS DAYS_REST
0 1610612752 0021601160 2017-04-04 NYK vs. CHI W 30 48 0.385 240 42 ... 16 37 53 26 5 7 15 22 100 2.0
1 1610612752 0021601145 2017-04-02 NYK vs. BOS L 29 48 0.377 240 33 ... 8 24 32 20 12 2 11 20 94 2.0
2 1610612752 0021601133 2017-03-31 NYK @ MIA W 29 47 0.382 240 38 ... 8 31 39 25 9 5 14 18 98 2.0
3 1610612752 0021601115 2017-03-29 NYK vs. MIA L 28 47 0.373 240 33 ... 17 35 52 19 2 6 14 16 88 2.0
4 1610612752 0021601098 2017-03-27 NYK vs. DET W 28 46 0.378 240 45 ... 4 33 37 26 13 5 12 16 109 2.0

5 rows × 28 columns

#Get the dates from the game logs and pass them into the other functions:

dates = df_game_log['GAME_DATE']

print(len(dates))


78


#We have the first date, so now to get the relevant passing and shot info we need:
date = dates[2] ##picking a random date

game_info = team.TeamPassTracking(knicks_id, date_from =date, date_to = date).passes_made()
game_info['GAME_DATE'] = date


game_info
TEAM_ID TEAM_NAME PASS_TYPE G PASS_FROM PASS_TEAMMATE_PLAYER_ID FREQUENCY PASS AST FGM FGA FG_PCT FG2M FG2A FG2_PCT FG3M FG3A FG3_PCT GAME_DATE
0 1610612752 New York Knicks made 1 Baker, Ron 1627758 0.238 67.0 6.0 7.0 18.0 0.389 5.0 14.0 0.357 2.0 4.0 0.5 2017-03-31
1 1610612752 New York Knicks made 1 Vujacic, Sasha 2756 0.149 42.0 7.0 8.0 15.0 0.533 5.0 9.0 0.556 3.0 6.0 0.5 2017-03-31
2 1610612752 New York Knicks made 1 Hernangomez, Willy 1626195 0.131 37.0 2.0 3.0 5.0 0.600 3.0 5.0 0.600 0.0 0.0 NaN 2017-03-31
3 1610612752 New York Knicks made 1 Porzingis, Kristaps 204001 0.113 32.0 3.0 4.0 9.0 0.444 4.0 7.0 0.571 0.0 2.0 0.0 2017-03-31
4 1610612752 New York Knicks made 1 Lee, Courtney 201584 0.106 30.0 1.0 4.0 6.0 0.667 4.0 5.0 0.800 0.0 1.0 0.0 2017-03-31
5 1610612752 New York Knicks made 1 O'Quinn, Kyle 203124 0.096 27.0 3.0 3.0 5.0 0.600 3.0 5.0 0.600 0.0 0.0 NaN 2017-03-31
6 1610612752 New York Knicks made 1 Holiday, Justin 203200 0.082 23.0 2.0 4.0 5.0 0.800 3.0 3.0 1.000 1.0 2.0 0.5 2017-03-31
7 1610612752 New York Knicks made 1 Randle, Chasson 1626184 0.057 16.0 0.0 0.0 2.0 0.000 0.0 2.0 0.000 0.0 0.0 NaN 2017-03-31
8 1610612752 New York Knicks made 1 Ndour, Maurice 1626254 0.028 8.0 1.0 1.0 2.0 0.500 1.0 2.0 0.500 0.0 0.0 NaN 2017-03-31
##Sum everything by GAME_DATE, similar to SQL-style aggregation/groupby:
df_sum = game_info.groupby(['GAME_DATE']).sum()


df_sum.reset_index(level =  0,  inplace =  True)
df_sum
GAME_DATE TEAM_ID G PASS_TEAMMATE_PLAYER_ID FREQUENCY PASS AST FGM FGA FG_PCT FG2M FG2A FG2_PCT FG3M FG3A FG3_PCT
0 2017-03-31 14495514768 9 7321056 1.0 282.0 25.0 34.0 67.0 4.533 28.0 52.0 4.984 6.0 15.0 1.5

When we merge this row back into the bigger dataframe, we can drop the columns we don't need.
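As an aside, the groupby-then-reset_index pattern above can be written in one step with `as_index=False`. A small sketch with made-up per-player numbers:

```python
import pandas as pd

# Made-up per-player passing rows for a single game
game_info = pd.DataFrame({
    'GAME_DATE': ['2017-03-31'] * 3,
    'PASS': [67.0, 42.0, 37.0],
    'AST': [6.0, 7.0, 2.0],
})

# as_index=False keeps GAME_DATE as a column, so no reset_index is needed
df_sum = game_info.groupby('GAME_DATE', as_index=False).sum()
print(df_sum)  # one row: PASS 146.0, AST 15.0
```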

shot_info = team.TeamShotTracking(knicks_id, date_from = date, date_to = date).closest_defender_shooting()


shot_info
TEAM_ID TEAM_NAME SORT_ORDER G CLOSE_DEF_DIST_RANGE FGA_FREQUENCY FGM FGA FG_PCT EFG_PCT FG2A_FREQUENCY FG2M FG2A FG2_PCT FG3A_FREQUENCY FG3M FG3A FG3_PCT
0 1610612752 New York Knicks 1 1 0-2 Feet - Very Tight 0.200 8.0 16.0 0.500 0.500 0.200 8.0 16.0 0.500 0.000 0.0 0.0 NaN
1 1610612752 New York Knicks 2 1 2-4 Feet - Tight 0.450 15.0 36.0 0.417 0.417 0.425 15.0 34.0 0.441 0.025 0.0 2.0 0.0
2 1610612752 New York Knicks 3 1 4-6 Feet - Open 0.263 11.0 21.0 0.524 0.619 0.138 7.0 11.0 0.636 0.125 4.0 10.0 0.4
3 1610612752 New York Knicks 4 1 6+ Feet - Wide Open 0.088 4.0 7.0 0.571 0.714 0.038 2.0 3.0 0.667 0.050 2.0 4.0 0.5
#From  earlier
shot_info['OPEN'] = shot_info['CLOSE_DEF_DIST_RANGE'].map(lambda  x: True if 'Open' in x else False)
df_sum['OPEN_SHOTS'] = shot_info.loc[shot_info['OPEN']== True, 'FGA'].sum()
df_sum['OPEN_EFG'] = (shot_info.loc[shot_info['OPEN']== True, 'FGM'].sum() + (.5 * shot_info.loc[shot_info['OPEN']== True, 'FG3M'].sum()))/(shot_info.loc[shot_info['OPEN']== True, 'FGA'].sum())
df_sum['COVERED_EFG']= (shot_info.loc[shot_info['OPEN']== False, 'FGM'].sum() + (.5 * shot_info.loc[shot_info['OPEN']== False, 'FG3M'].sum()))/(shot_info.loc[shot_info['OPEN']== False, 'FGA'].sum())


df_sum
GAME_DATE TEAM_ID G PASS_TEAMMATE_PLAYER_ID FREQUENCY PASS AST FGM FGA FG_PCT FG2M FG2A FG2_PCT FG3M FG3A FG3_PCT OPEN_SHOTS OPEN_EFG COVERED_EFG
0 2017-03-31 14495514768 9 7321056 1.0 282.0 25.0 34.0 67.0 4.533 28.0 52.0 4.984 6.0 15.0 1.5 28.0 0.642857 0.442308

Now to join the columns we need back on. This works like a SQL left join.

df_custom_boxscore  =  pd.merge(df_game_log, df_sum[['PASS', 'FG2M', 'OPEN_SHOTS' ,'OPEN_EFG', 'COVERED_EFG']], how = 'left', left_on = df_game_log['GAME_DATE'],  right_on =  df_sum['GAME_DATE'])


df_custom_boxscore.head(10)
Team_ID Game_ID GAME_DATE MATCHUP WL W L W_PCT MIN FGM ... BLK TOV PF PTS DAYS_REST PASS FG2M OPEN_SHOTS OPEN_EFG COVERED_EFG
0 1610612752 0021601160 2017-04-04 NYK vs. CHI W 30 48 0.385 240 42 ... 7 15 22 100 2.0 NaN NaN NaN NaN NaN
1 1610612752 0021601145 2017-04-02 NYK vs. BOS L 29 48 0.377 240 33 ... 2 11 20 94 2.0 NaN NaN NaN NaN NaN
2 1610612752 0021601133 2017-03-31 NYK @ MIA W 29 47 0.382 240 38 ... 5 14 18 98 2.0 282.0 28.0 28.0 0.642857 0.442308
3 1610612752 0021601115 2017-03-29 NYK vs. MIA L 28 47 0.373 240 33 ... 6 14 16 88 2.0 NaN NaN NaN NaN NaN
4 1610612752 0021601098 2017-03-27 NYK vs. DET W 28 46 0.378 240 45 ... 5 12 16 109 2.0 NaN NaN NaN NaN NaN
5 1610612752 0021601085 2017-03-25 NYK @ SAS L 27 46 0.370 240 41 ... 5 16 16 98 2.0 NaN NaN NaN NaN NaN
6 1610612752 0021601071 2017-03-23 NYK @ POR L 27 45 0.375 240 36 ... 9 11 20 95 1.0 NaN NaN NaN NaN NaN
7 1610612752 0021601066 2017-03-22 NYK @ UTA L 27 44 0.380 240 38 ... 1 11 26 101 2.0 NaN NaN NaN NaN NaN
8 1610612752 0021601050 2017-03-20 NYK @ LAC L 27 43 0.386 240 40 ... 1 12 19 105 4.0 NaN NaN NaN NaN NaN
9 1610612752 0021601016 2017-03-16 NYK vs. BKN L 27 42 0.391 240 41 ... 4 7 26 110 2.0 NaN NaN NaN NaN NaN

10 rows × 33 columns

Looks like everything joined correctly for exactly the date we chose. Let's make some modifications and then work on a script to join the rest of the dates.

df_custom_boxscore['PASS_AST'] = df_custom_boxscore['PASS'] / df_custom_boxscore['AST']

#It's easier to work with dichotomous variables as binaries instead of letters
df_custom_boxscore['RESULT'] = df_custom_boxscore['WL'].map(lambda x: 1 if 'W' in x else 0)

We should be good to go!

Put all the steps above into a function:

def custom_boxscore(roster_id):

    game_logs  = team.TeamGameLogs(roster_id)

    df_game_logs = game_logs.info()
    df_game_logs['GAME_DATE'] =  pd.to_datetime(df_game_logs['GAME_DATE'])
    df_game_logs['DAYS_REST'] =  df_game_logs['GAME_DATE'] - df_game_logs['GAME_DATE'].shift(-1)
    df_game_logs['DAYS_REST'] =  df_game_logs['DAYS_REST'].astype('timedelta64[D]')

    ##Just like before, that should get us the gamelogs we need and the rest days column

    ##Now to loop through the list of dates for our other stats

    ##This will build up a dataframe of the custom stats and join that to the gamelogs
    df_all =pd.DataFrame() ##blank dataframe

    dates = df_game_logs['GAME_DATE']

    for date in dates:

        game_info = team.TeamPassTracking(roster_id,  date_from=date, date_to=date).passes_made()
        game_info['GAME_DATE'] = date ## We need to append the date to this so we can  join back

        temp_df = game_info.groupby(['GAME_DATE']).sum()
        temp_df.reset_index(level =  0,  inplace =  True)

        ##now to get the shot info. For the most part, we're just reusing code we've already written
        open_info =  team.TeamShotTracking(roster_id,date_from =date,  date_to =  date).closest_defender_shooting()
        open_info['OPEN'] = open_info['CLOSE_DEF_DIST_RANGE'].map(lambda x: True if 'Open' in x else False)

        temp_df['OPEN_SHOTS'] = open_info.loc[open_info['OPEN'] == True, 'FGA'].sum()
        temp_df['OPEN_EFG']= (open_info.loc[open_info['OPEN']== True, 'FGM'].sum() + (.5 * open_info.loc[open_info['OPEN']== True, 'FG3M'].sum()))/(open_info.loc[open_info['OPEN']== True, 'FGA'].sum())
        temp_df['COVERED_EFG']= (open_info.loc[open_info['OPEN']== False, 'FGM'].sum() + (.5 * open_info.loc[open_info['OPEN']== False, 'FG3M'].sum()))/(open_info.loc[open_info['OPEN']== False, 'FGA'].sum())

        ##append this to our bigger dataframe
        df_all = df_all.append(temp_df)

    df_boxscore =  pd.merge(df_game_logs, df_all[['PASS', 'FG2M', 'FG2_PCT', 'OPEN_SHOTS', 'OPEN_EFG', 'COVERED_EFG']], how = 'left', left_on = df_game_logs['GAME_DATE'], right_on = df_all['GAME_DATE'])
    df_boxscore['PASS_AST'] = df_boxscore['PASS'] /  df_boxscore['AST']
    df_boxscore['RESULT'] = df_boxscore['WL'].map(lambda x: 1 if 'W' in x else 0 )

    return df_boxscore

Let's see if this worked:

df_knicks_box_scores = custom_boxscore(knicks_id)


df_knicks_box_scores.head(10)
Team_ID Game_ID GAME_DATE MATCHUP WL W L W_PCT MIN FGM ... PTS DAYS_REST PASS FG2M FG2_PCT OPEN_SHOTS OPEN_EFG COVERED_EFG PASS/ASSIST RESULT
0 1610612752 0021600845 2017-02-15 NYK @ OKC L 23 34 0.404 240 41 ... 105 3.0 339.0 26.0 4.070 38.0 0.500000 0.572917 17.842105 0
1 1610612752 0021600817 2017-02-12 NYK vs. SAS W 23 33 0.411 240 34 ... 94 2.0 261.0 22.0 4.882 28.0 0.750000 0.437500 14.500000 1
2 1610612752 0021600800 2017-02-10 NYK vs. DEN L 22 33 0.400 240 52 ... 123 2.0 313.0 31.0 5.733 46.0 0.652174 0.638298 8.694444 0
3 1610612752 0021600791 2017-02-08 NYK vs. LAC L 22 32 0.407 240 46 ... 115 2.0 336.0 36.0 4.981 47.0 0.542553 0.544444 13.440000 0
4 1610612752 0021600768 2017-02-06 NYK vs. LAL L 22 31 0.415 240 37 ... 107 2.0 316.0 30.0 4.501 35.0 0.571429 0.445652 19.750000 0
5 1610612752 0021600759 2017-02-04 NYK vs. CLE L 22 30 0.423 240 39 ... 104 3.0 308.0 21.0 3.457 46.0 0.510870 0.486842 13.391304 0
6 1610612752 0021600733 2017-02-01 NYK @ BKN W 22 29 0.431 240 35 ... 95 1.0 305.0 23.0 4.150 37.0 0.297297 0.435484 13.260870 1
7 1610612752 0021600724 2017-01-31 NYK @ WAS L 21 29 0.420 240 34 ... 101 2.0 293.0 24.0 5.245 41.0 0.317073 0.460784 16.277778 0
8 1610612752 0021600711 2017-01-29 NYK @ ATL L 21 28 0.429 340 51 ... 139 2.0 479.0 32.0 4.137 64.0 0.437500 0.500000 14.968750 0
9 1610612752 0021600699 2017-01-27 NYK vs. CHA W 21 27 0.438 240 46 ... 110 2.0 350.0 31.0 6.167 37.0 0.594595 0.483051 15.909091 1

10 rows × 36 columns

I'm going to throw in a safeguard against divide-by-zero errors, just in case. This is a really janky, ugly fix, but it'll get the job done for the time being:
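(A tidier alternative, sketched here rather than taken from the post, would be to factor the guard into a tiny helper and call it for both the open and covered subsets:)

```python
def safe_efg(fgm, fg3m, fga):
    """Effective field goal %, (FGM + 0.5 * 3PM) / FGA, returning 0 when FGA is 0."""
    return (fgm + 0.5 * fg3m) / fga if fga > 0 else 0.0

print(safe_efg(23, 6, 52))  # 0.5
print(safe_efg(0, 0, 0))    # 0.0
```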

def custom_boxscore(roster_id):
    game_logs  = team.TeamGameLogs(roster_id)

    df_game_logs = game_logs.info()

    df_game_logs['GAME_DATE'] =  pd.to_datetime(df_game_logs['GAME_DATE'])
    df_game_logs['days_rest'] =  df_game_logs['GAME_DATE'] - df_game_logs['GAME_DATE'].shift(-1)
    df_game_logs['days_rest'] =  df_game_logs['days_rest'].astype('timedelta64[D]')

    ##Just like before, that should get us the gamelogs we need and the rest days column

    ##Now to loop through the list of dates for our other stats

    ##Build up a  dataframe of our custom stats and join that to the gamelogs instead of joining  each individual row

    df_all = pd.DataFrame()  ##blank dataframe

    dates = df_game_logs['GAME_DATE']

    for date in dates:

        game_info = team.TeamPassTracking(roster_id,  date_from=date, date_to=date).passes_made()
        game_info['GAME_DATE'] = date ## We need to append the date to this so we can  join back

        temp_df = game_info.groupby(['GAME_DATE']).sum()
        temp_df.reset_index(level =  0,  inplace =  True)

        ##now to get the shot info. For the most part, we're just reusing code we've already written
        open_info =  team.TeamShotTracking(roster_id,date_from =date,  date_to =  date).closest_defender_shooting()
        open_info['OPEN'] = open_info['CLOSE_DEF_DIST_RANGE'].map(lambda x: True if 'Open' in x else False)

        temp_df['OPEN_SHOTS'] = open_info.loc[open_info['OPEN'] == True, 'FGA'].sum()
        temp_df['COVERED_SHOTS'] = open_info.loc[open_info['OPEN'] == False, 'FGA'].sum()

        if open_info.loc[open_info['OPEN'] == True, 'FGA'].sum() > 0:
            temp_df['OPEN_EFG'] = (open_info.loc[open_info['OPEN'] == True, 'FGM'].sum() + (.5 * open_info.loc[open_info['OPEN'] == True, 'FG3M'].sum())) / (open_info.loc[open_info['OPEN'] == True, 'FGA'].sum())
        else:
            temp_df['OPEN_EFG'] = 0

        if open_info.loc[open_info['OPEN'] == False, 'FGA'].sum() > 0:
            temp_df['COVERED_EFG'] = (open_info.loc[open_info['OPEN'] == False, 'FGM'].sum() + (.5 * open_info.loc[open_info['OPEN'] == False, 'FG3M'].sum())) / (open_info.loc[open_info['OPEN'] == False, 'FGA'].sum())
        else:
            temp_df['COVERED_EFG'] = 0

        ##append this to our bigger dataframe
        df_all = df_all.append(temp_df)

    df_boxscore = pd.merge(df_game_logs, df_all[['PASS', 'FG2M', 'FG2_PCT', 'OPEN_SHOTS', 'COVERED_SHOTS', 'OPEN_EFG', 'COVERED_EFG']], how='left', left_on=df_game_logs['GAME_DATE'], right_on=df_all['GAME_DATE'])
    df_boxscore['PASS_AST'] = df_boxscore['PASS'] / df_boxscore['AST']
    df_boxscore['RESULT'] = df_boxscore['WL'].map(lambda x: 1 if 'W' in x else 0)

    return df_boxscore


df_knicks_box_scores = custom_boxscore(knicks_id)

Awesome! Looks like everything came out okay. With a team_id, we can do this for every team. We just need to get a list of team IDs and team names.

From the documentation:

http://nba-py.readthedocs.io/en/0.1a2/nba_py/

import nba_py

teams =  nba_py.Scoreboard()
df_teams = pd.concat([teams.east_conf_standings_by_day()[['TEAM','TEAM_ID']], teams.west_conf_standings_by_day()[['TEAM','TEAM_ID']]])


df_teams.head()
TEAM TEAM_ID
0 Boston 1610612738
1 Cleveland 1610612739
2 Toronto 1610612761
3 Washington 1610612764
4 Milwaukee 1610612749

Now we can pass in the team IDs to create custom boxscores for all teams.

from nba_py import team
import time
def game_summary_teams(roster_ids, team_names):
    print roster_ids, team_names

    for i in range (0, len(roster_ids)):
        time.sleep(35)
        print roster_ids[i]
        print team_names[i]

        info = team.TeamGameLogs(roster_ids[i])
        df = info.info()
        df['GAME_DATE'] = pd.to_datetime(df['GAME_DATE'])
        df['DAYS_REST'] =   df['GAME_DATE'] - df['GAME_DATE'].shift(-1) ##this gives us our days rest column
        df['DAYS_REST']= df['DAYS_REST'].astype('timedelta64[D]')

        vals  = df['GAME_DATE']

        df_passes =  pd.DataFrame()
        for v in vals:
            time.sleep(.5)
            ##these values are  being over-written each time
            print 'GAME FOR ' + team_names[i] + ' for ' + str(v)
            game_info = team.TeamPassTracking(roster_ids[i], date_from =v, date_to = v).passes_made()
            time.sleep(5)

            game_info['EVENT_DATE'] = v
            df_sum = game_info.groupby(['EVENT_DATE']).sum()
            df_sum.reset_index(level = 0, inplace =  True)

            open_info = team.TeamShotTracking(roster_ids[i], date_from =v, date_to = v).closest_defender_shooting()
            time.sleep(5)
            open_info['OPEN'] = open_info['CLOSE_DEF_DIST_RANGE'].map(lambda  x: True if 'Open' in x else False)
            df_sum['OPEN_SHOTS'] = open_info.loc[open_info['OPEN']== True, 'FGA'].sum()
            df_sum['COVERED_SHOTS'] = open_info.loc[open_info['OPEN']== False, 'FGA'].sum()

            ##gotta figure out a better divide by 0 fix.
            if (open_info.loc[open_info['OPEN']== True, 'FGA'].sum() > 0):
                df_sum['OPEN_EFG']= (open_info.loc[open_info['OPEN']== True, 'FGM'].sum() + (.5 * open_info.loc[open_info['OPEN']== True, 'FG3M'].sum()))/(open_info.loc[open_info['OPEN']== True, 'FGA'].sum())
            else:
                df_sum['OPEN_EFG'] = 0

            if (open_info.loc[open_info['OPEN']== False, 'FGA'].sum() > 0):
                df_sum['COVERED_EFG']= (open_info.loc[open_info['OPEN']== False, 'FGM'].sum() + (.5 * open_info.loc[open_info['OPEN']== False, 'FG3M'].sum()))/(open_info.loc[open_info['OPEN']== False, 'FGA'].sum())
            else:
                df_sum['COVERED_EFG']=0
            df_passes = df_passes.append(df_sum)

        ##Merge once per team, after the inner loop has collected every game
        info = pd.merge(df, df_passes[['PASS', 'FG2M', 'FG2_PCT', 'OPEN_SHOTS', 'OPEN_EFG', 'COVERED_SHOTS', 'COVERED_EFG']], how='left', left_on=df['GAME_DATE'], right_on=df_passes['EVENT_DATE'])
        info['PASS_AST'] = info['PASS'] / info['AST']
        info['RESULT'] = info['WL'].map(lambda x: 1 if 'W' in x else 0)

        ##Write one CSV per team, with the extension the read_csv calls below expect
        file_name = team_names[i] + '.csv'
        info.to_csv(file_name, index=False)
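One piece worth sanity-checking in isolation is the days-rest calculation: `shift(-1)` works because the game logs come back newest-first. A toy version, using `.dt.days`, which gives the same numbers as the `astype('timedelta64[D]')` cast and also works on newer pandas versions:

```python
import pandas as pd

# Game logs arrive newest-first, so each row's previous game is the row below it.
df = pd.DataFrame({'GAME_DATE': pd.to_datetime(['2017-02-15', '2017-02-12', '2017-02-10'])})
df['DAYS_REST'] = (df['GAME_DATE'] - df['GAME_DATE'].shift(-1)).dt.days
# The oldest game has no previous game, so its DAYS_REST ends up NaN.
```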


teams = df_teams['TEAM']
roster_ids = df_teams['TEAM_ID']

Just feed these two arrays into the function and we should be good to go.

I went ahead and did this for a few teams. The NBA's website cuts you off if you make too many requests too quickly (hence all the sleep statements above).
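Instead of fixed sleeps everywhere, a retry wrapper with exponential backoff is a more forgiving way to handle rate limits. A sketch (the `with_retries` helper is mine, not part of nba_py):

```python
import time

def with_retries(fn, attempts=3, base_delay=1.0):
    """Call fn(); on failure, wait base_delay, 2*base_delay, 4*base_delay, ... and retry."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries, surface the error
            time.sleep(base_delay * 2 ** attempt)
```

You could then wrap each request, e.g. `with_retries(lambda: team.TeamPassTracking(roster_id, date_from=date, date_to=date).passes_made())`.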

After fiddling with it for a while, I was finally able to get the data for each team. You might have to run the code above piece by piece, or just use the CSVs here:

Visualization

Let's see if we can visually represent anything about each team's offense:

These visualizations are going to be done in Plotly because I think it's the best visualization library out there for quickly and easily making graphs that are both visually appealing and interactive, but feel free to use something else (although I can't imagine why you would).

This is going to break my heart a bit... but let's compare some of the stats we fetched between the Knicks and teams that aren't the Knicks.

Something about not counting another man's money right?

knicks = pd.read_csv('New York.csv')
spurs =  pd.read_csv('San Antonio.csv')
warriors = pd.read_csv('Golden State.csv')
thunder = pd.read_csv('Oklahoma City.csv')
celtics =  pd.read_csv('Boston.csv')


import plotly.plotly as py
import plotly.graph_objs as go

trace0 = go.Box(
    y=knicks['PASS'],
    name='Knicks',
    boxmean='sd'
)
trace1 = go.Box(
    y=spurs['PASS'],
    name='Spurs',
    boxmean='sd'
)
trace2 = go.Box(
    y=warriors['PASS'],
    name='Warriors',
    boxmean='sd'
)
trace3 = go.Box(
    y=thunder['PASS'],
    name='Thunder',
    boxmean='sd'
)
trace4 = go.Box(
    y=celtics['PASS'],
    name='Celtics',
    boxmean='sd'
)

layout = go.Layout(
    title='Passing Box Plot',
)
data = [trace0, trace1, trace2, trace3, trace4]

fig = go.Figure(data=data, layout=layout)
py.iplot(fig)

Ignoring the outlier from the 4OT Knicks-Hawks game, this graph is pretty telling. Obviously it isn't the full story, but it looks like the Spurs and Thunder play pretty consistent but different offenses. What's really interesting is that despite the Spurs and Warriors having coaches and systems that emphasize ball movement, they throw FEWER passes than a team like the Knicks.

Let's look at if those passes translate to assists:

import plotly.plotly as py
import plotly.graph_objs as go

trace0 = go.Box(
    y=knicks['PASS_AST'],
    name='Knicks',
    boxmean='sd'
)
trace1 = go.Box(
    y=spurs['PASS_AST'],
    name='Spurs',
    boxmean='sd'
)
trace2 = go.Box(
    y=warriors['PASS_AST'],
    name='Warriors',
    boxmean='sd'
)
trace3 = go.Box(
    y=thunder['PASS_AST'],
    name='Thunder',
    boxmean='sd'
)
trace4 = go.Box(
    y=celtics['PASS_AST'],
    name='Celtics',
    boxmean='sd'
)
data = [trace0, trace1, trace2, trace3, trace4]

layout = go.Layout(
    title='Passes per Assist',
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig)

These two graphs in conjunction are pretty telling. On average, it takes the Knicks almost 2 more passes than the Spurs, and 5 more than the Warriors, to get an assist.

Going by this graph, roughly every 10th pass the Warriors make results in an assist. From the previous graph, we see that they make an average of 313 passes a game, which almost lines up with their season average of roughly 31 assists/game.

The standard deviation of the above graph can be interpreted in a few ways. On one hand, it's a loose metric of playstyle consistency; teams that play the same way through the entire game are probably going to have a lower standard deviation than teams who pass the ball for 3 quarters and forget to in the 4th (cough cough, New York, cough cough). On the other hand, teams might have different playstyles depending on the lineups they have on the floor, resulting in a higher standard deviation (Spurs).

The Thunder probably fall into this category too, most likely because Russell Westbrook averages over 10 of the team's roughly 20 assists per game.

Obviously, there's a lot more to the story. How many passes led to FTs? Is there any correlation between passes per assist and wins? If anything, stats like these tell you more about what kind of offense a team runs, not how effectively they run it.

Now let's see if there's any noticeable difference in wins vs losses:

import plotly.plotly as py
import plotly.graph_objs as go

trace0 = go.Box(
    y=knicks.loc[knicks['WL'] == 'W']['PASS_AST'],
    name='Knicks Wins',
    boxmean='sd'
)
trace1 = go.Box(
    y=knicks.loc[knicks['WL'] == 'L']['PASS_AST'],
    name='Knicks Loss',
    boxmean='sd'
)
trace2 = go.Box(
    y=spurs.loc[spurs['WL'] == 'W']['PASS_AST'],
    name='Spurs Wins',
    boxmean='sd'
)
trace3 = go.Box(
    y=spurs.loc[spurs['WL'] == 'L']['PASS_AST'],
    name='Spurs Loss',
    boxmean='sd'

)
trace4 = go.Box(
    y=warriors.loc[warriors['WL'] == 'W']['PASS_AST'],
    name='Warriors Wins',
    boxmean='sd'
)
trace5 = go.Box(
    y=warriors.loc[warriors['WL'] == 'L']['PASS_AST'],
    name='Warriors Losses',
    boxmean='sd'
)
trace6 = go.Box(
    y=thunder.loc[thunder['WL'] == 'W']['PASS_AST'],
    name='Thunder Wins',
    boxmean='sd'
)
trace7 = go.Box(
    y=thunder.loc[thunder['WL'] == 'L']['PASS_AST'],
    name='Thunder Losses',
    boxmean='sd'
)
trace8 = go.Box(
    y=celtics.loc[celtics['WL'] == 'W']['PASS_AST'],
    name='Celtics Wins',
    boxmean='sd'
)
trace9 = go.Box(
    y=celtics.loc[celtics['WL'] == 'L']['PASS_AST'],
    name='Celtics Lossses',
    boxmean='sd'
)
layout = go.Layout(
    title='Passes per Assist in Wins vs Losses',
)
data = [trace0, trace1, trace2, trace3, trace4, trace5, trace6, trace7, trace8, trace9]
fig = go.Figure(data=data, layout=layout)
py.iplot(fig)

Apart from the Celtics, every team had to make at least 1 more pass to get an assist in games they lost compared to games they won. In a way, it's almost like they had to "work harder" for assists.

From the looks of this graph, the Warriors offense, when it's firing on all cylinders, is in a league of its own.

Just to reiterate: the purpose of the visualizations above is to ask questions, not answer them.

But now, let's see if we can get any team specific insights from any of this:

The Clippers have played without Chris Paul and Blake Griffin, two of the best passers at their position in the league.

Do the boxscores show how their offense has had to adjust?

clippers  = pd.read_csv('LA Clippers.csv')
clippers = clippers.sort_values(['GAME_DATE'], ascending=True)
clippers.head()
Unnamed: 0 Team_ID Game_ID GAME_DATE MATCHUP WL W L W_PCT MIN ... DAYS_REST PASS FG2M FG2_PCT OPEN_SHOTS COVERED_SHOTS OPEN_EFG COVERED_EFG PASS_AST RESULT
77 53.0 1610612746 21600017 2016-10-27 LAC @ POR W 1 0 1.00 240 ... NaN 301.0 21.0 4.047 41.0 50.0 0.463415 0.440000 25.083333 1.0
76 52.0 1610612746 21600035 2016-10-30 LAC vs. UTA W 2 0 1.00 240 ... 3.0 275.0 22.0 4.757 34.0 48.0 0.558824 0.375000 16.176471 1.0
75 51.0 1610612746 21600045 2016-10-31 LAC vs. PHX W 3 0 1.00 240 ... 1.0 276.0 28.0 4.795 38.0 42.0 0.565789 0.511905 13.142857 1.0
74 50.0 1610612746 21600064 2016-11-02 LAC vs. OKC L 3 1 0.75 240 ... 2.0 302.0 24.0 2.893 33.0 54.0 0.424242 0.435185 13.727273 0.0
73 49.0 1610612746 21600074 2016-11-04 LAC @ MEM W 4 1 0.80 240 ... 2.0 300.0 17.0 2.681 41.0 44.0 0.451220 0.397727 15.789474 1.0

5 rows × 38 columns

clippers_rolling = clippers[['GAME_DATE','PTS','TOV','PASS',  'OPEN_SHOTS', 'OPEN_EFG', 'AST',  'PASS_AST']].rolling(5).mean()


clippers_rolling.head(10)
GAME_DATE PTS TOV PASS OPEN_SHOTS OPEN_EFG AST PASS_AST
77 2016-10-27 NaN NaN NaN NaN NaN NaN NaN
76 2016-10-30 NaN NaN NaN NaN NaN NaN NaN
75 2016-10-31 NaN NaN NaN NaN NaN NaN NaN
74 2016-11-02 NaN NaN NaN NaN NaN NaN NaN
73 2016-11-04 100.0 12.8 290.8 37.4 0.492698 18.2 16.783881
72 2016-11-05 100.4 13.0 293.6 37.4 0.514649 20.6 14.392215
71 2016-11-07 105.6 12.4 302.4 39.2 0.544745 22.8 13.435492
70 2016-11-09 104.6 10.6 303.0 38.8 0.537143 23.4 13.131921
69 2016-11-11 110.0 9.2 308.8 41.2 0.563405 22.6 14.064244
68 2016-11-12 114.0 9.8 306.6 40.8 0.591110 23.8 13.218349
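The leading NaNs are expected: `rolling(5)` won't emit a value until it has five observations in the window. A toy illustration:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5, 6])
r = s.rolling(5).mean()
# The first four entries are NaN; the fifth is mean(1..5) = 3.0 and the sixth is mean(2..6) = 4.0.
```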

If we want to see all of this on the same graph, we need to normalize it. This means we're going to scale each value by subtracting the mean and dividing by the standard deviation.

This is a bit of a janky way to do so because it relies on the columns being in the same order.

clippers_norm = clippers[['GAME_DATE', 'PTS','TOV','PASS',  'OPEN_SHOTS', 'OPEN_EFG', 'AST',  'PASS_AST']]
for c in clippers_norm.columns[1:]:
    clippers_rolling[c] = (clippers_rolling[c] - clippers_norm[c].mean())/clippers_norm[c].std()


clippers_rolling.head(10)
GAME_DATE PTS TOV PASS OPEN_SHOTS OPEN_EFG AST PASS_AST
77 2016-10-27 NaN NaN NaN NaN NaN NaN NaN
76 2016-10-30 NaN NaN NaN NaN NaN NaN NaN
75 2016-10-31 NaN NaN NaN NaN NaN NaN NaN
74 2016-11-02 NaN NaN NaN NaN NaN NaN NaN
73 2016-11-04 -0.705388 0.033424 -0.403622 -0.504550 -0.714018 -0.878286 0.739290
72 2016-11-05 -0.672092 0.088895 -0.284434 -0.504550 -0.480817 -0.376102 0.083865
71 2016-11-07 -0.239256 -0.077516 0.090155 -0.202852 -0.161093 0.084234 -0.178321
70 2016-11-09 -0.322493 -0.576750 0.115695 -0.269896 -0.241857 0.209780 -0.261513
69 2016-11-11 0.126991 -0.965042 0.362583 0.132369 0.037146 0.042385 -0.006014
68 2016-11-12 0.459943 -0.798631 0.268936 0.065325 0.331470 0.293477 -0.237828
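As a sanity check, here's the same z-score scaling on a toy series: after normalizing, the series has mean 0 and standard deviation 1.

```python
import pandas as pd

s = pd.Series([10.0, 12.0, 14.0, 16.0, 18.0])
z = (s - s.mean()) / s.std()  # pandas' std() is the sample standard deviation (ddof=1)
```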
import plotly.plotly as py
import plotly.graph_objs as go

trace = go.Scatter(
    x = clippers_rolling['GAME_DATE'],
    y = clippers_rolling['PASS'],
    name = 'Passes'  ## not included in `data` below; add it there to plot raw passes too
)
trace1 = go.Scatter(
    x = clippers_rolling['GAME_DATE'],
    y = clippers_rolling['PTS'],
    name = 'Points'
)
trace2 = go.Scatter(
    x = clippers_rolling['GAME_DATE'],
    y = clippers_rolling['AST'],
    name = 'Assists'
)
trace3 = go.Scatter(
    x = clippers_rolling['GAME_DATE'],
    y = clippers_rolling['PASS_AST'],
    name = 'PASSES PER ASSIST'
)
data = [trace1,trace2 ,trace3]


layout = go.Layout(
    title  = 'Clippers Offense',
    showlegend=True,
    annotations=[
        dict(
            x='2016-12-20',
            y=-1,
            xref='x',
            yref='y',
            text='Blake Griffin Injury',
            showarrow=True,
            arrowhead=7,
            ax=0,
            ay=-330
        ),
          dict(
            x='2017-01-19',
            y=-1,
            xref='x',
            yref='y',
            text='Chris Paul Injury',
            showarrow=True,
            arrowhead=7,
            ax=0,
            ay=-275
        ),
        dict(
            x='2017-01-24',
            y=-1,
            xref='x',
            yref='y',
            text='Blake Griffin Returns',
            showarrow=True,
            arrowhead=7,
            ax=0,
            ay=-330
        ),
         dict(
            x='2017-02-24',
            y=-1,
            xref='x',
            yref='y',
            text='Chris Paul Returns',
            showarrow=True,
            arrowhead=7,
            ax=0,
            ay=-330
        )
    ]
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig)

According to this, Blake's absence definitely had an effect on how the team runs its offense. Passes per assist went up just as total points went down after Blake's injury.

From the looks of it, the Clippers were starting to adjust to Blake being out just as CP3 got hurt, but for the most part, there's too much noise here.

Blake gets injured almost every year, so I wonder how this graph looks for 2015-2016 data.....

This is just the tip of the iceberg. Exploratory analysis like this is about finding interesting questions, not answering them.

Building the Pipe:

Now that we know what we can do with the data, let's get everything piping into a database.

If you see yourself doing some of your own analysis, or just want more experience moving and cleaning data, this section is a high-level overview of how to do that:

I stored the output from the previous function in CSVs in the same directory as this notebook so they can be easily imported.

Pandas has a built-in to_sql function that works with SQLAlchemy:

http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_sql.html

import glob, os
import time
os.chdir("YOUR_DIRECTORY_HERE")
files = glob.glob("*.csv")


from sqlalchemy import create_engine

##Create the engine once and reuse it for every upload
eng = create_engine("YOUR_CONNECTION_STRING_HERE")

for f in files:
    team_name = str(f)
    team_name = team_name.partition('.csv')[0].replace(' ', '_').replace('.', '').lower()

    df = pd.read_csv(f)
    df.to_sql(con=eng, index=False, name=team_name, schema='nba', if_exists='append')
    time.sleep(10)

The Python SQLAlchemy package supports most databases, so refer to its documentation for each database's connection string:

http://www.sqlalchemy.org/
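The filename-to-table-name chain (`partition`/`replace`/`lower`) is easy to get subtly wrong, so it helps to isolate it in a function you can test. (The `table_name` name is mine.)

```python
def table_name(path):
    """'LA Clippers.csv' -> 'la_clippers': drop the extension, then spaces and dots."""
    return path.partition('.csv')[0].replace(' ', '_').replace('.', '').lower()
```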

What would really make analysis easier is if the information updated after every game. Let's use what we have to update one of the CSVs, and then do the same for all the other files:

import numpy as np  ##needed for the np.isnan check below

def update_info(file):

    file_name = str(file)
    name = file_name.partition('.csv')[0].replace(' ', '_').replace('.', '').lower()
    old_info = pd.read_csv(file_name)

    team_id = old_info['Team_ID'].max()
    new_logs = team.TeamGameLogs(team_id).info()

    old_info['GAME_DATE'] =  pd.to_datetime(old_info['GAME_DATE'])
    order = old_info.columns
    new_logs['GAME_DATE'] = pd.to_datetime(new_logs['GAME_DATE'])

    ## If there's no new  games for the team, return:
    if max(new_logs['GAME_DATE']) ==  max(old_info['GAME_DATE']):
        return

    new_logs['DAYS_REST']= new_logs['GAME_DATE'] - new_logs['GAME_DATE'].shift(-1) ##this gives us our days rest column
    new_logs['DAYS_REST']= new_logs['DAYS_REST'].astype('timedelta64[D]')

    ##keeping datatypes consistent
    new_logs['Game_ID'] = new_logs['Game_ID'].astype(str).astype(int)

    ##Combine the previously saved CSV with the new game logs
    info = pd.concat([old_info, new_logs], ignore_index = True)

    ##Drop the duplicates
    info = info.drop_duplicates(['Game_ID'], keep = 'first')

    ## Sort by date, newest first
    info = info.sort_values(['GAME_DATE'], ascending = False)

    ##Reset the axis
    info =  info.reset_index(drop = True)

    ##Find the dates where there's  no values for any of the stats we fetched. We can make requests for only those dates
    updates = info.loc[np.isnan(info['COVERED_EFG'])]

    ##If the team's boxscore is up to date, return.
    if len(updates) == 0:
        return
    dates  =  updates['GAME_DATE']

    df_passes =  pd.DataFrame()
    for d in dates:
        ##All exactly the same as before

        game_info = team.TeamPassTracking(team_id, date_from =d, date_to = d).passes_made()
        game_info['EVENT_DATE'] = d

        df_sum = game_info.groupby(['EVENT_DATE']).sum()
        df_sum.reset_index(level = 0, inplace =  True)

        open_info = team.TeamShotTracking(team_id, date_from =d, date_to = d).closest_defender_shooting()

        open_info['OPEN'] = open_info['CLOSE_DEF_DIST_RANGE'].map(lambda  x: True if 'Open' in x else False)
        df_sum['OPEN_SHOTS'] = open_info.loc[open_info['OPEN']== True, 'FGA'].sum()
        df_sum['COVERED_SHOTS'] = open_info.loc[open_info['OPEN']== False, 'FGA'].sum()

        if (open_info.loc[open_info['OPEN']== True, 'FGA'].sum() > 0):
            df_sum['OPEN_EFG']= (open_info.loc[open_info['OPEN']== True, 'FGM'].sum() + (.5 * open_info.loc[open_info['OPEN']== True, 'FG3M'].sum()))/(open_info.loc[open_info['OPEN']== True, 'FGA'].sum())
        else:
            df_sum['OPEN_EFG'] = 0

        if (open_info.loc[open_info['OPEN']== False, 'FGA'].sum() > 0):
            df_sum['COVERED_EFG']= (open_info.loc[open_info['OPEN']== False, 'FGM'].sum() + (.5 * open_info.loc[open_info['OPEN']== False, 'FG3M'].sum()))/(open_info.loc[open_info['OPEN']== False, 'FGA'].sum())
        else:
            df_sum['COVERED_EFG']=0

        df_passes = df_passes.append(df_sum)

    df_passes = df_passes.reset_index(drop = True)

    ##Join the new stats with the old information.
    info.update(df_passes[['PASS', 'FG2M', 'FG2_PCT', 'OPEN_SHOTS', 'OPEN_EFG', 'COVERED_EFG','COVERED_SHOTS']])

    ##Calculate these two post join in case there were any stat corrections
    info['PASS_AST'] = info['PASS'] /  info['AST']
    info['RESULT'] = info['WL'].map(lambda x: 1 if 'W' in x else 0 )

    ##Reorder the columns in the dataframe
    info = info[order]

    ##Save to csv
    info.to_csv(file_name, index = False)

    ##Upload and replace the information already there. Since we're working with a relatively
    ##small volume of data, upserts aren't worth the time. (`eng` is the SQLAlchemy engine created earlier.)
    info.to_sql(con = eng, index = False, name = name, schema = 'nba', if_exists = 'replace')

    print name

    return info
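The concat-then-drop_duplicates step at the heart of update_info looks like this on toy frames (made-up Game_IDs):

```python
import pandas as pd

old = pd.DataFrame({'Game_ID': [1, 2], 'PTS': [100, 95]})
new = pd.DataFrame({'Game_ID': [2, 3], 'PTS': [95, 110]})

# Stack old and new, then keep only the first copy of any repeated Game_ID.
merged = pd.concat([old, new], ignore_index=True)
merged = merged.drop_duplicates(['Game_ID'], keep='first').reset_index(drop=True)
```

`keep='first'` means the previously saved rows win, which is what we want since the new logs are only there to fill in games we haven't seen yet.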

You can put the script above in a separate .py file in the same directory as all of the team CSVs and schedule it with crontab to run automatically, so your database and CSVs will always contain the latest information (assuming stats.nba.com doesn't change anything on their end).

Here's a great tutorial on how to do so: https://www.youtube.com/watch?v=hDJ3XQzW8nk
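For example, a crontab entry along these lines would run the updater every morning at 6am (the paths and the `update_boxscores.py` filename are placeholders for wherever your script lives):

```shell
# m h dom mon dow  command
0 6 * * * cd /path/to/csv_directory && python update_boxscores.py >> update.log 2>&1
```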

Wrapping it all up

That should be everything you need to get started.

If this was interesting to you and you want to test what you learned, here are a few exercises (in roughly increasing difficulty):

1) The current boxscores only have a team_id in each table. Find a way to insert a column for a team name.

2) Throw in some data about contested/uncontested rebounding. This might be interesting when looking at different factors that contribute to wins.

3) There's no player specific data in any of this. Try throwing in a column in each team's boxscores with each game's leading scorer.

4) Compare data across seasons! Is it possible to visualize changes to the Thunder offense after KD left? How different is Tom Thibodeau's offense from Sam Mitchell's scheme in Minnesota?

5) Do some data science! I'd love to see what sort of interesting models can be drummed up using this infrastructure.

Thanks for reading! Send your suggestions, solutions, or anything else to viraj@astronomer.io


