About Keshav: Keshav Ramaswamy is an aspiring data science engineer and a grad student at UF. He advocates the role of open-source in aiding future technology and believes in contributing to open-source projects. Contact him at email@example.com.
What's a Survival Curve?
Survival analysis is based on the survival function. The survival function is the probability that the time of death, T, is later than some specified time t.
S(t) = Pr(T > t)
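As a rough illustration of this definition, S(t) can be estimated from a sample as the fraction of durations that exceed t. The sketch below uses made-up career lengths and ignores censoring entirely; the name `survival_function` is only illustrative.

```python
def survival_function(durations, t):
    """Empirical S(t): the fraction of durations strictly greater than t."""
    return sum(1 for d in durations if d > t) / len(durations)


# Hypothetical career lengths in years (illustrative values, not real data).
careers = [1, 2, 2, 5, 6, 8, 10, 12, 16, 24]

s5 = survival_function(careers, 5)    # 6 of the 10 careers exceed 5 years
s10 = survival_function(careers, 10)  # 3 of the 10 careers exceed 10 years
```

Because S(t) counts only durations strictly greater than t, the curve starts at 1 and can only stay flat or fall as t grows.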
Survival analysis is used wherever the time until an event (a "death") is analysed across a sample of observations. It is applied in mechanical engineering to predict system failures and in medical sciences to predict patient outcomes.
In this post I'll be using survival analysis for a more lighthearted application: analyzing the career lengths of cricket players.
Survival Analysis in Cricket
This is an attempt to extend this statistical concept to cricket by analyzing the career lengths of players. I wasn't able to analyze Test cricket (the highest standard of the sport) because the player career data is too noisy, owing to the World Wars, the Apartheid crisis, Kerry Packer's cricket series, etc. Instead, I've analyzed all players who have played ODI cricket. In this analysis, the event of death is a player's retirement.
There isn't any readily available cricket data (yet). ESPNCricinfo still doesn't provide an API for its Statsguru database, so I had to scrape the data from the Statsguru webpages.
Here's the scraped Statsguru URL.
The following method scrapes the required data from the webpages.
Once the data is scraped, it has to be cleaned: whitespace and other noise are stripped to get it into a proper structure.
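The kind of parsing and cleaning described here can be sketched with the standard library alone. This is only a stand-in under stated assumptions: the HTML snippet is made up to resemble a stats table, and the `TableRowParser` class is hypothetical, not the code used for this post.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of every <td> cell, one list of cells per <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside <tr>
        self._cell = None  # text fragments of the cell being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            # Join the fragments and collapse every run of whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row = None


# Made-up snippet with the stray whitespace and nested links a scraper sees.
SAMPLE_HTML = """
<table>
  <tr><td> <a href="#">SR Tendulkar</a> (India) </td><td> 1989-2012 </td><td>463</td></tr>
  <tr><td>
     RT Ponting (Aus/ICC)</td><td>1995-2012 </td><td> 375 </td></tr>
</table>
"""

parser = TableRowParser()
parser.feed(SAMPLE_HTML)
parser.close()
rows = parser.rows
```

Each entry of `rows` is then a list of clean strings that can be handed to a dataframe constructor.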
Now I need to transform the data into a structure that will be easy to work with. Here's how I created a dataframe fit for modelling.
I store the scraped data in a pandas dataframe. Since the table is updated continuously, I saved the data as scraped on Oct 2 to a CSV file for consistency.
There are 2246 players who have played ODI cricket since its inception in the 1970s. The data is sorted by default by the number of runs scored.
| | Player | Span | Mat | Inns | NO | Runs | HS | Ave | BF | SR | 100 | 50 | 0 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | SR Tendulkar (India) | 1989-2012 | 463 | 452 | 41 | 18426 | 200* | 44.83 | 21367 | 86.23 | 49 | 96 | 20 |
| 1 | KC Sangakkara (Asia/ICC/SL) | 2000-2015 | 404 | 380 | 41 | 14234 | 169 | 41.98 | 18048 | 78.86 | 25 | 93 | 15 |
| 2 | RT Ponting (Aus/ICC) | 1995-2012 | 375 | 365 | 39 | 13704 | 164 | 42.03 | 17046 | 80.39 | 30 | 82 | 20 |
| 3 | ST Jayasuriya (Asia/SL) | 1989-2011 | 445 | 433 | 18 | 13430 | 189 | 32.36 | 14725 | 91.2 | 28 | 68 | 34 |
| 4 | DPMD Jayawardene (Asia/SL) | 1998-2015 | 448 | 418 | 39 | 12650 | 144 | 33.37 | 16020 | 78.96 | 19 | 77 | 28 |
I renamed the columns to make them easier to work with.
Since the variable I'm after is the length of the player's career, I extract the name of the player and the span (duration of career) from the original dataframe.
| | Player | Span |
|---|---|---|
| 0 | SR Tendulkar (India) | 1989-2012 |
| 1 | KC Sangakkara (Asia/ICC/SL) | 2000-2015 |
| 2 | RT Ponting (Aus/ICC) | 1995-2012 |
| 3 | ST Jayasuriya (Asia/SL) | 1989-2011 |
| 4 | DPMD Jayawardene (Asia/SL) | 1998-2015 |
Now I want to create a couple of cohorts so I can compare survival curves to look for any insights. First, I create 'career start date' and 'career end date' columns.
| | Player | Span | Career start | Career end |
|---|---|---|---|---|
| 0 | SR Tendulkar (India) | 1989-2012 | 1989 | 2012 |
| 1 | KC Sangakkara (Asia/ICC/SL) | 2000-2015 | 2000 | 2015 |
| 2 | RT Ponting (Aus/ICC) | 1995-2012 | 1995 | 2012 |
| 3 | ST Jayasuriya (Asia/SL) | 1989-2011 | 1989 | 2011 |
| 4 | DPMD Jayawardene (Asia/SL) | 1998-2015 | 1998 | 2015 |
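Splitting the span into a start and an end year can be sketched as follows. The helper name `split_span` is hypothetical, and the sketch assumes every span has the `YYYY-YYYY` form seen in the rows above.

```python
def split_span(span):
    """Split a span such as '1989-2012' into integer (start, end) years."""
    start, end = span.strip().split("-")
    return int(start), int(end)


# The five spans shown in the table above.
spans = ["1989-2012", "2000-2015", "1995-2012", "1989-2011", "1998-2015"]
years = [split_span(s) for s in spans]
```

With pandas, the same idea would typically be applied column-wise rather than row by row.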
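Now I'll add another column, 'career_length', by subtracting the career start from the career end and adding 1, so that both the first and the last calendar year are counted (1989-2012 gives 24).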
| | Player | Span | Career start | Career end | Career length |
|---|---|---|---|---|---|
| 0 | SR Tendulkar (India) | 1989-2012 | 1989 | 2012 | 24 |
| 1 | KC Sangakkara (Asia/ICC/SL) | 2000-2015 | 2000 | 2015 | 16 |
| 2 | RT Ponting (Aus/ICC) | 1995-2012 | 1995 | 2012 | 18 |
| 3 | ST Jayasuriya (Asia/SL) | 1989-2011 | 1989 | 2011 | 23 |
| 4 | DPMD Jayawardene (Asia/SL) | 1998-2015 | 1998 | 2015 | 18 |
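The career lengths in the table count both the first and the last calendar year, so a span of 1989-2012 is 24 years rather than 23. A minimal sketch of that arithmetic (the function name is illustrative):

```python
def career_length(start_year, end_year):
    """Number of calendar years spanned, counting both endpoints."""
    return end_year - start_year + 1


# (start, end) pairs for the five players shown in the table above.
spans = [(1989, 2012), (2000, 2015), (1995, 2012), (1989, 2011), (1998, 2015)]
lengths = [career_length(start, end) for start, end in spans]
```

One consequence of this convention is that a player whose career starts and ends in the same year gets a length of 1, not 0.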
The country of the player has not been represented, though it has been included within the player's name. I think having the country as a column could be useful.
Certain players, however, are also listed under composite teams such as Asia or ICC. I want to remove these teams from the label and keep only the national side. Other players, like Kepler Wessels and Eoin Morgan, have played for more than one country. For the sake of simplicity, I'll ignore these players since they'll likely be outliers with longer career lengths.
I also keep only players belonging to the 'Full members' of the ICC, as the associate countries' data may be noisy and create outliers.
| | Player | Span | Career start | Career end | Career length | Name | Country |
|---|---|---|---|---|---|---|---|
| 0 | SR Tendulkar (India) | 1989-2012 | 1989 | 2012 | 24 | SR Tendulkar | India |
| 1 | KC Sangakkara (Asia/ICC/SL) | 2000-2015 | 2000 | 2015 | 16 | KC Sangakkara | SL |
| 2 | RT Ponting (Aus/ICC) | 1995-2012 | 1995 | 2012 | 18 | RT Ponting | Aus |
| 3 | ST Jayasuriya (Asia/SL) | 1989-2011 | 1989 | 2011 | 23 | ST Jayasuriya | SL |
| 4 | DPMD Jayawardene (Asia/SL) | 1998-2015 | 1998 | 2015 | 18 | DPMD Jayawardene | SL |
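In the table above, KC Sangakkara (Asia/ICC/SL) ends up with country SL, which suggests that composite sides are stripped from the label while the national side is kept. A sketch of that step is below. The helper names are hypothetical, the set of composite teams contains only the two named in this post (others would need adding), and the `E Morgan (Eng/Ire)` label is a made-up example of a two-country player.

```python
import re

# Composite sides mentioned in the post; an assumption, not a complete list.
COMPOSITE_TEAMS = {"Asia", "ICC"}


def split_player(label):
    """Split 'Name (Team1/Team2)' into (name, [national teams])."""
    match = re.fullmatch(r"\s*(.+?)\s*\(([^()]*)\)\s*", label)
    if match is None:
        raise ValueError(f"unexpected player label: {label!r}")
    name = match.group(1)
    teams = [team.strip() for team in match.group(2).split("/")]
    return name, [team for team in teams if team not in COMPOSITE_TEAMS]


def single_country(label):
    """Return (name, country), or None when the player has several countries."""
    name, national = split_player(label)
    return (name, national[0]) if len(national) == 1 else None


kept = single_country("KC Sangakkara (Asia/ICC/SL)")
dropped = single_country("E Morgan (Eng/Ire)")  # hypothetical label format
```

Players for whom `single_country` returns `None` would be left out of the cohort analysis.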
I dropped 'player' and 'span' from the dataframe.
Then I reordered the columns.
Censoring the Data
When analyzing the durations of a sample or population, you may find individuals whose death has not occurred yet. Data with this behaviour is said to be right-censored. It is crucial to include the censored observations when modelling, because using only the non-censored data can suggest conclusions that need not be true. Data for survival analysis can be viewed as a regression dataset in which the outcome variable (the duration) is only partially observed for a few rows.
Censoring in Cricket
The event of death in this example is a player retiring from active ODI cricket. There are players who have not retired yet, who are still playing some form of active cricket, or who have passed away (e.g. Phil Hughes). These players form the censored data in this case. Unfortunately, Statsguru does not provide any easy way of finding out whether players have retired or not. For the sake of this implementation of survival curves, I have manually entered the censoring label for each of the players. In doing so, I have assumed that players who last played in 2011 or earlier have retired.
I chose 2011 for a few reasons:
- 4 years have passed since 2011 - there is a high chance that players who last played in 2011 have retired or will never play again. Yes, there may be exceptions.
- This accounts for the 2011 World Cup, which coincided with a number of high-profile retirements. Players often draw their careers to a close after a World Cup.
- It also accounts for the 2015 World Cup. Teams look to build towards the next World Cup, so those who last played in 2011 will likely never get a chance to play again. (Similar to the first reason.)
Censor label: 1 means that the player has retired, 0 means that the player has an active playing career.
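The 2011 assumption can be written down as a default rule. This is only a sketch of the rule: the labels in this post were entered manually, so individual players may differ from what it produces, and the names `RETIREMENT_CUTOFF` and `censor_label` are hypothetical.

```python
RETIREMENT_CUTOFF = 2011  # last year played at or before this counts as retired


def censor_label(last_year_played, cutoff=RETIREMENT_CUTOFF):
    """Return 1 (retired, event observed) or 0 (still active, censored)."""
    return 1 if last_year_played <= cutoff else 0


labels = [censor_label(year) for year in (2009, 2011, 2012, 2015)]
```

Keeping the cutoff as a parameter makes it easy to check how sensitive the curves are to that choice.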
The following data shows a few rows of the data with their censor labels.
There are 376 players labelled 0 and 1378 players labelled 1 in the dataset. In other words, there are 376 active players in ODI cricket, an average of around 37 players per team given that only 10 teams are considered. A typical squad has around 15-20 players, but once fringe players are included, the count of 37 makes sense.
Plotting the Survival Curve
There is a 50% chance that a player will have a career of at least 6 years. There is a 25% chance that a player's career extends beyond 10 years. There is only a 0.5% chance that a player's career will extend beyond 20 years.
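Survival curves built from censored data are commonly estimated with the Kaplan-Meier estimator (mentioned among the resources at the end of this post), which multiplies, over each event time, the fraction of at-risk players who survive it: S(t) = Π (1 − d_i / n_i). The plain-Python sketch below shows the idea on made-up data; it is not necessarily how the curves in this post were produced, and the function names are illustrative.

```python
from collections import Counter


def kaplan_meier(durations, observed):
    """Return [(t, S(t))] for every time t with at least one observed event."""
    events = Counter(d for d, o in zip(durations, observed) if o)
    exits = Counter(durations)  # everyone leaves the risk set at their duration
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        deaths = events.get(t, 0)
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]  # drop both retired and censored players
    return curve


def median_survival(curve):
    """First time at which the curve falls to 0.5 or below, else None."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None


# Made-up career lengths in years; observed=1 means retired, 0 means censored.
durations = [2, 3, 3, 5, 6, 6, 8, 10]
observed = [1, 1, 0, 1, 1, 0, 1, 0]

curve = kaplan_meier(durations, observed)
median = median_survival(curve)
```

Censored players never trigger a drop in the curve, but they still count towards the number at risk until the year they were last seen, which is why leaving them out would bias the estimate.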
Country Cohorts: Comparing India and Australia
- The survival curve for Australia lies above that of India between 3 and 10 years.
- The survival curve for India dominates that of Australia from 10 years.
- The tail for Australia ends at around 17 years while for India, it extends much longer (until 24 years).
- There is a greater chance of an Australian player having a career of 3 - 10 years than an Indian player having a career of the same duration.
- There is roughly a 1 in 4 chance that a player from either of these 2 countries will have a career of at least 10 years.
- It's more probable that a player from India will have a career of 10+ years than an Australian player.
- It is highly unlikely that an Australian player will have an ODI career more than 17 years, while the chances are much greater for an Indian player to have a career that lasts beyond 17 years.
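Statements like these come from reading two step curves at the same career length. The sketch below shows that lookup on two invented curves; the numbers are made up to mimic the pattern described above and are not the estimates for India or Australia. `survival_at` is a hypothetical helper.

```python
from bisect import bisect_right


def survival_at(curve, t):
    """Value at time t of a right-continuous step curve [(time, S)]."""
    times = [time for time, _ in curve]
    i = bisect_right(times, t)
    return 1.0 if i == 0 else curve[i - 1][1]


# Invented curves for two cohorts (illustrative values only).
cohort_a = [(1, 0.80), (3, 0.62), (6, 0.45), (10, 0.24), (17, 0.0)]
cohort_b = [(1, 0.78), (3, 0.55), (6, 0.40), (10, 0.27), (17, 0.08), (24, 0.0)]

at_5_years = (survival_at(cohort_a, 5), survival_at(cohort_b, 5))
at_12_years = (survival_at(cohort_a, 12), survival_at(cohort_b, 12))
```

In this invented example cohort A is ahead at 5 years and cohort B is ahead at 12 years, which is what it means for two survival curves to cross.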
Can these inferences be explained by the sport itself?
- Yes. Indian cricketers have always been given a long rope in their careers by the BCCI. Even when they are out of form or inconsistent, their experience has been given priority over their current form. Players from Kapil Dev to the likes of Sachin Tendulkar and Sourav Ganguly have been treated leniently in light of their contributions and experience.
- Australian players are not so fortunate: they have always been judged on their current form rather than their past records. In fact, CA (Cricket Australia) is quite famous for dropping the Waugh brothers in 2002 as their ODI contributions were dwindling.
- The way the boards manage their respective players can be viewed as a major reason for the difference in the curves. The two boards have different mindsets, although the BCCI has recently become less biased towards experience and is selecting players based on form.
- It makes sense that there's a greater chance of an Aussie player having a career of 3-10 years than an Indian player. An Indian player who has lasted that long is more likely to carry on past 10 years, as explained by the role of the boards above.
- The tail of the Australian survival curve is significantly shorter because it's very tough for a player in Australia to stay consistent over that many years, so it's highly unlikely that he stays in the team. The opposite is true for India, which has a long tail.
Cricket as a sport has evolved over the years. The sport we see now is a far cry from what it was during the 1980s.
I divide the players into 3 main 'eras', based on when they made their debut:
- 2007 - present
- 1989 - 2006
- Start of ODI cricket - 1988
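Assigning a player to one of these eras is a simple bucketing of the debut year. A minimal sketch, with a hypothetical function name and era labels chosen only for illustration:

```python
def debut_era(debut_year):
    """Map a debut year to one of the three eras used in this post."""
    if debut_year <= 1988:
        return "start-1988"
    if debut_year <= 2006:
        return "1989-2006"
    return "2007-present"


# Debut years taken from the career spans shown earlier, plus one recent year.
eras = [debut_era(year) for year in (1989, 2000, 1995, 1988, 2007)]
```

The boundaries are inclusive on both sides, so a 1989 debut falls in the middle era and a 2007 debut in the latest one.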
Then I plot the players based on the era.
- The survival curve for players who made their debut in 1989-2006 clearly dominates that of players who debuted before 1989, from 5 years onward.
- The domination is especially evident from 9 years on.
- Players who made their debut in 1989-2006 have a greater chance of a career longer than 9 years than players who debuted before 1989.
Can this be explained by the sport itself?
- Cricket has changed drastically from the 1980s to present.
- A major factor which can influence a player's career length is how injury prone his body is.
- The fitness standards of players have improved over the years thanks to better coaching standards, improved physio support, well-equipped training facilities, etc.
- This can likely lengthen a player's career - especially when he has played for around 10 years. The improved facilities/standards can help him get back to competitive cricket quicker than he would've been able to earlier.
Last, I'll take a look at how survival curves differ for different types of cricket players: batsmen and bowlers.
There are 722 bowlers, 691 batsmen and 97 all-rounders in this data. Now I'll plot the players based on their type.
- The survival curve for the batsmen is much higher than that of the bowlers.
- The tail for the batsmen extends well beyond 20 years, while the bowlers' curve ends by 20 years.
- Batsmen are expected to have longer careers than bowlers.
- Batsmen are more likely than bowlers to have careers of 20+ years.
Can this be explained by the sport itself?
- The role of the player in the team (batsman, bowler, or all-rounder) is a huge factor in a player's career length.
- Bowlers deal with many more injuries and have to maintain higher fitness standards than batsmen.
- Also, throughout the history of player selections, more bowlers have been tried and tested, with a large number of them dropped after playing a single series or two. The quest to find a bowling combination suited to all playing conditions continues to date, prompting the selectors to use bowlers on a case-by-case basis. This explains the higher number of bowlers and the significant difference in their career lengths.
Conclusion and Resources
I hope this post has served as a good introduction to survival analysis (albeit via an unconventional application).
Survival analysis has been described at length by various resources. The following are the major resources I used to understand the concept:
- Allen B. Downey's book on exploratory data analysis in Python includes a great chapter on survival curves, hazard functions, Kaplan-Meier estimators, etc.
- Econometrics Academy's notes on survival analysis
- Cam Davidson-Pilon's documentation for lifelines, a Python library for survival analysis.
- Nate Silver's 'The Signal and the Noise: Why So Many Predictions Fail-but Some Don't' has immensely influenced my perspective about predictions in general.