The Yhat Blog

machine learning, data science, engineering

Cricket Survival Analysis

by Keshav Ramaswamy | March 14, 2016

About Keshav: Keshav Ramaswamy is an aspiring data science engineer and a grad student at UF. He advocates the role of open-source in aiding future technology and believes in contributing to open-source projects. Contact him at keshav.ramaswamy@gmail.com.

What's a Survival Curve?

Survival analysis is based on the survival function. The survival function is the probability that the time of death, T, is later than some specified time t.

S(t) = Pr(T > t)
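
As a rough illustration of the definition, the sketch below estimates S(t) as the fraction of durations that exceed t. It assumes every duration is fully observed (no censoring), and the career lengths are invented for the example rather than taken from the dataset.

```python
def empirical_survival(durations, t):
    """Estimate Pr(T > t) as the share of durations strictly greater than t."""
    if not durations:
        raise ValueError("durations must not be empty")
    return sum(1 for d in durations if d > t) / len(durations)


# Hypothetical career lengths in years (not from the Statsguru data).
career_lengths = [1, 2, 2, 5, 8, 12, 16, 24]

for t in (0, 2, 10, 24):
    print(t, empirical_survival(career_lengths, t))
```

The estimate starts at 1 for t = 0 and can only fall as t grows, which is the shape every survival curve in this post shares.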

Survival analysis is used wherever the quantity of interest is how long each observation in a sample lasts until an event (the "death") occurs. It is applied in mechanical engineering to predict system failures and in medical science to predict patient outcomes.

In this post I'll be using Survival Analysis for a more lighthearted application--to analyze the career lengths of Cricket players.

Survival Analysis in Cricket

This is an attempt to extend this statistical concept to cricket by analyzing the career lengths of players. I wasn't able to analyze Test cricket (the highest standard of the sport) because of the amount of noise in player career data caused by the World Wars, the apartheid era, Kerry Packer's cricket series, etc. Instead, I've analyzed all players who have played ODI cricket. In this analysis, the event of death is a player's retirement.

Data Collection

There isn't any readily available data when it comes to cricket (yet). ESPNCricinfo still doesn't provide an API for its Statsguru database, so I had to scrape the data from the Statsguru webpages.

Here's the scraped Statsguru URL.

Scraping

The following method scrapes the required data from the webpages.
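
A minimal sketch of the table-extraction step is given here, using only the standard library. It is not the method used for this post: the TableParser class and the SAMPLE_HTML snippet are hypothetical stand-ins, and the real Statsguru markup may differ.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, one list per <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside a row
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Hypothetical snippet standing in for a downloaded results page.
SAMPLE_HTML = """
<table>
  <tr><th>Player</th><th>Span</th><th>Mat</th></tr>
  <tr><td> SR Tendulkar (India) </td><td>1989-2012</td><td>463</td></tr>
  <tr><td>RT Ponting (Aus/ICC)</td><td>1995-2012</td><td>375</td></tr>
</table>
"""

parser = TableParser()
parser.feed(SAMPLE_HTML)
for row in parser.rows:
    print(row)
```

The cells come out exactly as they appear in the markup, stray whitespace included, which is why a cleaning pass is needed afterwards.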

Data Cleaning

Once the data is scraped, it has to be cleaned: whitespace and other noise are stripped out so that the data has a proper structure.
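
One way this cleaning could look is sketched below. The helper names clean_row and clean_table and the raw_rows sample are hypothetical; the sketch only trims cells, collapses runs of whitespace and drops rows that end up empty.

```python
def clean_row(cells):
    """Trim each cell and collapse internal runs of whitespace to one space."""
    return [" ".join(cell.split()) for cell in cells]


def clean_table(rows):
    """Clean every row and drop rows whose cells are all empty afterwards."""
    cleaned = (clean_row(row) for row in rows)
    return [row for row in cleaned if any(row)]


# Hypothetical raw cells with the kind of noise a scraped page tends to carry.
raw_rows = [
    ["  SR Tendulkar\n (India) ", " 1989-2012 ", "463\t"],
    ["", "  ", "\n"],  # spacer row, dropped by clean_table
    ["RT Ponting (Aus/ICC)", "1995-2012", "375"],
]

for row in clean_table(raw_rows):
    print(row)
```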

Data Transformation

Now I need to transform the data into a structure that will be easy to work with. Here's how I created a dataframe fit for modelling.

I store the scraped data in a pandas dataframe. Since the source table is updated continuously, I saved the data as scraped on Oct 2 to a csv file for consistency.

There are 2246 players who have played ODI cricket since its inception in the 1970s. The data is sorted by default by the number of runs scored.

   Player                       Span       Mat  Inns  NO  Runs   HS    Ave    BF     SR     100  50  0
0  SR Tendulkar (India)         1989-2012  463  452   41  18426  200*  44.83  21367  86.23  49   96  20
1  KC Sangakkara (Asia/ICC/SL)  2000-2015  404  380   41  14234  169   41.98  18048  78.86  25   93  15
2  RT Ponting (Aus/ICC)         1995-2012  375  365   39  13704  164   42.03  17046  80.39  30   82  20
3  ST Jayasuriya (Asia/SL)      1989-2011  445  433   18  13430  189   32.36  14725  91.2   28   68  34
4  DPMD Jayawardene (Asia/SL)   1998-2015  448  418   39  12650  144   33.37  16020  78.96  19   77  28

I renamed the columns to make them easier to work with.

Since the variable I'm after is the length of the player's career, I extract the name of the player and the span (the duration of the career) from the original dataframe.

   player                       span
0  SR Tendulkar (India)         1989-2012
1  KC Sangakkara (Asia/ICC/SL)  2000-2015
2  RT Ponting (Aus/ICC)         1995-2012
3  ST Jayasuriya (Asia/SL)      1989-2011
4  DPMD Jayawardene (Asia/SL)   1998-2015

Now I want to create a couple of cohorts so I can compare survival curves to look for any insights. First, I create 'career start date' and 'career end date' columns.

   player                       span       career_start_date  career_end_date
0  SR Tendulkar (India)         1989-2012  1989               2012
1  KC Sangakkara (Asia/ICC/SL)  2000-2015  2000               2015
2  RT Ponting (Aus/ICC)         1995-2012  1995               2012
3  ST Jayasuriya (Asia/SL)      1989-2011  1989               2011
4  DPMD Jayawardene (Asia/SL)   1998-2015  1998               2015

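Now I'll add another column, 'career_length', computed from the previous two columns as career_end_date - career_start_date + 1, so that both the first and the last year are counted (a player who appeared in a single year has a career length of 1).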

   player                       span       career_start_date  career_end_date  career_length
0  SR Tendulkar (India)         1989-2012  1989               2012             24
1  KC Sangakkara (Asia/ICC/SL)  2000-2015  2000               2015             16
2  RT Ponting (Aus/ICC)         1995-2012  1995               2012             18
3  ST Jayasuriya (Asia/SL)      1989-2011  1989               2011             23
4  DPMD Jayawardene (Asia/SL)   1998-2015  1998               2015             18
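
The span handling behind the three derived columns can be sketched in plain Python as follows. The function name parse_span is hypothetical; the "+ 1" is inferred from the values in the table above, where 1989-2012 gives a career length of 24.

```python
def parse_span(span):
    """Split a 'YYYY-YYYY' span into (start, end, length), counting both end years."""
    start_text, end_text = span.split("-")
    start, end = int(start_text), int(end_text)
    if end < start:
        raise ValueError(f"span ends before it starts: {span!r}")
    return start, end, end - start + 1


for span in ("1989-2012", "2000-2015", "2009-2009"):
    print(span, parse_span(span))
```

With a dataframe, the same function could be applied to the span column row by row; that wiring is left out here to keep the sketch dependency-free.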

The country of the player has not been represented, though it has been included within the player's name. I think having the country as a column could be useful.

Some players, however, are listed under more than one team because they also appeared for representative sides such as Asia or ICC. I strip these team labels so that only the national side remains; the output below keeps KC Sangakkara (Asia/ICC/SL) with country SL. Other players, like Kepler Wessels and Eoin Morgan, have played for more than one country. For the sake of simplicity, I ignore these players, since they are likely to be outliers with longer career lengths.

I also restrict the data to players from the 'Full Members' of the ICC, as the associate countries' data may be noisy and create outliers.

   player                       span       career_start_date  career_end_date  career_length  name              country
0  SR Tendulkar (India)         1989-2012  1989               2012             24             SR Tendulkar      India
1  KC Sangakkara (Asia/ICC/SL)  2000-2015  2000               2015             16             KC Sangakkara     SL
2  RT Ponting (Aus/ICC)         1995-2012  1995               2012             18             RT Ponting        Aus
3  ST Jayasuriya (Asia/SL)      1989-2011  1989               2011             23             ST Jayasuriya     SL
4  DPMD Jayawardene (Asia/SL)   1998-2015  1998               2015             18             DPMD Jayawardene  SL
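
A possible implementation of the name and country split is sketched below. Several things here are assumptions rather than details from the post: the function name split_player, the regular expression, the set of composite team labels (only Asia and ICC are named in the text), and the team codes "Pak", "SA" and "Zim", which do not appear in the sample rows.

```python
import re

# Representative sides named in the post; the real list may be longer.
COMPOSITE_TEAMS = {"Asia", "ICC"}

# Team codes as seen in the sample rows; "Pak", "SA" and "Zim" are assumed spellings.
FULL_MEMBERS = {"Aus", "Ban", "Eng", "India", "NZ", "Pak", "SA", "SL", "WI", "Zim"}

PLAYER_PATTERN = re.compile(r"^(?P<name>.+?)\s*\((?P<teams>[^()]+)\)$")


def split_player(player):
    """Return (name, country), or None if the row should be dropped."""
    match = PLAYER_PATTERN.match(player.strip())
    if match is None:
        return None
    teams = [team.strip() for team in match.group("teams").split("/")]
    national = [team for team in teams if team not in COMPOSITE_TEAMS]
    # Drop players with several national sides or a side outside the Full Members.
    if len(national) != 1 or national[0] not in FULL_MEMBERS:
        return None
    return match.group("name"), national[0]


for player in ("SR Tendulkar (India)", "KC Sangakkara (Asia/ICC/SL)", "KC Wessels (Aus/SA)"):
    print(player, "->", split_player(player))
```

Returning None for dropped rows keeps the filtering decision in one place; the caller only has to discard the None results.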

I dropped 'player' and 'span' from the dataframe.

Then I reordered the columns.

   name              country  career_start_date  career_end_date  career_length
0  SR Tendulkar      India    1989               2012             24
1  KC Sangakkara     SL       2000               2015             16
2  RT Ponting        Aus      1995               2012             18
3  ST Jayasuriya     SL       1989               2011             23
4  DPMD Jayawardene  SL       1998               2015             18

Censoring the Data

When analyzing the durations of a sample or population, you may find individuals whose death has not occurred yet. Data with this property is said to be right-censored. It is crucial to include the censored observations when modelling, because using only the non-censored data can lead to conclusions that do not hold for the population as a whole. Data for survival analysis can be viewed as a regression dataset in which the outcome, the full duration, is not yet known for a few rows; the 'censor' column marks those rows.

Censoring in Cricket

The event of death in this example is a player retiring from active ODI cricket. Some players have not retired yet and are still playing some form of active cricket, and some have passed away (e.g. Phil Hughes). These players form the censored data in this case. Unfortunately, Statsguru does not provide any easy way of finding out whether a player has retired. For this implementation of survival curves, I have manually entered the censoring label for each player. In doing so, I have assumed that players who last played in 2011 or earlier have retired.

I chose 2011 for a few reasons:

• 4 years have passed since 2011 - there is a high chance that players who last played in 2011 have retired or will never play again. Yes, there may be exceptions.
• This accounts for the 2011 World Cup, which coincided with a number of high-profile retirements. Players often draw their careers to a close after a World Cup.
• It also accounts for the 2015 World Cup. Teams build towards the next World Cup, so those who last played in 2011 will likely never get a chance to play again. (Similar to the first reason.)

Censor label: 1 means that player has retired, 0 means that the player has an active playing career.

   name              country  career_start_date  career_end_date  career_length  censor
0  SR Tendulkar      India    1989               2012             24             1
1  KC Sangakkara     SL       2000               2015             16             1
2  RT Ponting        Aus      1995               2012             18             1
3  ST Jayasuriya     SL       1989               2011             23             1
4  DPMD Jayawardene  SL       1998               2015             18             1
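
The labelling rule can be sketched as a default based on the 2011 cutoff plus manual overrides. The names RETIREMENT_CUTOFF, MANUALLY_RETIRED and censor_label are hypothetical, and the override set holds only the two players from the table above who last played after 2011 yet carry a label of 1.

```python
RETIREMENT_CUTOFF = 2011  # last-played year at or before which retirement is assumed

# Players labelled as retired by hand even though they last played after the cutoff.
MANUALLY_RETIRED = {"SR Tendulkar", "KC Sangakkara"}


def censor_label(name, career_end_date, cutoff=RETIREMENT_CUTOFF):
    """Return 1 if the player is treated as retired, 0 if still active (right-censored)."""
    if name in MANUALLY_RETIRED:
        return 1
    return 1 if career_end_date <= cutoff else 0


for name, last_year in (("SR Tendulkar", 2012), ("ST Jayasuriya", 2011), ("JM Vince", 2015)):
    print(name, last_year, censor_label(name, last_year))
```

Note that under this convention the 'censor' column is really an "event observed" flag: a value of 1 means the retirement was seen, and 0 means the row is censored.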

The following shows a few rows of the data with their censor labels.

      name          country  career_start_date  career_end_date  career_length  censor
1742  EP Thompson   NZ       2009               2009             1              1
1743  AL Thomson    Aus      1971               1971             1              1
1744  RW Tolchard   Eng      1979               1979             1              1
1745  CM Tuckett    WI       1998               1998             1              1
1746  I Udana       SL       2012               2012             1              0
1747  JD Unadkat    India    2013               2013             1              0
1748  JM Vince      Eng      2015               2015             1              0
1749  Wahidul Gani  Ban      1988               1988             1              1
1750  KP Walmsley   NZ       2003               2003             1              1
1751  M Watkinson   Eng      1996               1996             1              1
1752  S Weerakoon   SL       2012               2012             1              0
1753  Zakir Hasan   Ban      1997               1997             1              1

There are 376 players labelled 0 and 1378 players labelled 1 in the dataset. In other words, there are 376 active players in ODI cricket, an average of around 37 players per team across the 10 teams considered. A typical squad has around 15-20 players, but once the fringe players are included, a count of around 37 makes sense.

Plotting the Survival Curve
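
The post does not show how the curve was estimated. A common choice for right-censored data is the Kaplan-Meier estimator, and the sketch below implements it in plain Python as an illustration only. The function name kaplan_meier and the sample data are hypothetical; 'observed' follows the censor convention above (1 = retired, 0 = still active).

```python
from collections import Counter


def kaplan_meier(durations, observed):
    """Return [(t, S(t))] at each time t where at least one event was observed.

    durations: career lengths; observed: 1 if the retirement was seen, 0 if censored.
    """
    if len(durations) != len(observed):
        raise ValueError("durations and observed must have the same length")
    events = Counter(d for d, e in zip(durations, observed) if e)
    exits = Counter(durations)  # every player leaves the risk set at their duration
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        deaths = events.get(t, 0)
        if deaths:
            # Multiply by the share of players at risk who survive past time t.
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]
    return curve


# Hypothetical sample, not drawn from the dataset.
durations = [1, 1, 2, 3, 3, 5, 8, 8]
observed = [1, 0, 1, 1, 1, 0, 1, 0]

curve = kaplan_meier(durations, observed)
for t, s in curve:
    print(t, round(s, 4))
```

Censored players never trigger a drop in the curve, but they stay in the denominator until their last observed year, which is how the estimator uses the information they carry.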

Observations

There is a 50% chance that a player will have a career of at least 6 years. There is a 25% chance that a player's career extends beyond 10 years. There is only a 0.5% chance that a player's career will extend beyond 20 years.
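
Figures like these are read off the step curve. The sketch below shows one way to do that lookup, assuming the curve is stored as a list of (time, survival) steps. The helper names are hypothetical, and so is the curve: its last three points echo the figures quoted above, while the first two are invented.

```python
from bisect import bisect_right


def survival_at(curve, t):
    """Value of a step curve [(time, S)] at time t; S is 1 before the first step."""
    times = [time for time, _ in curve]
    index = bisect_right(times, t)
    return 1.0 if index == 0 else curve[index - 1][1]


def time_to_fraction(curve, p):
    """Smallest step time at which the curve has dropped to p or below, else None."""
    for time, survival in curve:
        if survival <= p:
            return time
    return None


# Hypothetical step curve, not the one estimated in the post.
curve = [(1, 0.80), (3, 0.62), (6, 0.50), (10, 0.25), (20, 0.005)]

print(survival_at(curve, 7))
print(time_to_fraction(curve, 0.5))
print(time_to_fraction(curve, 0.25))
```

time_to_fraction(curve, 0.5) is the median career length under this curve; it returns None when the curve never falls that far, which can happen when many observations are censored.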

Country Cohorts: Comparing India and Australia

Observations

• The survival curve for Australia lies above that for India between 3 and 10 years.
• The survival curve for India lies above that for Australia from 10 years onward.
• The tail for Australia ends at around 17 years, while for India it extends much longer (until 24 years).
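
Comparing cohorts comes down to splitting the rows by a key and estimating one curve per group. The sketch below covers only the splitting step; the function name split_cohorts is hypothetical, and the four rows are copied from the tables earlier in the post.

```python
from collections import defaultdict


def split_cohorts(rows, key):
    """Group rows by key(row) and return {cohort: (durations, observed)}."""
    cohorts = defaultdict(lambda: ([], []))
    for row in rows:
        durations, observed = cohorts[key(row)]
        durations.append(row["career_length"])
        observed.append(row["censor"])
    return dict(cohorts)


rows = [
    {"name": "SR Tendulkar", "country": "India", "career_length": 24, "censor": 1},
    {"name": "RT Ponting", "country": "Aus", "career_length": 18, "censor": 1},
    {"name": "JD Unadkat", "country": "India", "career_length": 1, "censor": 0},
    {"name": "AL Thomson", "country": "Aus", "career_length": 1, "censor": 1},
]

by_country = split_cohorts(rows, key=lambda row: row["country"])
for country, (durations, observed) in by_country.items():
    print(country, durations, observed)
```

Each (durations, observed) pair can then be passed to whichever survival estimator is in use, and the resulting curves drawn on the same axes.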

Inferences

• There is a greater chance of an Australian player having a career of 3 - 10 years than an Indian player having a career of the same duration.
• There is roughly a 1 in 4 chance that a player from either of these 2 countries will have a career of at least 10 years.
• It's more probable that a player from India will have a career of 10+ years than an Australian player.
• It is highly unlikely that an Australian player will have an ODI career of more than 17 years, while an Indian player has a much greater chance of a career that lasts beyond 17 years.

Can these inferences be explained by the sport itself?

• Yes. Indian cricketers have always been given a long rope in their careers by the BCCI. Even when they are out of form or inconsistent, their experience has been given priority over their current form. Players from Kapil Dev to Sachin Tendulkar and Sourav Ganguly have been treated leniently in view of their contributions and experience.
• Australian players are not so fortunate: they have always been judged on their current form rather than their past records. In fact, CA (Cricket Australia) is quite famous for dropping the Waugh brothers in 2002 as their ODI contributions were dwindling.
• The way the boards manage their respective players can be viewed as a major reason for the difference in the curves. The two boards have different mindsets, although the BCCI has recently become less biased towards experienced players and now selects more on form.
• It makes sense that there's a greater chance of an Aussie player having a career of 3-10 years than an Indian player. Most Indian players who have lasted that long go on to have a career of more than 10 years, as explained by the role of the boards above.
• The tail of the Australian survival curve is significantly shorter because it is very tough for a player in Australia to stay consistent over that many years, so it is highly unlikely that he stays in the team. The opposite is true for India, which has a long tail.

Era Cohorts

Cricket as a sport has evolved over the years. The sport we see now is a far cry from what it was during the 1980s.

I divide the players into 3 main 'eras', based on when they made their debut:

• 2007 - present
• 1989 - 2006
• Start of ODI cricket - 1988

Then I plot the players based on the era.
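
Assigning an era from the debut year can be sketched as a simple threshold function. The name debut_era and the era labels are hypothetical; the boundaries are the ones listed above.

```python
def debut_era(career_start_date):
    """Map a debut year to one of the three eras used in this section."""
    if career_start_date >= 2007:
        return "2007-present"
    if career_start_date >= 1989:
        return "1989-2006"
    return "pre-1989"


for year in (1971, 1988, 1989, 2006, 2007, 2015):
    print(year, debut_era(year))
```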

Observations

• The survival curve for players who made their debut in 1989-2006 clearly lies above the curve for players who made their debut before 1989, from 5 years through to the end.
• The gap is especially evident from 9 years on.

Inferences

• Players who made their debut in 1989-2006 have a greater chance of a career longer than 9 years than players who made their debut before 1989.

Can this be explained by the sport itself?

• Cricket has changed drastically from the 1980s to present.
• A major factor which can influence a player's career length is how injury prone his body is.
• The fitness standards of the player have improved over the years thanks to better coaching standards, improved player physio-support, well equipped training facilities etc.
• This can likely lengthen a player's career - especially when he has played for around 10 years. The improved facilities/standards can help him get back to competitive cricket quicker than he would've been able to earlier.

Player Cohorts

Last, I'll take a look at how survival curves differ for the types of Cricket players--batsmen and bowlers.

There are 722 bowlers, 691 batsmen and 97 all-rounders in this data. Now I'll plot the players based on their type.

Observations

• The survival curve for the batsmen is much higher than that of the bowlers.
• The tail for the batsmen extends well beyond 20 years, while the bowlers' curve ends by 20.

Inferences

• Batsmen are expected to have a longer career than bowlers.
• A batsman is more likely than a bowler to have a career of 20+ years.

Can this be explained by the sport itself?

• The role of the player in the team (batsman, bowler, or all-rounder) is a huge factor in a player's career length.
• Bowlers deal with many more injuries and have to maintain higher fitness standards than batsmen.
• Also, throughout the history of player selections, more bowlers have been tried and tested, and a large number of them were dropped after playing a single series or two. The quest to find a bowling combination suited to all playing conditions continues to date, prompting the selectors to use bowlers on a case-by-case basis. This explains the higher number of bowlers and the significant difference in their career lengths.

Conclusion and Resources

I hope this post has served as a good introduction to Survival Analysis (albeit via an unconventional application).

Survival Analysis has been described at length by various resources. The following are the major resources I used to understand the concept:
