This original paper was published in the ‘International Journal of Statistics and Applied Mathematics’

Yash Mantri

20th May 2021

Note: IPL 2021 has been suspended during the time of writing this paper due to Covid-19 and is expected to resume when conditions are safer.

Abstract

Founded in 2007, the Indian Premier League (IPL) is one of the most watched sports leagues in the world. The event is followed and enjoyed by millions of Indians and cricket lovers across the globe.

This study determines the quantitative factors that have played a role in winning matches in the last three seasons of the IPL. This is done using statistical tools such as Karl Pearson’s coefficient of correlation and multiple correlation, along with regression analysis and the t-test. The qualifiers and finalists of the IPL 2021 are predicted using the binomial distribution with the help of past winning percentages. Cricket is not governed by numerical factors alone; therefore, the predicted qualifiers are further investigated through a qualitative factor analysis using the chi-square test and rank correlation. The overall analysis identifies the team with the highest chance of winning the IPL 2021, considering both quantitative and qualitative factors.

Key Terms

IPL, Bowling average, Net Run Rate, Coefficient of correlation, Rank correlation, Binomial distribution, qualitative and quantitative factors of a team.

1.1 Introduction

Cricket, also known as the Gentleman’s Game, is currently the world’s second most popular sport and is loved by people all around the globe.

The first reference to cricket can be traced back to the early 1600s in England when it was played in grammar schools, villages, and farm communities. The first official international match was played in 1877 between England and Australia.

Test cricket is the traditional form of the game and has been played ever since. Each side bats for two innings, and a match is played over five days. It is known as the pinnacle form because it tests teams over a long duration.

Roughly a century later, in the 1980s, One Day Internationals (ODIs) gained popularity. This is a quicker format of the game in which each side bats for a single innings of 50 overs. The well-known International Cricket Council (ICC) Cricket World Cup is contested every four years in this format.

Twenty20 cricket, the newest format, revolutionized the game when it was introduced in 2003. It brought rule changes and fewer overs, which ushered in power hitting and a whole new audience. It triggered the adoption of new skill sets and innovations among both batsmen and bowlers. A typical T20 match takes about three hours to complete and features creative batting, skillful bowling, and brilliant fielding. Other than the ICC Cricket Twenty20 championship, many T20 tournaments have emerged over the years, including the Big Bash League, Caribbean Premier League, Super Smash and, of course, the Indian Premier League.

The Indian Premier League (IPL) was founded by the Board of Control for Cricket in India in 2007. The league is contested by eight teams based in eight different Indian cities. Each team fields a combination of four foreign players and seven Indian players, with squads reshuffled through an auction that takes place every year before the start of the IPL. The IPL is the most-attended cricket league in the world and was ranked sixth by average attendance among all sports leagues.

1.2 Aim

To determine the most effective quantitative factors that have played a role in winning matches in the last three seasons of the Indian Premier League and to further analyze the qualitative factors of teams.

To make predictions about the qualifiers and winner of the 2021 season using various statistical tools and probability distributions.

2.1 Data Collection (IPL 2020)

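| Team | Matches won | Average of top 5 run scorers | Average of top 3 strike rates | Total number of 6s hit | Total number of 4s hit | Dot balls per match | Batsmen with strike rate greater than 140 | Total 50s per team | Bowling average of top 2 opening pacers | Total number of wickets taken per match |
|---|---|---|---|---|---|---|---|---|---|---|
| Mumbai Indians | 9 | 422.4 | 172.05 | 137 | 222 | 46 | 6 | 17 | 16.62 | 5.9 |
| Delhi Capitals | 8 | 412 | 144.66 | 88 | 236 | 40 | 4 | 14 | 20.01 | 5.7 |
| Sunrisers Hyderabad | 7 | 369.8 | 148.42 | 79 | 213 | 42 | 3 | 17 | 18.9 | 5.62 |
| Royal Challengers Bangalore | 7 | 358 | 135.46 | 43 | 176 | 36 | 2 | 14 | 18.7 | 5 |
| Kolkata Knight Riders | 7 | 321.8 | 141.79 | 86 | 197 | 37 | 2 | 11 | 26.3 | 5.2 |
| Punjab Kings | 6 | 373 | 156.13 | 98 | 179 | 43 | 2 | 14 | 24 | 5.07 |
| Chennai Super Kings | 6 | 308.6 | 148.17 | 75 | 188 | 38 | 2 | 12 | 28 | 5 |
| Rajasthan Royals | 6 | 310.8 | 160.913 | 105 | 171 | 42 | 3 | 11 | 30 | 4.3 |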

2.2 Karl Pearson’s Coefficient of Correlation:

The Pearson product-moment correlation coefficient r measures the strength of the linear relationship between an independent variable x and a dependent variable y. It is used to find the association between the two variables.

$$ r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}} $$

Here, (i) n is the sample size of the dataset and equals 8, (ii) x is the data set of the quantitative values of the factor being analyzed (independent variable), and (iii) y is the data set of the number of matches won (dependent variable).

Note: The data from the IPL 2020 season has been analyzed.

First, the factor of the mean of the bowling average of the top 2 opening pacers from each team has been taken. A player’s bowling average is the number of runs they have conceded per wicket taken. Hence, the lower this value is for a bowler, the better it is considered. The correlation coefficient between the bowling average of top 2 opening pacers from each team and the matches won by that team in that season has been calculated:

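| Team | Bowling average of top 2 opening pacers (X) | Matches won (Y) | XY | X² | Y² |
|---|---|---|---|---|---|
| Mumbai Indians | 16.62 | 9 | 149.58 | 276.2 | 81 |
| Delhi Capitals | 20 | 8 | 160 | 400 | 64 |
| Sunrisers Hyderabad | 19 | 7 | 133 | 361 | 49 |
| Royal Challengers Bangalore | 19 | 7 | 133 | 361 | 49 |
| Kolkata Knight Riders | 26.3 | 7 | 184.1 | 691.7 | 49 |
| Punjab Kings | 24 | 6 | 144 | 576 | 36 |
| Chennai Super Kings | 28 | 6 | 168 | 784 | 36 |
| Rajasthan Royals | 30 | 6 | 180 | 900 | 36 |
| Total | ΣX = 182.92 | ΣY = 56 | ΣXY = 1251.68 | ΣX² = 4349.9 | ΣY² = 400 |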

Calculating r,

$$ r = \frac{8(1251.68) - (182.92)(56)}{\sqrt{\left[8(4349.9) - (182.92)^2\right]\left[8(400) - (56)^2\right]}} \approx -0.786 $$

(High Inverse Correlation)

Thus, the two variables have a strong inverse relationship. The association is negative because a lower bowling average is better for a bowler. As the magnitude of the coefficient is high (r is below -0.75), the bowling average of the opening pacers can be regarded as a contributing factor in winning matches in the Indian Premier League.
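The calculation above can be checked with a few lines of Python. This is a minimal sketch rather than the author's own working: the function name `pearson_r` is chosen here for illustration, and the two lists simply copy the X and Y columns of the table.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson's r from raw sums: (nΣxy - ΣxΣy) / sqrt([nΣx² - (Σx)²][nΣy² - (Σy)²])."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


# Columns X and Y of the table (IPL 2020), in the same team order.
bowling_average = [16.62, 20, 19, 19, 26.3, 24, 28, 30]
matches_won = [9, 8, 7, 7, 7, 6, 6, 6]

r = pearson_r(bowling_average, matches_won)
print(round(r, 3))  # -0.786
```

Passing any other factor column in place of `bowling_average` gives the coefficient for that factor.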

The same formula has been applied to a number of possible factors, and the coefficient of correlation between each factor and the number of matches won has been obtained.

Table

| S.No. | Factor | Coefficient of Correlation | Remark |
|---|---|---|---|
| 1. | Average runs scored by top 5 run scorers from each team | +0.8073 | Strong Correlation |
| 2. | Average of top 3 strike rates from each team | +0.2887 | Weak Correlation |
| 3. | Average number of 6s hit per match | +0.4174 | Weak Correlation |
| 4. | Average number of 4s hit per match | +0.6405 | Moderate Correlation |
| 5. | Total number of 50s scored by each team | +0.6188 | Moderate Correlation |
| 6. | Batsmen in each team with a strike rate greater than 140 | +0.8504 | Strong Correlation |
| 7. | Average number of dots per match | +0.3558 | Weak Correlation |
| 8. | Bowling average of top 2 opening pacers from each team | -0.786 | Strong Inverse Correlation |
| 9. | Bowling economy of top two spinners | -0.092 | Weak Inverse Correlation |
| 10. | Total number of wickets taken per match | +0.826 | Strong Correlation |

It can be observed from the table that the factors which contribute most to winning matches are:

  1. Average runs scored by top 5 run scorers from each team
  2. Batsmen in each team with a strike rate greater than 140
  3. Bowling Average of top 2 opening pacers from each team
  4. Total number of wickets taken per match

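Taken individually, these four factors account for roughly 61% to 72.5% of the variation in matches won, as measured by the coefficient of determination (the square of the coefficient of correlation).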

2.3 t-test

The widely used t-test is applied next to establish whether this correlation also holds for the population data, which includes all T20 cricket matches and other Twenty20 leagues. The t value is calculated using the following formula:

$$ t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} $$

After substituting the respective values for r and n, a t value of 3.959 was obtained for the relationship between matches won and batsmen in each team with a strike rate greater than 140. The significance level has been taken as α = 0.05.

Null Hypothesis (Ho): There is no correlation between matches won and batsmen with a strike rate greater than 140 in the population data (ρ = 0).

Alternate Hypothesis (Ha): There is a correlation between matches won and batsmen with a strike rate greater than 140 in the population data (ρ ≠ 0).

Degrees of Freedom: The formula for obtaining the degrees of freedom is n - 2. With n = 8, a value of 6 is obtained, and the corresponding two-tailed critical t value at the 5% level of significance is 2.447.

Hence, as the value of 3.959 exceeds the critical value, the null hypothesis is rejected, indicating that the correlation observed in the sample data is statistically significant for the population data as well. The same was observed for the rest of the factors after repeating the same procedure. This suggests that the same factors are likely to contribute to winning in other Twenty20 leagues, and perhaps even in everyday gully cricket T20 tournaments!
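As a quick check of the arithmetic, the test statistic for a correlation coefficient can be recomputed in Python. This is a sketch under stated assumptions: the helper name `t_statistic` is illustrative, the statistic is the standard t = r√(n-2)/√(1-r²), and the critical value 2.447 is the figure quoted in the text (two-tailed, 6 degrees of freedom, 5% level) rather than something computed here.

```python
from math import sqrt


def t_statistic(r, n):
    """t value for testing whether a sample correlation r (from n pairs) differs from zero."""
    return r * sqrt(n - 2) / sqrt(1 - r ** 2)


r = 0.8504          # matches won vs. batsmen with strike rate > 140
n = 8               # number of teams
critical_t = 2.447  # taken from the text, not computed

t = t_statistic(r, n)
print(round(t, 3))          # 3.959
print(abs(t) > critical_t)  # True: the correlation is significant at the 5% level
```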

2.4 Regression Analysis

A linear regression model assesses the relationship between a dependent variable and an independent variable. Here, regression analysis is used to predict the matches won by a team (y) from an independent variable (x), such as the total number of wickets taken by the team per match. The regression coefficient of y on x is calculated using the following equation:

$$ b_{yx} = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - \left(\sum x\right)^2} $$

A value of 1.789 is obtained for $b_{yx}$, which is then used in the linear regression equation:

$$ y - \bar{y} = b_{yx}\,(x - \bar{x}) $$

Finally, the equation can be rearranged to form a simplified equation for y:

$$ y = 7 + 1.789\,(x - 5.23) \approx 1.789x - 2.36 $$

Here, (i) $\bar{y}$, having a value of 7, represents the mean value of y (number of matches won) and (ii) $\bar{x}$, having a value of 5.23, represents the mean value of x (number of wickets taken per match).
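To illustrate how such a regression line is used for prediction, the short Python sketch below plugs a wickets-per-match figure into the line. The function name `predict_matches_won` is hypothetical, and the slope and means are simply the values quoted in the text (1.789, 5.23 and 7), not re-estimated from the data.

```python
def predict_matches_won(wickets_per_match, slope=1.789, mean_x=5.23, mean_y=7.0):
    """Point on the regression line y - mean_y = slope * (x - mean_x)."""
    return mean_y + slope * (wickets_per_match - mean_x)


# A team taking the league-average 5.23 wickets per match is predicted to win 7 games.
print(round(predict_matches_won(5.23), 2))  # 7.0
# A team taking 5.9 wickets per match is predicted to win about 8.2 games.
print(round(predict_matches_won(5.9), 2))   # 8.2
```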

Similarly, the regression equations were found for the other three factors as well:

Average runs scored by top 5 run scorers from each team:

Batsmen in each team with a strike rate greater than 140: 

Bowling Average of top 2 opening pacers from each team:  

2.5 Multi-Variable Correlation

So far, correlation has been used to determine the relationship between two variables, and regression has been used to predict the matches won in a season from a single factor. Multi-variable correlation is a measure of how well a given variable (in this case, the matches won by a team in a season of the IPL) can be predicted using a linear function of a set of other variables. Here, the multi-variable correlation of matches won (y) with the total number of wickets taken per match (x₁) and the bowling average of the top two opening pacers (x₂) is calculated, which is given by the equation:

$$ R_{y \cdot x_1 x_2} = \sqrt{\frac{r_{yx_1}^2 + r_{yx_2}^2 - 2\,r_{yx_1}\,r_{yx_2}\,r_{x_1x_2}}{1 - r_{x_1x_2}^2}} $$

  where

  • $r_{yx_1}$ = Correlation between matches won and total wickets taken per match.
  • $r_{x_1x_2}$ = Correlation between total wickets taken per match and bowling average of top 2 opening pacers from each team.
  • $r_{yx_2}$ = Correlation between matches won and bowling average of top 2 opening pacers from each team.

The values of the Pearson’s Co-efficient of Correlation have been tabulated in the table below.

| Factors | Pearson’s Co-efficient of Correlation |
|---|---|
| Matches won and Total wickets taken per match ($r_{yx_1}$) | +0.826 |
| Matches won and Bowling average of top 2 opening pacers from each team ($r_{yx_2}$) | -0.786 |
| Total wickets taken per match and Bowling average of top 2 opening pacers from each team ($r_{x_1x_2}$) | -0.802 |

Substituting these values gives R² ≈ 0.7251. Therefore, in the Indian Premier League, total wickets taken per match and the bowling average of the top 2 opening pacers together account for about 72.51% of the variation in the total number of matches won by a team in a season.

Similarly, we can take multiple factors into account and calculate how much they affect winning.
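The 72.51% figure can be reproduced from the three pairwise coefficients in the table. The sketch below assumes the standard two-predictor formula R² = (r₁² + r₂² - 2·r₁·r₂·r₁₂) / (1 - r₁₂²); the function and argument names are illustrative.

```python
from math import sqrt


def multiple_r_squared(r_y1, r_y2, r_12):
    """Squared multiple correlation of y on two predictors, from pairwise correlations."""
    return (r_y1 ** 2 + r_y2 ** 2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12 ** 2)


r_squared = multiple_r_squared(
    r_y1=0.826,   # matches won vs. wickets per match
    r_y2=-0.786,  # matches won vs. bowling average of opening pacers
    r_12=-0.802,  # wickets per match vs. bowling average of opening pacers
)
print(round(r_squared, 4))        # 0.7251, i.e. about 72.51%
print(round(sqrt(r_squared), 3))  # multiple correlation coefficient R
```

Note that the combined figure is only slightly higher than the best single factor, because the two bowling factors are themselves strongly correlated (-0.802) and so carry overlapping information.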

3.1 Net Run Rate

Teams that have a higher overall net run rate in the season tend to be the teams that end up qualifying for the playoffs and the winning teams usually have one of the highest net run rates in the season.

So, what is net run rate? One hears of this terminology many times when it comes down to choosing between two teams that have won the same number of games. The team with a higher net run rate is always ranked higher than a team that has a lower net run rate given that the matches won by both teams are same.

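Net run rate is a statistic used in cricket to compare the runs a team scores and concedes with the number of overs it faces and bowls.

$$ \text{Net Run Rate} = \frac{\text{total runs scored}}{\text{total overs faced}} - \frac{\text{total runs conceded}}{\text{total overs bowled}} $$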

The average net run rate of each team over the three seasons from 2018 to 2020 has been calculated and is shown below:

| Team | Average Net Run Rate between 2018-20 |
|---|---|
| Mumbai Indians | 0.615 |
| Delhi Capitals | -0.083 |
| Sunrisers Hyderabad | 0.490 |
| Royal Challengers Bangalore | -0.216 |
| Kolkata Knight Riders | -0.085 |
| Punjab Kings | -0.305 |
| Chennai Super Kings | -0.008 |
| Rajasthan Royals | -0.427 |

Thus, it can be observed that the top four teams with the best average net run rate over the last 3 seasons are:

  1. Mumbai Indians
  2. Sunrisers Hyderabad
  3. Chennai Super Kings
  4. Delhi Capitals
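The ranking above follows directly from the table of averages, as the short Python sketch below shows. It is illustrative only: the dictionary copies the averages from the table, the helper `net_run_rate` is a hypothetical name, the figures passed to it are made up for the example, and overs are assumed to be given as decimal numbers rather than in cricket's overs.balls notation.

```python
def net_run_rate(runs_scored, overs_faced, runs_conceded, overs_bowled):
    """Run rate scored minus run rate conceded (overs given as decimal numbers)."""
    return runs_scored / overs_faced - runs_conceded / overs_bowled


# Hypothetical single match: 160 runs in 20 overs, conceding 150 in 20 overs.
print(net_run_rate(160, 20, 150, 20))  # 0.5

# Average net run rate between 2018 and 2020, copied from the table.
average_nrr = {
    "Mumbai Indians": 0.615,
    "Delhi Capitals": -0.083,
    "Sunrisers Hyderabad": 0.490,
    "Royal Challengers Bangalore": -0.216,
    "Kolkata Knight Riders": -0.085,
    "Punjab Kings": -0.305,
    "Chennai Super Kings": -0.008,
    "Rajasthan Royals": -0.427,
}

top_four = sorted(average_nrr, key=average_nrr.get, reverse=True)[:4]
print(top_four)
```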

IPL 2021 was unfortunately postponed midway due to COVID-19. Most teams have already played 7 out of 14 matches. Looking at the teams above, three out of the four teams with highest net run rates are in the top four teams of the current points table so far.

Thus, the teams that we will further investigate are:

  1. Mumbai Indians
  2. Chennai Super Kings
  3. Delhi Capitals

The aim is to make predictions about the qualifiers and winner of the IPL 2021 once it resumes.

3.2 Binomial Distribution

Cricket is a game of glorious uncertainties. Anything can change at any point in time and that is what makes it so exciting. The IPL has witnessed the breaking of numerous T20 records and many nail-biting matches. An example of this is the match played between Mumbai Indians and Rajasthan Royals in IPL 2014. This is considered to be one of the most dramatic thrillers in IPL history. All the hard work of the season came down to one ball. Mumbai Indians had to score a boundary off the last ball and Aditya Tare helped Mumbai clinch the game with a six, advancing Mumbai Indians to the playoffs leaving Wankhede Stadium in wild celebrations.

In mathematics, uncertainty is measured by probability. The binomial distribution is a common discrete probability distribution used in statistics. Using the winning probabilities of our three shortlisted teams, it is possible to use binomial distribution to calculate the probability of these teams making it to the playoffs and winning. The binomial distribution only counts two states, typically representing success or failure (in this case win or loss) given a fixed number of trials in the data.

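The first step is to calculate the winning probabilities of the three teams based on the last three years:

$$ p = \frac{\text{matches won in the last three seasons}}{\text{matches played in the last three seasons}} $$

Using this we get,

Mumbai Indians = 59.1%

Chennai Super Kings = 57.2%

Delhi Capitals = 52.38%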

Before we use the binomial distribution, we need to take into account that, before the IPL was postponed, Delhi Capitals had played 8 games, while Mumbai Indians and Chennai Super Kings had played 7 games each.

Observing the last three seasons, a team that wins 7 or more matches (given a high net run rate) makes it to the playoffs. In most cases, a team that wins more than 8 matches goes on to reach the final.

Now we can calculate the following probabilities for each team using the binomial distribution:

Mumbai Indians

Games left=7

Games required to win to reach playoffs= 3 or more

Games required to win to play the final= more than 4

Calculating,

Probability (Mumbai Indians will reach the playoffs):

P(Winning 3 matches) + P(Winning 4 matches) + P(Winning 5 matches) + P(Winning 6 matches) + P(Winning 7 matches)

$$ = \sum_{k=3}^{7} \binom{7}{k} (0.591)^k (0.409)^{7-k} $$

= 0.89477 = 89.477%

Using the same procedure,

Probability (Mumbai Indians will reach the final) =0.4004= 40.04%

[Figure: binomial probability distribution of Mumbai Indians’ next seven games]

Chennai Super Kings

Games left=7

Games required to win to reach playoffs= 2 or more

Games required to win to play the final= more than 3

Calculating,

Probability (Chennai Super Kings will reach the playoffs) =0.97275= 97.275%

Probability (Chennai Super Kings reach the final) =0.65427= 65.427%

Delhi Capitals

Games left=6

Games required to win to reach playoffs= 1 or more

Games required to win to play the final= more than 2

Calculating,

Probability (Delhi Capitals will reach the playoffs) =0.98833=98.83%

Probability (Delhi Capitals make the final) =0.6997=69.97%

It can be observed that it is very likely that all three of these teams will make the playoffs. The teams that are most likely to win 9 or more matches and reach the finals are Delhi Capitals with a chance of 69.97% and Chennai Super Kings with 65.427%. Thus, it can be predicted that Delhi Capitals or Chennai Super Kings have the highest chance of reaching the final in IPL 2021 currently.
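The six probabilities quoted above can be reproduced with a short Python sketch. It assumes, as the analysis does, that each remaining game is an independent trial with the team's three-season winning probability; the helper name `prob_at_least` and the layout of the `teams` dictionary are chosen here for illustration.

```python
from math import comb


def prob_at_least(k_min, n, p):
    """P(X >= k_min) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(k_min, n + 1))


# team: (games left, win probability, wins needed for playoffs, wins needed for final)
teams = {
    "Mumbai Indians": (7, 0.591, 3, 5),
    "Chennai Super Kings": (7, 0.572, 2, 4),
    "Delhi Capitals": (6, 0.5238, 1, 3),
}

results = {}
for team, (games_left, p_win, playoff_wins, final_wins) in teams.items():
    results[team] = (
        prob_at_least(playoff_wins, games_left, p_win),
        prob_at_least(final_wins, games_left, p_win),
    )
    playoffs, final = results[team]
    print(f"{team}: playoffs {playoffs:.4f}, final {final:.4f}")
```

The thresholds differ by team because each has already banked a different number of wins; "more than 4" for Mumbai Indians, for example, is entered as at least 5.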

3.3 Qualitative Factor Analysis using Chi-Square Test

Cricket is certainly not governed by quantitative factors alone. There are numerous qualitative factors that contribute to the overall performance of a team. These include fielding, fitness, team motivation, home ground advantage and many more.

To investigate whether teams rank differently based on the qualitative factors, it was decided to use a chi-square test. A questionnaire was sent out to a random sample of 100 cricket enthusiasts who had been following the IPL 2021. They were asked to rate teams on a scale of 1 to 5 (5 being the best), keeping in mind their overall fielding, fitness, and team spirit.

The three teams we will apply the test on are Mumbai Indians, Delhi Capitals and Chennai Super Kings.

The responses to the fielding rating of the three teams are shown below.

| Team | Rating of 1, 2 or 3 (Low) | Rating of 4 (Medium) | Rating of 5 (High) |
|---|---|---|---|
| Mumbai Indians | 26 | 63 | 11 |
| Chennai Super Kings | 28 | 36 | 36 |
| Delhi Capitals | 23 | 56 | 21 |

Ho (Null Hypothesis): The rating of IPL teams is independent of fielding performance.

Ha (Alternative Hypothesis): The rating of IPL teams is dependent on fielding performance.

Degrees of Freedom = (Number of rows - 1)(Number of columns - 1) = (3 - 1)(3 - 1) = 4

Level of Significance (α) = 5%

The critical value of χ² for 4 degrees of freedom at the 5% level of significance is 9.488.

Calculating the chi-square statistic, the value we get is 21.5554.

[Figure: chi-square distribution at 4 degrees of freedom]

Conclusion: Since the calculated value is greater than the critical value, we can reject the null hypothesis, which indicates that the rating of teams is dependent on fielding performance.

The same procedure was carried out for overall fitness and team spirit. The results obtained were the same and there was statistically significant evidence that the rating of teams is dependent on the qualitative factors.
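For readers who want to see how the chi-square statistic is built up, the Python sketch below computes it from the fielding table using expected counts of (row total × column total) / grand total. The function name is illustrative and the counts are those of the table as transcribed here. On these counts the statistic comes out at about 22.06, close to but not exactly the reported 21.5554; a difference of this size may come from rounding or from the transcription of the table, and the conclusion is unaffected because both values are well above the critical value of 9.488.

```python
def chi_square_statistic(table):
    """Chi-square statistic and degrees of freedom for a contingency table (dict of rows)."""
    rows = list(table.values())
    row_totals = [sum(row) for row in rows]
    col_totals = [sum(col) for col in zip(*rows)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(rows):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected

    dof = (len(rows) - 1) * (len(col_totals) - 1)
    return statistic, dof


# Fielding ratings: [Low (1-3), Medium (4), High (5)] out of 100 responses per team.
fielding = {
    "Mumbai Indians": [26, 63, 11],
    "Chennai Super Kings": [28, 36, 36],
    "Delhi Capitals": [23, 56, 21],
}

statistic, dof = chi_square_statistic(fielding)
critical_value = 9.488  # from the text: 4 degrees of freedom, 5% level
print(round(statistic, 2), dof, statistic > critical_value)
```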

3.4 Spearman’s Rank Correlation Coefficient

Finally, it was decided to use Spearman’s Rank correlation to investigate seven qualitative factors for the two predicted finalists, Delhi Capitals and Chennai Super Kings. To make the research as effective as possible, two senior cricket experts and analysts, Mr. Gaurav Kalra (Senior Editor at ESPN) and S. Ganesh (Sports consultant and ex-manager of Punjab Kings), graciously provided their scores out of 10 for each of these factors. The seven factors were Fielding, Fitness, Popularity, Team Energy, Team Balance, Hitting Power, and Death-Over Bowling.

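The formula for Spearman’s Rank Correlation is:

$$ \rho = 1 - \frac{6\sum d^2}{n(n^2 - 1)} $$

As ranks are repeated, the rank correlation formula with the repeated ranks is as follows:

$$ \rho = 1 - \frac{6\left[\sum d^2 + \sum \frac{m^3 - m}{12}\right]}{n(n^2 - 1)} $$

where d is the rank difference, n is the total number of samples, and m is the number of times a rank is repeated.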
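The rank correlation formula with a correction for repeated ranks can be sketched in Python as follows. This is an illustration under assumptions, not the author's calculation: tied scores are given the average of the ranks they occupy, the correction term (m³ - m)/12 is added once for every group of m tied scores, and the two score lists at the bottom are made-up examples rather than the experts' ratings.

```python
from collections import Counter


def average_ranks(scores):
    """Rank scores from highest (rank 1) to lowest; tied scores share their mean rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1  # positions are 0-based, ranks are 1-based
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks


def tie_correction(scores):
    """Sum of (m^3 - m) / 12 over every group of m equal scores."""
    return sum((m ** 3 - m) / 12 for m in Counter(scores).values())


def spearman_rho(x, y):
    """Spearman's rank correlation with the correction for repeated ranks."""
    n = len(x)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(average_ranks(x), average_ranks(y)))
    correction = tie_correction(x) + tie_correction(y)
    return 1 - 6 * (sum_d2 + correction) / (n * (n ** 2 - 1))


# Hypothetical scores from two raters for five factors (no ties).
rater_1 = [9, 8, 7, 6, 5]
rater_2 = [8, 9, 6, 7, 5]
print(round(spearman_rho(rater_1, rater_2), 3))  # 0.8
```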

Table of Ranks for Both Teams:

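| Factor | Delhi Capitals (Expert 1) | Delhi Capitals (Expert 2) | Chennai Super Kings (Expert 1) | Chennai Super Kings (Expert 2) |
|---|---|---|---|---|
| Fielding | 6 | 7 | 7 | 6.5 |
| Fitness | 4 | 5 | 6 | 7 |
| Popularity | 6 | 6 | 8 | 8 |
| Team Energy | 7 | 6 | 8 | 8.5 |
| Team Balance | 7 | 8 | 6 | 5.5 |
| Hitting Power | 8 | 6.5 | 7 | 8.5 |
| Death Over Bowling | 7 | 6 | 7 | 6.5 |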

After completing the calculation using the formula, the rank correlation between the ratings given by the two experts is 0.712 for Chennai Super Kings and 0.411 for Delhi Capitals. These values indicate that the two senior analysts hold moderately similar views on the qualitative aspects of both teams. It was also observed that Chennai Super Kings was rated higher than Delhi Capitals by the two experts for most of the qualitative factors for the year 2021.

Based on the overall analysis, it can be predicted that Chennai Super Kings has the highest chance of winning the IPL 2021, having a high probability of winning more than 8 matches along with high ratings in their qualitative aspects-fielding, fitness, death-over bowling etc.

Conclusion

In accordance with the aim, the quantitative factors that have played a role in winning matches in the last three seasons have been obtained. These include the total number of wickets taken per match, the bowling average of the top two opening bowlers, the average of the top five run scorers and the number of batsmen with a strike rate greater than 140 in each team. The qualifiers and finalists of the IPL 2021 have been predicted using the binomial distribution. The study predicts that three teams, Mumbai Indians, Chennai Super Kings, and Delhi Capitals, have the highest probability (roughly 89% or more) of making it to the qualifiers. Delhi Capitals and Chennai Super Kings are the two predicted finalists. Using the Chi-square test and Rank Correlation, a qualitative factor analysis was carried out. This gave the result that the ratings of IPL teams are dependent on qualitative factors. The Rank Correlation showed a moderate positive correlation between the ratings given to both teams by the two experts. Since Chennai Super Kings was rated higher than Delhi Capitals by the experts on most qualitative aspects, they have been predicted as the winners of the IPL 2021.

It must be noted that predicting cricket has numerous limitations. Probability and data analysis can only make predictions about the game. The fact that nothing in cricket can be fully predicted until the match is over is what makes it so exciting.

References

https://www.mathsjournal.com/pdf/2021/vol6issue4/PartA/6-4-1-292.pdf

Recently, I watched the film “21”, a story about MIT math students who “count cards” to improve their probability of winning the card game Blackjack at casinos. The film has many mathematics-related themes, which helped make the cinematic experience engaging and fast-paced. One of the early scenes alludes to the Monty Hall problem, a probability-based game, which piqued my interest and prompted me to research the problem further. In this article, I explain how the Monty Hall problem works.

The Monty Hall problem is a brain teaser, in the form of a probability puzzle, loosely based on the American television game show Let’s Make a Deal and named after its original host, Monty Hall.

So here is the problem: let’s say that there are three doors, and behind one of them is a car, while behind the other two are goats. If you choose the door with the car behind it, you win the car. Now, say you choose Door 1. The host Monty Hall then opens either Door 2 or Door 3, behind which is a goat. It is important to note that he knows what is behind each door, and never opens the door with the car behind it. Monty now gives you a choice: do you want to stick with Door 1, or switch to the other door? What should you do? Does it matter?

The answer is yes, and one must always choose to SWITCH. Before I get into the explanation, let us talk about what people usually decide to do.

To find out, I posed the question in a poll with a sample of 110 people (none of whom had heard of the problem), aged 13 to 50. The results are shown below.

Looking at the pie chart, it can be observed that more people actually decide to stay with their initial choice. This could be because it is a human tendency to be suspicious of being offered other options, and people in general choose not to change out of paranoia or fear. Most obviously, there is the illusion that both options give a 1/2 chance of getting the car, so switching seems to make no difference. However, by probability, this is not the correct answer.

The Solution

So, initially all we know is that behind 1 door is a car and behind the other 2 are goats. This gives us a probability of 1/3 for each door having a car. Now say we choose Door 1 for example. The probability of that door having a car is 1/3. If we pause this situation and see, the other two doors (Doors 2 and 3) have a combined probability of 2/3 (1/3+1/3).

Moving on, Monty Hall will now open one of the other doors which he knows has a goat behind it. Let us say he opens Door 2 and shows us a goat. The probability of Door 2 having a car becomes 0, and since the probabilities of Door 2 and Door 3 added up to 2/3, Door 3 now has a 2/3 chance of having the car. Door 1 still has a probability of only 1/3, which does not change, and that is why you are twice as likely to get the car by switching (2/3 versus 1/3).

Now you may say that eliminating Door 2 leaves only two doors, each with a probability of 1/2.

To see why this is not the case, let’s look at the exact same problem with 100 doors instead of 3. You pick a random door. The probability that you have got the car now is extremely low: 1/100.

Now, instead of one door, Monty eliminates 98 doors. These are doors that he knows do not have the prize. This leaves two doors: the one you picked, and the one that was left after Monty eliminated the others. When you first picked, you only had a 1/100 chance of getting the right door. It was just a random guess. Now you’re being presented with a filtered choice, curated by Monty Hall himself. It should be clear that your odds are much better if you switch: the other door hides the car 99 times out of 100.
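If the argument still feels counter-intuitive, a quick simulation can help. The Python sketch below is illustrative only (the function names, the fixed random seed and the number of trials are arbitrary choices): it plays the game many times with any number of doors, with Monty opening every door except your pick and one other, and never revealing the car.

```python
import random


def play(num_doors, switch, rng):
    """Play one game; return True if the player ends up with the car."""
    car = rng.randrange(num_doors)
    pick = rng.randrange(num_doors)
    # Monty leaves exactly one other door closed and never opens the car's door.
    if pick == car:
        remaining = rng.choice([d for d in range(num_doors) if d != pick])
    else:
        remaining = car
    final_choice = remaining if switch else pick
    return final_choice == car


def win_rate(num_doors, switch, trials=100_000, seed=0):
    """Estimated probability of winning over many simulated games."""
    rng = random.Random(seed)
    return sum(play(num_doors, switch, rng) for _ in range(trials)) / trials


for doors in (3, 100):
    print(doors, "doors: stay", win_rate(doors, switch=False),
          "switch", win_rate(doors, switch=True))
```

With 3 doors the estimated win rates should land near 1/3 for staying and 2/3 for switching; with 100 doors they should be near 1/100 and 99/100.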

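Another way to see the solution is to explicitly list out all the possible outcomes, and count how often you get the car if you stay versus switch. Without loss of generality, suppose your selection was door 1. Then the possible outcomes can be seen in this table:

| Car is behind | Monty opens | Result if you stay | Result if you switch |
|---|---|---|---|
| Door 1 | Door 2 or Door 3 | Win | Lose |
| Door 2 | Door 3 | Lose | Win |
| Door 3 | Door 2 | Lose | Win |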

In two out of three cases, you win the car by changing your selection after one of the doors is revealed. This is because there is a greater probability that you choose a door with a goat behind it in the first go, and then Monty is guaranteed to reveal that one of the other doors has a goat behind it. Hence, by changing your option, you double your probability of winning.

Using Bayes’ Theorem to explain the Monty Hall Problem

A few months back, I completed a course on Data Science Math Skills by Duke University, where I learnt about conditional probability and Bayes’ Theorem. The Monty Hall problem is a classic problem that can be explained through Bayes’ Theorem.

Let’s assume that you pick door 1 and then Monty shows you the goat behind door 2.

In order to use Bayes’ Theorem, we first need to assign an event to A and B.

  • Let event A be that the car is behind door number 1.
  • Let event B be that Monty opens up door 2 to show the goat.

Bayes’ Theorem Solution:

$$ \Pr(A \mid B) = \frac{\Pr(B \mid A)\,\Pr(A)}{\Pr(B)} $$

Pr(A) is easy to figure out. There is a 1/3 chance that the car is behind door 1. If the car is behind door 1, Monty has two doors left to open, and each has a 1/2 chance of being chosen, which gives us Pr(B|A) = 1/2, the probability of event B given A.
Pr(B), in the denominator, is a little trickier to figure out. Consider that:

  • You choose door 1. Monty shows you a goat behind door 2.
  • If the car is behind door 1, Monty will not choose it. He’ll open door 2 and show a goat 1/2 of the time.
  • If the car is behind door 2, Monty will always open door 3, as he never reveals the car.
  • If the car is behind door 3, Monty will open door 2, 100% of the time.

Adding up these three cases, Pr(B) = (1/3)(1/2) + (1/3)(0) + (1/3)(1) = 1/2, so Pr(A|B) = (1/2 × 1/3) / (1/2) = 1/3. As Monty has opened door 2, you know the car is either behind door 1 (your choice) or door 3. The probability of the car being behind door 1 is 1/3. This means that the probability of the car being behind door 3 is 1 – (1/3) = 2/3. This is why it is wiser to switch.
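The Bayes’ Theorem calculation can also be carried out with exact fractions in Python. This is a small sketch of the reasoning above, assuming you picked door 1 and Monty opened door 2; the variable names are chosen for illustration.

```python
from fractions import Fraction

# Prior: the car is equally likely to be behind each door.
prior = {1: Fraction(1, 3), 2: Fraction(1, 3), 3: Fraction(1, 3)}

# Likelihood: probability that Monty opens door 2, given where the car is
# (you picked door 1, and Monty never reveals the car).
likelihood = {1: Fraction(1, 2), 2: Fraction(0), 3: Fraction(1)}

# Pr(B): total probability that Monty opens door 2.
evidence = sum(prior[door] * likelihood[door] for door in prior)

# Bayes' Theorem: Pr(car behind door | Monty opened door 2).
posterior = {door: prior[door] * likelihood[door] / evidence for door in prior}

print(evidence)      # 1/2
print(posterior[1])  # 1/3  (stay)
print(posterior[3])  # 2/3  (switch)
```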

The Monty Hall problem is truly a fascinating problem that tends to play tricks with our brain. The important catch in the problem is that Monty knows what is behind the doors and only ever opens doors with a goat behind them. If he were not aware of what lies behind the doors and simply happened to reveal a goat, switching would give you no advantage. If you are still not convinced, click on the link and try the simulation for yourself! http://www.math.ucsd.edu/~crypto/Monty/monty.html
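The role of Monty's knowledge can be made concrete by enumerating every case exactly. The Python sketch below is an illustration under stated assumptions: you always pick the first door, an uninformed host opens one of the other two doors at random, and games in which he accidentally reveals the car are discarded. The function name and the `host_knows` flag are hypothetical.

```python
from fractions import Fraction


def switch_win_probability(host_knows):
    """Exact probability that switching wins, given that the opened door shows a goat."""
    pick = 0
    win = Fraction(0)
    total = Fraction(0)
    for car in range(3):
        others = [door for door in range(3) if door != pick]
        # An informed host only opens goat doors; an uninformed host opens either.
        options = [door for door in others if door != car] if host_knows else others
        for opened in options:
            weight = Fraction(1, 3) * Fraction(1, len(options))
            if opened == car:
                continue  # car revealed by accident: this game is discarded
            total += weight
            remaining = next(d for d in range(3) if d not in (pick, opened))
            if remaining == car:
                win += weight
    return win / total


print(switch_win_probability(host_knows=True))   # 2/3
print(switch_win_probability(host_knows=False))  # 1/2
```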

References

https://brilliant.org/wiki/monty-hall-problem/

https://www.statisticshowto.com/probability-and-statistics/monty-hall-problem/

https://towardsdatascience.com/monty-hall-problem-solution-using-bayes-theorem-cb1d6fbc0c9e