Intro to Logistic Regression: Predicting NFL Game Win Probabilities

How much does the QB matter?

The quarterback is the most important position in football; no single player has a more consistent impact on the outcome of a game. This short read shows an application of logistic regression: using a QB's passer rating, scraped from footballreference.com, to predict the probability that the QB's team won the corresponding game.

Logistic regression passes a linear combination of the inputs through a sigmoid function to produce a single output in the form of a probability. Although multiple variables (or features) can be used, I am using a single input variable (QB rating) to predict the probability of a win or loss. The data consists of the career game logs for every quarterback who started a game in the 2019 season, filtered to only the games that QB started. The end result is a data set of roughly 2,700 games.
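The sigmoid itself is just a squashing function: it maps any real-valued score to a probability between 0 and 1. The post's model is fit in R, so this is only a minimal Python illustration of the function:

```python
import math

def sigmoid(z):
    """Map a real-valued score z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# A score of 0 sits exactly at 50/50; large positive scores
# approach 1 and large negative scores approach 0.
print(sigmoid(0))    # 0.5
print(sigmoid(4))    # ~0.982
print(sigmoid(-4))   # ~0.018
```

In the one-variable model below, the score fed into the sigmoid is simply intercept + coefficient × rating.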

The distribution of the games shows what we might expect: the higher the QB rating, the more likely the team is to win!

In fact, when a QB posts a rating of 100-149, their team wins about 80 percent of the time, and at 150 or above the team never lost!

Rating    Loss Percentage   Tie Percentage   Win Percentage
0-49      0.758             0.000            0.242
50-99     0.557             0.005            0.438
100-149   0.194             0.002            0.804
150+      0.000             0.000            1.000

I split the data into training and testing sets. Normally I would split at random, but since I really want to see how the model performs on today's game, I made the most recent season the test set. You could also perform cross-validation, but that is for another read.
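A season-based holdout can be sketched like this; the column names ("Season", "Rate", "WL") are hypothetical stand-ins for whatever the scraped data actually uses:

```python
import pandas as pd

# Toy stand-in for the scraped game logs. The idea: hold out the most
# recent season as the test set instead of splitting at random.
games = pd.DataFrame({
    "Season": [2017, 2017, 2018, 2018, 2019, 2019],
    "Rate":   [85.1, 110.3, 72.4, 98.0, 120.5, 60.2],
    "WL":     [1, 1, 0, 1, 1, 0],   # 1 = win, 0 = loss
})

test_season = games["Season"].max()
qb_train = games[games["Season"] < test_season]
qb_test = games[games["Season"] == test_season]

print(len(qb_train), len(qb_test))  # 4 2
```

The trade-off versus a random split: the test set is a realistic "future" the model has never seen, at the cost of the test games sharing one season's conditions.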

The model results showed not only that QB rating is significant, but also how much the chance of winning rises with each additional rating point. Using the coefficient for rating, the log-odds of winning increase by about 0.04 for every one-point increase in rating; equivalently, the odds of winning are multiplied by exp(0.04) ≈ 1.04. (Note this is a change in log-odds, not a flat 0.04 jump in probability; the probability change per point depends on where you sit on the sigmoid curve.)

Below the summary is the mean predicted win probability for QB ratings grouped to match the table shown earlier. Notice that a rating above 150 no longer gives a perfect 100 percent like the raw data did. This is good, since we can never know the outcome for certain. The 100-149 group averages a win probability of 95 percent, while 50-99 averages 47 percent.


Call:
glm(formula = WL ~ Rate, family = "binomial", data = qb_train)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.3240  -1.0159   0.5135   0.8976   2.5919  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -3.323630   0.194022  -17.13   <2e-16 ***
Rate         0.040207   0.002092   19.22   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 3113.1  on 2292  degrees of freedom
Residual deviance: 2600.4  on 2291  degrees of freedom
AIC: 2604.4

Number of Fisher Scoring iterations: 4
Rating      Winning_pct
Below 50    0.12
50-99       0.47
100-149     0.95
Above 150   0.95
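To make the coefficient interpretation concrete, here is a sketch that plugs the fitted coefficients from the glm summary above (intercept ≈ -3.3236, Rate ≈ 0.0402) into the sigmoid. Notice the per-point probability change is far smaller than 0.04 and shrinks toward the tails of the curve:

```python
import math

# Fitted coefficients taken from the glm summary above.
INTERCEPT = -3.323630
RATE_COEF = 0.040207

def win_prob(rating):
    """Predicted win probability for a given QB passer rating."""
    z = INTERCEPT + RATE_COEF * rating
    return 1.0 / (1.0 + math.exp(-z))

# Each rating point adds 0.0402 to the log-odds, but the resulting
# probability change depends on where you are on the curve:
print(round(win_prob(100) - win_prob(99), 4))   # ~0.009 near the middle
print(round(win_prob(150) - win_prob(149), 4))  # ~0.002 out in the tail
```

As a sanity check, a rating of 143.2 gives win_prob ≈ 0.919, which matches the first Mahomes row in the prediction table at the end of the post.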

How do we know what predicted probability is high enough to call a game a win? The default threshold for logistic regression is 0.5, but here I tested thresholds on validation sets, stepping from 0.1 up to 0.9 in increments of 0.1. The graph below showed that 0.6 was the best cutoff in terms of accuracy.
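The threshold search can be sketched as a simple scan; this is an illustrative version, not the post's actual code, and the toy probabilities and labels are made up:

```python
import numpy as np

def accuracy_by_threshold(probs, actual, thresholds):
    """For each cutoff, classify prob > cutoff as a predicted win (1)
    and report the resulting accuracy against the actual outcomes."""
    probs = np.asarray(probs)
    actual = np.asarray(actual)
    return {float(t): float(np.mean((probs > t).astype(int) == actual))
            for t in thresholds}

# Toy example; the post's real search used validation sets.
probs  = [0.92, 0.88, 0.48, 0.63, 0.30, 0.75, 0.55, 0.40]
actual = [1,    1,    1,    0,    0,    1,    0,    0]
for t, acc in accuracy_by_threshold(probs, actual, np.arange(0.1, 1.0, 0.1)).items():
    print(f"threshold {t:.1f}: accuracy {acc:.2f}")
```

Picking the cutoff that maximizes validation accuracy is what leads to the 0.6 threshold used below.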

The final two tables show other metrics commonly used to evaluate models. This particular model was 69 percent accurate, which is decent but could be improved. Precision, recall, and F1 are all reasonable and will be discussed in more detail another day. The final table has the predicted probability of winning in the first column, the QB in the second, then the passer rating in the game, a W or L for the actual result, and finally a W or L predicted by the logistic model.

Metric      Scores
Accuracy    0.69
Precision   0.76
Recall      0.64
F1          0.69
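These scores come from the standard confusion-matrix formulas, treating a win as the positive class. A sketch (the four-row example is just the first few Mahomes games from the table below):

```python
def classification_scores(actual, predicted):
    """Accuracy, precision, recall, and F1 with 'W' (win) as the
    positive class."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == "W" and p == "W")
    fp = sum(1 for a, p in zip(actual, predicted) if a == "L" and p == "W")
    fn = sum(1 for a, p in zip(actual, predicted) if a == "W" and p == "L")
    tn = sum(1 for a, p in zip(actual, predicted) if a == "L" and p == "L")

    accuracy = (tp + tn) / len(actual)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1

# First four Mahomes rows: all actual wins, one predicted as a loss.
actual    = ["W", "W", "W", "W"]
predicted = ["W", "W", "W", "L"]
print(classification_scores(actual, predicted))  # (0.75, 1.0, 0.75, ~0.857)
```

As a sanity check on the table above: F1 = 2 × 0.76 × 0.64 / (0.76 + 0.64) ≈ 0.69, consistent with the reported precision and recall.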
WinProb QB Rate ActualWL PredWL
0.9194 Patrick Mahomes 143.2 W W
0.8756 Patrick Mahomes 131.2 W W
0.8791 Patrick Mahomes 132.0 W W
0.4833 Patrick Mahomes 81.0 W L
0.5918 Patrick Mahomes 91.9 L L
0.6356 Patrick Mahomes 96.5 L W
0.8500 Patrick Mahomes 125.8 W W
0.8129 Patrick Mahomes 119.2 L W
0.4012 Patrick Mahomes 72.7 W L
0.4913 Patrick Mahomes 81.8 W L
0.5094 Patrick Mahomes 83.6 W L
0.7906 Patrick Mahomes 115.7 W W
0.7656 Patrick Mahomes 112.1 W W
0.4843 Patrick Mahomes 81.1 W L
0.4753 Jimmy Garoppolo 80.2 W L
0.8747 Jimmy Garoppolo 131.0 W W
0.4974 Jimmy Garoppolo 82.4 W L
0.7386 Jimmy Garoppolo 108.5 W W
0.4803 Jimmy Garoppolo 80.7 W L
0.2851 Jimmy Garoppolo 59.8 W L
0.7590 Jimmy Garoppolo 111.2 W W
0.8985 Jimmy Garoppolo 136.9 W W
0.3403 Jimmy Garoppolo 66.2 L L
0.7886 Jimmy Garoppolo 115.4 W W
0.9268 Jimmy Garoppolo 145.8 W W
0.7516 Jimmy Garoppolo 110.2 L W
0.8778 Jimmy Garoppolo 131.7 W W
0.5762 Jimmy Garoppolo 90.3 L L
0.3868 Jimmy Garoppolo 71.2 W L
0.8098 Jimmy Garoppolo 118.7 W W

Hope you enjoyed this brief intro to applying logistic regression! If you want to learn more or see the code, email me!