Intro to Logistic Regression: Predicting NFL Game Win Probabilities
How much does the QB matter?

The quarterback is the most important position in football; no single player has a more consistent impact on the game. This short read shows an application of logistic regression, using QB ratings scraped from footballreference.com to predict the probability that the QB's team won the game given the rating he posted in it.
Logistic regression uses a sigmoid function to produce a single output in the form of a probability. Although multiple variables (or features) can be used, I am using a single input variable, QB rating, to predict the probability of a win or a loss. The data consists of the career game logs for every starting quarterback from the 2019 season, filtered to only the games that the QB started. The end result was a data set of roughly 2,700 games.
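For the single-feature model used here, the fitted linear score lives on the log-odds scale, and the sigmoid maps it back into a probability:

$$P(\text{win} \mid \text{Rate}) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 \cdot \text{Rate})}}$$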
The distribution of the games shows what we might expect: the higher the QB rating, the more likely the team is to win! In fact, when a QB posts a rating of 100-149 his team wins about 80 percent of the time, and at a rating of 150 or above the team never lost!
QB Rating | Loss Percentage | Tie Percentage | Win Percentage |
---|---|---|---|
0-49 | 0.758 | 0.000 | 0.242 |
50-99 | 0.557 | 0.005 | 0.438 |
100-149 | 0.194 | 0.002 | 0.804 |
150+ | 0.000 | 0.000 | 1.000 |
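A minimal sketch of how a table like this can be computed. I'm assuming the scraped game logs sit in a data frame `qb_games` with a passer rating column `Rate` and an outcome column `WL` taking values "W", "L", or "T"; the data frame name and outcome coding are my assumptions.

```r
library(dplyr)

# Bucket each game by the QB's rating in that game, then compute the
# share of wins, losses, and ties within each bucket.
qb_games %>%
  mutate(rating_group = cut(Rate, breaks = c(0, 50, 100, 150, Inf),
                            labels = c("0-49", "50-99", "100-149", "150+"),
                            right = FALSE)) %>%
  count(rating_group, WL) %>%
  group_by(rating_group) %>%
  mutate(pct = n / sum(n))
```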
I split the data into training and testing sets. Normally I would split at random, but since I really want to see how the model performs on today's game, I made the most recent season the test set. You could also perform cross-validation, but that is for another read.
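The split itself is simple. Something like the following, assuming a `Season` column and that 2019 is the season just completed:

```r
# Hold out the most recent season as the test set; train on everything prior.
qb_train <- qb_games[qb_games$Season <  2019, ]
qb_test  <- qb_games[qb_games$Season == 2019, ]
```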
The model results showed not only that QB rating is significant, but also how much the chance of winning rises with each additional rating point. The coefficient on Rate tells us that the log-odds of winning increase by about 0.04 for every rating value increase of 1.
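Because the coefficient lives on the log-odds scale, exponentiating it gives the multiplicative effect on the odds of winning:

$$e^{0.0402} \approx 1.041$$

so each additional rating point raises a team's odds of winning by roughly 4 percent.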
Below the summary is the mean predicted win probability for QB ratings in the same groups shown earlier. Notice that a rating above 150 no longer gives a perfect 100 percent like the raw data did; this is good, as we can never know an outcome for certain. The 100-149 group averages a win probability of 95 percent, while 50-99 averages 47 percent.
```
Call:
glm(formula = WL ~ Rate, family = "binomial", data = qb_train)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.3240  -1.0159   0.5135   0.8976   2.5919

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.323630   0.194022  -17.13   <2e-16 ***
Rate         0.040207   0.002092   19.22   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 3113.1  on 2292  degrees of freedom
Residual deviance: 2600.4  on 2291  degrees of freedom
AIC: 2604.4

Number of Fisher Scoring iterations: 4
```
QB Rating | Winning_pct |
---|---|
Below 50 | 0.12 |
50-99 | 0.47 |
100-149 | 0.95 |
Above 150 | 0.95 |
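For completeness, here is a minimal sketch of how the summary and the grouped means above can be reproduced. The formula and data names come straight from the summary; `qb_fit`, the `WinProb` column, and the bucketing are my own naming.

```r
# Fit the model from the summary above. WL must be binary for a binomial
# glm (e.g., 1 = win, 0 = loss), so ties would need to be dropped or recoded.
qb_fit <- glm(WL ~ Rate, family = "binomial", data = qb_train)
summary(qb_fit)

# Score the held-out season, then average the predicted win probability
# by rating group to reproduce the table above.
qb_test$WinProb <- predict(qb_fit, newdata = qb_test, type = "response")
tapply(qb_test$WinProb,
       cut(qb_test$Rate, breaks = c(0, 50, 100, 150, Inf)),
       mean)
```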
How do we know what probability is reasonable for calling a game a win given the QB rating? The default threshold for logistic regression is 0.5, but here I tested thresholds using validation sets, stepping from a probability greater than 0.1 up to greater than 0.9 in increments of 0.1. The graph below told me that 0.6 was the best threshold in terms of accuracy.
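A sketch of that sweep, continuing from the predictions above; `actual_win` is an assumed 0/1 win indicator for the held-out games.

```r
# Try cutoffs from 0.1 to 0.9 in steps of 0.1 and record accuracy at each.
actual_win <- as.integer(qb_test$WL == "W")
thresholds <- seq(0.1, 0.9, by = 0.1)
accuracy <- sapply(thresholds, function(t) {
  mean(as.integer(qb_test$WinProb > t) == actual_win)
})
plot(thresholds, accuracy, type = "b",
     xlab = "Threshold", ylab = "Accuracy")
```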
The final two tables show other metrics commonly used to evaluate models. This particular model was 69 percent accurate, which is pretty decent but could be improved. Precision, recall, and F1 are all decent as well; they will be discussed in more detail another day, but a sketch of how they are computed follows the first table. The final table has the model's probability of winning in the first column, the QB in the second, the passer rating in that game, a W or L for the actual result, and finally a W or L predicted by the logistic model.
Metric | Score |
---|---|
Accuracy | 0.69 |
Precision | 0.76 |
Recall | 0.64 |
F1 | 0.69 |
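Here is a minimal sketch of how those metrics fall out of a confusion matrix at the chosen 0.6 cutoff, reusing the `qb_test$WinProb` and `actual_win` objects assumed above.

```r
# Classify each held-out game as a predicted win at the 0.6 cutoff.
pred <- as.integer(qb_test$WinProb > 0.6)

# Confusion-matrix counts (1 = win).
tp <- sum(pred == 1 & actual_win == 1)  # predicted win, actual win
fp <- sum(pred == 1 & actual_win == 0)  # predicted win, actual loss
fn <- sum(pred == 0 & actual_win == 1)  # predicted loss, actual win

accuracy  <- mean(pred == actual_win)
precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f1        <- 2 * precision * recall / (precision + recall)
```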
WinProb | QB | Rate | ActualWL | PredWL |
---|---|---|---|---|
0.9194 | Patrick Mahomes | 143.2 | W | W |
0.8756 | Patrick Mahomes | 131.2 | W | W |
0.8791 | Patrick Mahomes | 132.0 | W | W |
0.4833 | Patrick Mahomes | 81.0 | W | L |
0.5918 | Patrick Mahomes | 91.9 | L | L |
0.6356 | Patrick Mahomes | 96.5 | L | W |
0.8500 | Patrick Mahomes | 125.8 | W | W |
0.8129 | Patrick Mahomes | 119.2 | L | W |
0.4012 | Patrick Mahomes | 72.7 | W | L |
0.4913 | Patrick Mahomes | 81.8 | W | L |
0.5094 | Patrick Mahomes | 83.6 | W | L |
0.7906 | Patrick Mahomes | 115.7 | W | W |
0.7656 | Patrick Mahomes | 112.1 | W | W |
0.4843 | Patrick Mahomes | 81.1 | W | L |
0.4753 | Jimmy Garoppolo | 80.2 | W | L |
0.8747 | Jimmy Garoppolo | 131.0 | W | W |
0.4974 | Jimmy Garoppolo | 82.4 | W | L |
0.7386 | Jimmy Garoppolo | 108.5 | W | W |
0.4803 | Jimmy Garoppolo | 80.7 | W | L |
0.2851 | Jimmy Garoppolo | 59.8 | W | L |
0.7590 | Jimmy Garoppolo | 111.2 | W | W |
0.8985 | Jimmy Garoppolo | 136.9 | W | W |
0.3403 | Jimmy Garoppolo | 66.2 | L | L |
0.7886 | Jimmy Garoppolo | 115.4 | W | W |
0.9268 | Jimmy Garoppolo | 145.8 | W | W |
0.7516 | Jimmy Garoppolo | 110.2 | L | W |
0.8778 | Jimmy Garoppolo | 131.7 | W | W |
0.5762 | Jimmy Garoppolo | 90.3 | L | L |
0.3868 | Jimmy Garoppolo | 71.2 | W | L |
0.8098 | Jimmy Garoppolo | 118.7 | W | W |
Hope you enjoyed this brief intro to an application of logistic regression! If you want to learn more or see the full code, email me!