Predicting Wins in the NBA
Can we accurately predict wins before an NBA schedule release?

** updates will be made again after the 2021 draft and free agency
There are currently many season and game prediction models for NBA games, but because teams rest star players, most of them rely on schedules and match-ups. FiveThirtyEight’s model, for example, predicts how many minutes a player will play in a game against a certain opponent. By predicting how many minutes a player will play, you can look at the impact of that player on a given game. However, free agency and the draft happen before the schedule is released. Here I look at building a model that predicts wins and playoff probability (with and without a play-in) before the schedule release, to mimic how a GM would build a team.
Before starting, let’s make some assumptions clear. I am predicting for a standard 82-game season: no bubble, every team available to play, and so on; in other words, a standard year. Changes could be made, such as a 72-game season, if needed. The prediction is for the period after the NBA draft and early free agency, to give an idea of how a team might perform given its players and no schedule. Simply adjusting the roster manually can show how a team might perform before free agency. A short explanation for a coach would be that the model predicts the probability that a team wins any given game based on the team’s player make-up, then simulates 100,000 seasons and counts how often each team makes the playoffs. Dividing that count by 100,000 gives the probability that a given team makes the playoffs.
For the more in-depth explanation: I came up with what I am calling the team genetics model. The idea is based on the way a GM might think about assembling a team. In general, the free agency period comes before the schedule release, so a GM has to think about assembling the best team regardless of the schedule. The data collection process started by scraping each team’s beginning-of-season roster and joining each player’s basic and advanced per-game data from the previous season. Then the top ten players by minutes played per game (roughly a ten-man rotation) were aggregated at the team level. However, the aggregation wasn’t simply a sum over all the data; it was structured the way a GM might build a team. For example, I created a column for the top scorer’s PPG, the second top scorer’s PPG, and even the third, giving three columns to gauge the level of superstars on a team. Below is a quick list of the variables on a per-game basis and the thought behind them:
- Max PPG: Top superstar scorer
- 2nd Max PPG: How far back is the second superstar?
- 3rd Max PPG: Is there a third star?
- Sum PPG: Ten rotation players PPG
- FT% Avg: How good is the team at shooting FTs?
- Sum TOV: How many times is the team turning the ball over a game?
- Max TRB: Does the team have a dominant rebounder?
- Sum TRB: Ten rotation players TRB
- Max AST: How many ASTs is the main distributor achieving?
- Sum AST: Is there good ball movement or more of an Iso team?
- Max BLK: Does the team have a rim protector?
- Sum BLK: Ten rotation players BLK
- Sum PF: Does the team foul more than others?
- Average eFG: Distinguishes shooting ability
- Max 3PAr: Does the team have a three-point specialist?
- Average 3PAr: How often does the team shoot threes?
- Average FTr: How often does the team get to the line?
- Average TOV Percentage: Helps with pacing compared to total TOVs
- Average USG: Iso team or balanced?
- Average OBPM: Is the team generally outscoring the opponent offensively?
- Max DBPM: Is there a defensive stopper?
- Average DBPM: General measure of defensive ability
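To make the aggregation idea concrete, here is a minimal sketch of how a handful of these team-level columns could be built from player rows. It is not the code used for this post: the function name `team_genetics` and the dictionary keys (`mpg`, `ppg`, `trb`, `dbpm`) are hypothetical, and only a subset of the variables is shown.

```python
from statistics import mean


def team_genetics(players, rotation_size=10):
    """Aggregate one team's player rows into a few team-level features.

    `players` is a list of dicts with previous-season per-game stats.
    The keys used here (mpg, ppg, trb, dbpm) are hypothetical names.
    Assumes the roster contains at least one player.
    """
    # Keep the top `rotation_size` players by minutes per game.
    rotation = sorted(players, key=lambda p: p["mpg"], reverse=True)[:rotation_size]

    # Rank scorers so the first, second and third stars get their own columns.
    ppg_ranked = sorted((p["ppg"] for p in rotation), reverse=True)
    # Pad with zeros if the rotation has fewer than three players.
    ppg_ranked += [0.0] * (3 - len(ppg_ranked))

    return {
        "max_ppg": ppg_ranked[0],
        "second_max_ppg": ppg_ranked[1],
        "third_max_ppg": ppg_ranked[2],
        "sum_ppg": sum(p["ppg"] for p in rotation),
        "max_trb": max(p["trb"] for p in rotation),
        "sum_trb": sum(p["trb"] for p in rotation),
        "max_dbpm": max(p["dbpm"] for p in rotation),
        "avg_dbpm": mean(p["dbpm"] for p in rotation),
    }
```

The remaining columns in the list would follow the same pattern: a max, a sum, or an average over the rotation.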
This process was completed for ten seasons (2010-2020). The final training data set contained each team’s ten-man rotation, with the players’ aggregated data from the previous season (which I am calling the team’s genetics) and the win total from that season. The exception was the 2020 season, where the roster from before the season started and the aggregated 2019 player data were used but not the win total, because that was the value to be predicted. The data set was randomly split into training and development sets with an 80-20 split, and a random forest that forced at least 20 predictors in a tree was fit. The final model had a mean absolute error (MAE) of 8.7%, which is about 7 games over an 82-game season (0.087 × 82 ≈ 7.1), and not too bad.
I then ran 100,000 simulated seasons, with each team playing 82 Bernoulli trials whose chance of success equaled the random forest’s predicted win probability for that team. The simulations recorded the playoff results for each season based on wins (splitting by East and West, of course). Tie-breakers were decided at random, as there weren’t technically any head-to-head match-ups. The final predicted number of wins was a team’s average number of wins across the 100,000 simulations, while the playoff probability was the number of times the team was the 8 seed or higher divided by 100,000.
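The simulation step can be sketched in a few lines of standard-library Python. This is an illustration under assumptions rather than the original code: the function name `simulate_playoff_odds` and its parameters are made up for the example, it handles a single conference, and the default number of seasons is kept small because pure Python is slow (the post used 100,000).

```python
import random


def simulate_playoff_odds(win_prob, n_seasons=1000, games=82, playoff_spots=8, seed=0):
    """Estimate average wins and playoff probability for one conference.

    `win_prob` maps team name -> per-game win probability. In the post this
    comes from the random forest; here it is simply an input.
    """
    rng = random.Random(seed)
    teams = list(win_prob)
    total_wins = {t: 0 for t in teams}
    playoff_count = {t: 0 for t in teams}

    for _season in range(n_seasons):
        # Each season is `games` independent Bernoulli trials per team.
        wins = {
            t: sum(rng.random() < win_prob[t] for _game in range(games))
            for t in teams
        }
        # Sort by wins; the random second key breaks ties at random.
        standings = sorted(teams, key=lambda t: (wins[t], rng.random()), reverse=True)
        for t in teams:
            total_wins[t] += wins[t]
        for t in standings[:playoff_spots]:
            playoff_count[t] += 1

    avg_wins = {t: total_wins[t] / n_seasons for t in teams}
    playoff_prob = {t: playoff_count[t] / n_seasons for t in teams}
    return avg_wins, playoff_prob
```

Calling it once per conference with that conference's predicted win probabilities gives both the projected wins and the no-play-in playoff probability.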
Lastly, I needed to calculate the difference when using a play-in tournament, where the 8 and 9 seeds are within two games of each other. The play-in tournament is similar to a two-game playoff series in which the 8 seed is assumed to have home court for both games. Thus, I decided to estimate each seed’s chance of advancing by sampling playoff series where the opponents differed by two games in the regular season. In total, there have been 60 such playoff series since 2000. The better seed was decided based on the new rules, where the superior record earns the better seed, not the old rules with divisional titles. Of the 60 series, only 8 (about 13%) had the worse seed winning the first two games. In the other 52 (about 87%), the better seed won at least one of the first two games, although in this play-in scenario only 16 of the 60 (about 27%) would even have gone to a second game. This allowed for the following equations for the probabilities with a play-in:
P(Playoff given No Play-in) = (number of times being a top 8 seed) / (number of simulated seasons)
P(Playoff given Play-in) = P(top 7 seed) + P(8 seed) × P(Win Play-in as 8 seed) + P(9 seed) × P(Win Play-in as 9 seed)
P(Playoff given Play-in) = P(top 7 seed) + P(8 seed) × (52/60) + P(9 seed) × (8/60)
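As a quick illustration, the play-in formula translates directly into a small function. The function and argument names are hypothetical, and the seed probabilities in the usage note are made-up inputs, not model output; the 52/60 and 8/60 defaults are the historical rates described above.

```python
def playoff_prob_with_play_in(p_top7, p_8_seed, p_9_seed,
                              p_8_wins=52 / 60, p_9_wins=8 / 60):
    """Apply the play-in formula as written above.

    p_top7, p_8_seed and p_9_seed are the simulated probabilities of
    finishing as a top 7 seed, the 8 seed and the 9 seed.
    """
    return p_top7 + p_8_seed * p_8_wins + p_9_seed * p_9_wins
```

For example, a team that finishes in the top 7 half the time, 8th 20% of the time, and 9th 10% of the time would come out at roughly 0.687 with the play-in.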
Below is the table output for my final win projections and probabilities.
team_name | conference | predicted_wins | playoff_probability | playoff_playin_probability
---|---|---|---|---
Milwaukee Bucks | East | 55 | 0.99 | 0.99
Indiana Pacers | East | 53 | 0.99 | 0.99
Philadelphia 76ers | East | 48 | 0.96 | 0.95
Toronto Raptors | East | 47 | 0.93 | 0.93
Brooklyn Nets | East | 45 | 0.88 | 0.87
Boston Celtics | East | 44 | 0.80 | 0.79
Miami Heat | East | 43 | 0.76 | 0.76
Orlando Magic | East | 43 | 0.77 | 0.77
Chicago Bulls | East | 40 | 0.47 | 0.48
Atlanta Hawks | East | 36 | 0.20 | 0.20
New York Knicks | East | 34 | 0.10 | 0.10
Washington Wizards | East | 34 | 0.09 | 0.09
Detroit Pistons | East | 32 | 0.04 | 0.04
Cleveland Cavaliers | East | 30 | 0.02 | 0.02
Charlotte Hornets | East | 25 | 0.00 | 0.00
Denver Nuggets | West | 54 | 0.99 | 0.99
Los Angeles Lakers | West | 53 | 0.99 | 0.99
Utah Jazz | West | 51 | 0.99 | 0.98
Houston Rockets | West | 49 | 0.97 | 0.97
Los Angeles Clippers | West | 49 | 0.96 | 0.96
Dallas Mavericks | West | 45 | 0.85 | 0.85
Memphis Grizzlies | West | 42 | 0.58 | 0.58
Oklahoma City Thunder | West | 42 | 0.63 | 0.63
Sacramento Kings | West | 40 | 0.44 | 0.44
Portland Trail Blazers | West | 37 | 0.17 | 0.18
San Antonio Spurs | West | 37 | 0.18 | 0.18
Minnesota Timberwolves | West | 35 | 0.11 | 0.11
New Orleans Pelicans | West | 35 | 0.08 | 0.08
Phoenix Suns | West | 32 | 0.03 | 0.03
Golden State Warriors | West | 30 | 0.01 | 0.01