Predicting NBA Games
In this post I compare how well different machine learning algorithms predict the outcomes of NBA games. The post is inspired by the paper Exploiting sports-betting market using machine learning by Hubáček, Šourek, and Železný ([HSZ]), in which the authors use logistic regression and neural network models to predict the outcomes of basketball games and then devise a betting strategy based on their models. In later posts I will explore possible betting strategies using the models from this post.
The models used in this post are logistic regression, support vector machine, nearest neighbors, and a multilayer perceptron neural network. The dataset is game statistics from the 13 NBA seasons 2007-08 through 2019-20. Each model is run on player-only statistics, team-only statistics, and a combination of player and team statistics. The predictions from each group of statistics (player, team, and player and team) are then compared. As a baseline, naive predictions are made using the winning percentages and average points scored of the home and away teams. Las Vegas betting odds from the same time period are also analyzed, and the predictions made by the different models are compared to those implied by the betting odds.
The data and code for the project can be found on my GitHub page here.
The Data
NBA Games Data
The NBA games data was found on kaggle.com, here. Thanks to Nathan Lauga for posting it. The data contains team and player statistics from every NBA preseason, regular season, and postseason game from the 2004-05 season through February of the 2019-20 season.
The file, games.csv, has one row for each game, with columns giving game identifiers, home team statistics, and away team statistics. For example, here are the game identifiers and home team statistics from the first five rows:
GAME_ID | DATE | SEASON | HOME_TEAM_ID | PTS_home | FG_PCT_home |
---|---|---|---|---|---|
21900895 | 2020-03-01 | 2019 | 1610612766 | 85 | 0.354 |
21900896 | 2020-03-01 | 2019 | 1610612750 | 91 | 0.364 |
21900897 | 2020-03-01 | 2019 | 1610612746 | 136 | 0.592 |
21900898 | 2020-03-01 | 2019 | 1610612743 | 133 | 0.566 |
21900899 | 2020-03-01 | 2019 | 1610612758 | 106 | 0.407 |
FT_PCT_home | FG3_PCT_home | AST_home | REB_home | HOME_TEAM_WINS |
---|---|---|---|---|
0.9 | 0.229 | 22 | 47 | 0 |
0.4 | 0.31 | 19 | 57 | 0 |
0.805 | 0.542 | 25 | 37 | 1 |
0.7 | 0.5 | 38 | 41 | 1 |
0.885 | 0.257 | 18 | 51 | 1 |
The HOME_TEAM_WINS column, which says whether or not the home team won the game, is the column to be predicted.
The file, games_details.csv, has the player statistics. It has one row for each (game, player) pair, and has columns giving (game, player) identifiers, and player statistics for that game. An example of some of the (game, player) identifiers and player statistics follows:
GAME_ID | TEAM | PLAYER_ID | MIN | FGM | FGA | PTS | REB | AST |
---|---|---|---|---|---|---|---|---|
21900895 | MIL | 202083 | 27:08 | 3 | 11 | 8 | 8 | 2 |
21900895 | MIL | 203507 | 34:55 | 17 | 28 | 41 | 20 | 6 |
21900895 | MIL | 201572 | 26:25 | 4 | 11 | 16 | 7 | 0 |
21900895 | MIL | 1628978 | 27:35 | 1 | 5 | 2 | 7 | 5 |
21900895 | MIL | 202339 | 22:17 | 2 | 8 | 4 | 1 | 2 |
To predict outcomes of games, this data was aggregated to contain teams' and players' cumulative statistics from all previous games in each season. The new cumulative data has one row for each game, with the average and total stats for each team and player going into the game. The script regular_season_stats_running_and_totals.py creates the cumulative statistics from games.csv and games_details.csv. The files teams_running.csv and players_running.csv contain the cumulative statistics as well as the statistics in games.csv and games_details.csv. A slice of teams_running.csv with some of the home team statistics follows:
GAME_ID | HOME | HOME_WINS | HOME_LOSSES | HOME_WIN_PCT |
---|---|---|---|---|
21900895 | CHA | 21 | 38 | 0.355932 |
21900896 | MIN | 17 | 41 | 0.293103 |
21900897 | LAC | 40 | 19 | 0.677966 |
21900898 | DEN | 40 | 19 | 0.677966 |
21900899 | SAC | 25 | 34 | 0.423729 |
HOME_PTS_AVE | HOME_AST_AVE | HOME_REB_AVE |
---|---|---|
102.237 | 23.6949 | 43.0339 |
113.224 | 23.7414 | 44.931 |
115.881 | 24.0678 | 48.1186 |
110.525 | 26.3729 | 44.4915 |
108.339 | 23.3729 | 42.2542 |
Here is a slice of players_running.csv with some of the player statistics going into the game:
GAME_ID | TEAM | PLAYER_ID | MIN_AVE | PTS_AVE | REB_AVE | AST_AVE |
---|---|---|---|---|---|---|
21900895 | MIL | 202083 | 24.5375 | 7.58929 | 2.44643 | 1.5 |
21900895 | MIL | 203507 | 30.817 | 29.717 | 13.6792 | 5.81132 |
21900895 | MIL | 201572 | 26.5512 | 10.6964 | 4.41071 | 1.625 |
21900895 | MIL | 1628978 | 22.7954 | 9.05556 | 4.81481 | 2.16667 |
21900895 | MIL | 202339 | 27.248 | 15.6667 | 4.82353 | 5.47059 |
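The running averages in the two slices above are "going into the game" values, so each row must be computed before the game it describes is added to the totals. A minimal sketch of that idea follows. It is not the code in regular_season_stats_running_and_totals.py; the dictionary keys and the toy team AAA with its scores are made up for illustration.

```python
from collections import defaultdict


def running_averages(games):
    """Return one row per (game, team) with the team's scoring average BEFORE that game.

    `games` is a list of dicts with hypothetical keys 'season', 'game_id',
    'team', and 'pts', already sorted by date.
    """
    totals = defaultdict(lambda: {"games": 0, "pts": 0})
    rows = []
    for g in games:
        t = totals[(g["season"], g["team"])]  # statistics restart every season
        # Record the average going into the game, then add this game's points.
        pts_ave = t["pts"] / t["games"] if t["games"] else None
        rows.append({"game_id": g["game_id"], "team": g["team"],
                     "games_played": t["games"], "pts_ave": pts_ave})
        t["games"] += 1
        t["pts"] += g["pts"]
    return rows


# Toy data: three games for one made-up team.
sample = [
    {"season": 2019, "game_id": 1, "team": "AAA", "pts": 100},
    {"season": 2019, "game_id": 2, "team": "AAA", "pts": 110},
    {"season": 2019, "game_id": 3, "team": "AAA", "pts": 90},
]
rows = running_averages(sample)
for row in rows:
    print(row)
```

The first game of a season has no prior games, so its average is left as None in this sketch. Rows like that are one reason the earliest games of each season are dropped during pre-processing.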
The files teams_totals.csv and players_totals.csv contain the end of the year statistics for each team and player. Some random rows from these datasets were cross-checked with the statistics at basketball-reference.com to ensure that no mistakes were made when calculating the cumulative statistics. To my knowledge the cumulative statistics of a team or player from a specific date during a season are not publicly available, so the cumulative statistics could not be checked directly.
In order to easily use all of these statistics to predict games, teams_running.csv and players_running.csv were merged into one dataset that contains for each game the cumulative team and player statistics. The player columns were numbered from player 1 to player 16, and ordered by minutes played in the game the row represents. The script that merges teams_running.csv and players_running.csv is merge_player_team.py, which outputs the merged dataset as full_stats_running.csv.
Betting Odds Data
The betting odds data was found at the website sportsbookreviewsonline.com, here. It contains the betting odds for all regular and postseason games from the 2007-08 season through March of the 2019-20 season. For each game, the odds data contains the closing over/under, spread, and money line odds. The data is organized as one csv file for each season. The 2007-08 season csv is titled nba_odds_2007-08.csv with the other seasons titled similarly. The first four games in the 2007-08 season are recorded as follows:
Date | VH | Team | Final | Close | ML |
---|---|---|---|---|---|
1030 | V | Portland | 97 | 189.5 | 900 |
1030 | H | SanAntonio | 106 | 13 | -1400 |
1030 | V | Utah | 117 | 212 | 100 |
1030 | H | GoldenState | 96 | 1 | -120 |
1030 | V | Houston | 95 | 5 | -230 |
1030 | H | LALakers | 93 | 199 | 190 |
1031 | V | Philadelphia | 97 | 191 | 255 |
1031 | H | Toronto | 106 | 6.5 | -305 |
Every two rows correspond to one game, with the away team in the first row and the home team in the second. The Close column holds the over/under in one row and the spread in the other: the over/under is the larger number, and the spread appears in the row of the favored team. The ML column has the money line for each of the teams.
The odds data was organized into a new csv with one row for each game, and with the date and team names labeled as in full_stats_running.csv. In the new csv, the odds are translated into predictions for the score of the game and the probability of each team winning. The script that creates the new csv is clean_odds_data.py. It outputs one file for each season. The file for the 2007-08 season is titled odds_2007.csv, with the others titled similarly. The new files look as follows (these are the rows that correspond to the same four games as above):
DATE | HOME | AWAY | HOME_PTS | AWAY_PTS | HOME_WINS |
---|---|---|---|---|---|
2007-10-30 | SAS | POR | 106 | 97 | 1 |
2007-10-30 | GSW | UTA | 96 | 117 | 0 |
2007-10-30 | LAL | HOU | 93 | 95 | 0 |
2007-10-31 | TOR | PHI | 106 | 97 | 1 |
PRED_HOME_PTS | PRED_AWAY_PTS | PRED_HOME_WINS | HOME_WIN_PROB |
---|---|---|---|
101.25 | 88.25 | 1 | 0.933333 |
106.5 | 105.5 | 1 | 0.545455 |
97 | 102 | 0 | 0.344828 |
98.75 | 92.25 | 1 | 0.753086 |
See this post for an explanation of how these betting odds work, and how to translate them into predictions for the score and probability of each team winning.
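As a rough illustration of the translation, the sketch below turns one game's pair of raw rows into predicted scores and a home win probability. It is not the code in clean_odds_data.py, and the function names are my own. The formulas do reproduce the four sample rows above: the predicted scores split the over/under according to the spread, and the win probability is the raw probability implied by the home team's money line, with no correction for the bookmaker's margin.

```python
def implied_probability(moneyline):
    """Raw win probability implied by an American money line."""
    if moneyline < 0:
        return -moneyline / (-moneyline + 100)
    return 100 / (moneyline + 100)


def predictions_from_odds(away_close, home_close, home_ml):
    """Turn one game's Close values and home money line into predictions."""
    over_under = max(away_close, home_close)  # the larger Close value is the over/under
    spread = min(away_close, home_close)      # the smaller one is the spread
    home_favored = home_close < away_close    # the spread sits in the favorite's row
    margin = spread if home_favored else -spread  # predicted home minus away score
    return {
        "PRED_HOME_PTS": (over_under + margin) / 2,
        "PRED_AWAY_PTS": (over_under - margin) / 2,
        "HOME_WIN_PROB": implied_probability(home_ml),
    }


# Portland at San Antonio and Houston at LA Lakers, from the raw table above.
first = predictions_from_odds(away_close=189.5, home_close=13, home_ml=-1400)
third = predictions_from_odds(away_close=5, home_close=199, home_ml=190)
print(first)
print(third)
```

The sketch assumes every Close entry is numeric. Entries for games with no favorite would need separate handling.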
How good is Vegas at predicting NBA games?
The script odds_analysis.py does some analysis on how well Vegas predicts outcomes of the games during this time period. In the data there are 15,211 games. Of these games, Vegas correctly predicts the winner 68.25% of the time.
The average score (home or away) error is 8.41 points, with 66.62% of the score predictions being within 10 points. In 308 cases either the home or away team’s score is predicted exactly, and twice both the home and away teams’ scores in the same game are predicted exactly. The average spread error is 6.98 points, with 35.26% of the spread predictions within 3 points. The average over/under prediction error is 13.83 points, with 45.65% of the over/under predictions within 10 points.
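To make these error measures concrete, here is a small sketch that computes them on the four sample games from the cleaned odds table. I am assuming each "error" is a mean absolute difference between the predicted and actual value. This is my reading of the numbers, not something taken from odds_analysis.py.

```python
# (HOME_PTS, AWAY_PTS, PRED_HOME_PTS, PRED_AWAY_PTS) for the four sample games
games = [
    (106, 97, 101.25, 88.25),
    (96, 117, 106.5, 105.5),
    (93, 95, 97, 102),
    (106, 97, 98.75, 92.25),
]


def mean(values):
    values = list(values)
    return sum(values) / len(values)


# Fraction of games where the predicted winner matches the actual winner.
winner_accuracy = mean((h > a) == (ph > pa) for h, a, ph, pa in games)

# Absolute error of every individual score prediction (home and away).
score_error = mean(abs(actual - pred)
                   for h, a, ph, pa in games
                   for actual, pred in ((h, ph), (a, pa)))

# Absolute error of the predicted margin (spread) and predicted total (over/under).
spread_error = mean(abs((h - a) - (ph - pa)) for h, a, ph, pa in games)
total_error = mean(abs((h + a) - (ph + pa)) for h, a, ph, pa in games)

print(winner_accuracy, score_error, spread_error, total_error)
```

On these four games the sketch gives a winner accuracy of 0.75, a score error of 7.3125, a spread error of 7.875, and an over/under error of 9.375. Four games is far too small a sample to compare with the full-data figures.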
Models and Predictions
Pre-processing
Various pre-processing was done to the data above before it was input to the machine learning algorithms.
The first 10 games of each season for each team were taken out of the data for small sample size reasons. The cut off of 10 games seems like a reasonable number because it eliminates games with minimal cumulative statistics without eliminating a substantial portion of the total games.
To make the betting odds data and NBA games data match, the years before 2007 were taken out of the NBA games data, and any preseason and postseason games were removed from the NBA games data and betting odds data.
For each game and each team, only the top 9 players in order of minutes played were kept. This way only the more relevant players’ statistics are considered. One may disagree with the convention of selecting players a priori based on their minutes played in the very game one wants to predict. The reason for this convention is to recreate the information a potential bettor has at the beginning of a game about the players that will play in the game. Cumulative statistics do not capture when a player misses a game due to injury or suspension. On the other hand, someone betting on the game, as well as Las Vegas when setting the closing odds, knows who is active for the game. Knowing the active players for a game, a basketball fan could predict with good accuracy which 9 players would play the most minutes.
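The player selection can be sketched as follows, using the five MIL players from the slices shown earlier (minutes from games_details.csv, averages from players_running.csv) and keeping the top 3 instead of the top 9 to keep the example short. The function names and the PLAYER_1_PTS_AVE style of column naming are assumptions of mine, not necessarily what pre_processing.py or merge_player_team.py use.

```python
def minutes(min_string):
    """Convert an 'MM:SS' string such as '27:08' to minutes as a float."""
    m, s = min_string.split(":")
    return int(m) + int(s) / 60


def top_player_features(players, n_keep=9, stats=("PTS_AVE", "REB_AVE", "AST_AVE")):
    """Keep the n_keep players with the most minutes and flatten their stats."""
    ranked = sorted(players, key=lambda p: minutes(p["MIN"]), reverse=True)[:n_keep]
    return {f"PLAYER_{slot}_{stat}": p[stat]
            for slot, p in enumerate(ranked, start=1)
            for stat in stats}


mil = [
    {"PLAYER_ID": 202083, "MIN": "27:08", "PTS_AVE": 7.58929, "REB_AVE": 2.44643, "AST_AVE": 1.5},
    {"PLAYER_ID": 203507, "MIN": "34:55", "PTS_AVE": 29.717, "REB_AVE": 13.6792, "AST_AVE": 5.81132},
    {"PLAYER_ID": 201572, "MIN": "26:25", "PTS_AVE": 10.6964, "REB_AVE": 4.41071, "AST_AVE": 1.625},
    {"PLAYER_ID": 1628978, "MIN": "27:35", "PTS_AVE": 9.05556, "REB_AVE": 4.81481, "AST_AVE": 2.16667},
    {"PLAYER_ID": 202339, "MIN": "22:17", "PTS_AVE": 15.6667, "REB_AVE": 4.82353, "AST_AVE": 5.47059},
]
features = top_player_features(mil, n_keep=3)
print(features)
```

Ordering the slots by minutes gives every game the same column layout, so that "player 1" always means the player with the most minutes in that game. The sketch assumes every player has a valid minutes entry; rows for players who did not play would need to be filtered out first.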
The script pre_processing.py does all of the above to full_stats_running.csv, and then creates input and output dataframes for the machine learning models. The script creates one dataframe for player statistics, one for team statistics, and one for player and team statistics. It also creates the training, validation, and test splits, which are discussed in the following section.
Finally, the support vector machine and multi-layer perceptron models have the extra pre-processing step of scaling the features so that they are all on a common scale. The data was scaled using StandardScaler from sklearn.preprocessing to make the columns have mean 0 and variance 1.
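For readers unfamiliar with standardization, the sketch below writes out the arithmetic in plain Python rather than calling the library: subtract each column's mean and divide by its population standard deviation. The helper names and the toy feature values are made up. Fitting on the training rows only and reusing those statistics for the validation rows is common practice, and I am assuming it here rather than describing what pre_processing.py does.

```python
from statistics import fmean, pstdev


def fit_scaler(rows):
    """Compute per-column mean and population standard deviation."""
    columns = list(zip(*rows))
    means = [fmean(c) for c in columns]
    # A constant column has standard deviation 0; divide by 1 instead.
    stds = [pstdev(c) or 1.0 for c in columns]
    return means, stds


def transform(rows, means, stds):
    """Standardize each row with previously fitted column statistics."""
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]


# Toy features: (points per game, field goal percentage)
train = [[100.0, 0.40], [110.0, 0.50], [120.0, 0.60]]
val = [[115.0, 0.45]]

means, stds = fit_scaler(train)
scaled_train = transform(train, means, stds)
scaled_val = transform(val, means, stds)  # reuse the training statistics
print(scaled_train)
print(scaled_val)
```

Without this step, a feature measured in points (around 100) would dominate one measured as a percentage (around 0.5) in any model that depends on distances or on the size of the weights.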
Training and Validation
The naive models, described in the following section, were run on all games since there is no training and validation for these models.
For the machine learning models, the 2018-19 and 2019-20 seasons were held out as a test set, and the 2007-08 through 2017-18 seasons were split 80%, 20% for training and validation respectively. In the training, validation, and test data there are 9,113 games, 2,279 games, and 1,790 games respectively.
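A sketch of this split is below. Two things in it are assumptions: that seasons are labeled by their starting year, as in the SEASON column of games.csv (so the held-out seasons are 2018 and 2019), and that the 80/20 split is a seeded random shuffle, since the post does not say whether the split is random or chronological. The games themselves are placeholders that only carry a season label.

```python
import math
import random


def split_games(games, test_seasons=(2018, 2019), val_fraction=0.2, seed=0):
    """Hold out whole seasons for testing, then split the rest into train/validation."""
    test = [g for g in games if g["season"] in test_seasons]
    rest = [g for g in games if g["season"] not in test_seasons]
    random.Random(seed).shuffle(rest)  # assumption: random rather than chronological
    n_val = math.ceil(val_fraction * len(rest))
    return rest[n_val:], rest[:n_val], test


# Placeholder games with the same counts as reported above.
games = [{"season": 2007 + i % 11} for i in range(11392)]
games += [{"season": 2018 + i % 2} for i in range(1790)]

train, val, test = split_games(games)
print(len(train), len(val), len(test))
```

With 11,392 games outside the test seasons, rounding the validation share up gives the 9,113 and 2,279 game counts reported above.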
For each of the machine learning models, the hyperparameters were tuned using the training and validation data to prevent overfitting and underfitting. Each model has its own script that does this, with obvious naming conventions. The script test.py runs the models with the chosen hyperparameters on the training, validation, and test data. For each model, the percent correct on the training, validation, and test data is recorded below.
Naive Models
The following three naive prediction models were used:
- Points per game: predict the winner of the game by the team with the higher average points per game.
- Point differential: predict the winner of the game by the team with the higher average point differential. Average point differential is average points scored per game minus average points allowed per game.
- Winning percentage: predict the winner of the game by the team with the higher winning percentage.
The script that carries out the naive models is naive_predictions.py.
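Each naive model reduces to comparing one number per team. The sketch below is not the code in naive_predictions.py. The key names and the two toy teams are made up, and giving ties to the home team is my own choice because the rule for ties is not specified here.

```python
def point_differential(team):
    """Average points scored per game minus average points allowed per game."""
    return team["PTS_AVE"] - team["PTS_ALLOWED_AVE"]


NAIVE_STATS = {
    "points_per_game": lambda team: team["PTS_AVE"],
    "point_differential": point_differential,
    "winning_percentage": lambda team: team["WIN_PCT"],
}


def naive_predictions(home, away):
    """Return {model name: 1 if the home team is predicted to win, else 0}."""
    # Assumption: ties go to the home team.
    return {name: int(stat(home) >= stat(away)) for name, stat in NAIVE_STATS.items()}


# Toy teams: the away team scores more, but the home team has the better
# point differential and winning percentage.
home = {"WIN_PCT": 0.60, "PTS_AVE": 108.0, "PTS_ALLOWED_AVE": 105.0}
away = {"WIN_PCT": 0.55, "PTS_AVE": 112.0, "PTS_ALLOWED_AVE": 111.0}

preds = naive_predictions(home, away)
print(preds)
```

The toy example shows how the three rules can disagree: a high-scoring team that also allows many points looks strong by points per game but not by point differential.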
The results of these models as well as the predictions made by the betting odds are recorded in the following table:
Prediction Method | Percent Correct |
---|---|
Vegas Odds | 68.25% |
Winning Percentage | 65.54% |
Point Differential | 65.49% |
Points Per Game | 58.94% |
Each of the naive models was also run using home and away splits for each team instead of the total average. This means for the home team, only their previous home game averages were used, and for the away team, only their previous away game averages. These are the results with the home and away splits:
Using home and away splits | Percent Correct |
---|---|
Winning Percentage | 66.12% |
Point Differential | 57.58% |
Points per Game | 55.35% |
Logistic Regression
LogisticRegression from sklearn.linear_model was used with default parameters.
Stats Group \ Percent Correct | Training Data | Validation Data | Test Data |
---|---|---|---|
Player | 70.57% | 67.40% | 65.25% |
Team | 67.68% | 67.79% | 65.64% |
Player and Team | 70.67% | 68.19% | 65.64% |
Nearest Neighbor
KNeighborsClassifier from sklearn.neighbors was used with the number of neighbors parameter n_neighbors=100.
Stats Group \ Percent Correct | Training Data | Validation Data | Test Data |
---|---|---|---|
Player | 67.92% | 66.78% | 64.30% |
Team | 68.32% | 67.49% | 65.08% |
Player and Team | 68.44% | 67.53% | 65.70% |
Support Vector Machine
LinearSVC from sklearn.svm with regularization parameter C=1 was used. Various nonlinear support vector machines were also tried on the training and validation data, all performing similarly to or worse than the linear support vector machine.
Stats Group \ Percent Correct | Training Data | Validation Data | Test Data |
---|---|---|---|
Player | 70.59% | 67.44% | 65.25% |
Team | 67.85% | 67.88% | 65.92% |
Player and Team | 70.67% | 68.32% | 65.53% |
Multi-layer Perceptron (MLP)
Model parameter details: The model has 5 dense layers with the first through fifth layer having 100, 100, 50, 25, and 10 neurons respectively. For these 5 dense layers the activation function tanh is used. The output layer has 1 neuron with activation function sigmoid, since binary classification is being done. For regularization, a dropout rate of 0.2 is used for each dense layer and an l2 regularization rate of 0.004 is used for every layer.
The model was built and run using keras with the tensorflow backend.
Stats Group \ Percent Correct | Training Data | Validation Data | Test Data |
---|---|---|---|
Player | 69.68% | 67.71% | 65.64% |
Team | 67.43% | 67.00% | 65.42% |
Player and Team | 69.60% | 66.74% | 65.57% |
Conclusions and Future Work
All the models performed about the same on the training, validation, and test data, with numbers on the test data similar to the best naive model. None of the models beat the betting odds’ predictions on the test data. In general, the player-only and player-and-team statistics groups did better than the team-only statistics group. This is not surprising because the player statistics are more detailed and numerous.
It is interesting that all the models performed similarly. This suggests that they are all finding the same patterns in the data.
The numbers here are consistent with the survey of related work on the subject in section 2 of [HSZ], as well as being consistent with the predictions made in [HSZ].
The 2018-19 and 2019-20 seasons are held out as the test set to create the following betting scenario: one builds a model using all past data available, and then applies the model to the current season (or the most recent two seasons in this case). An alternative method to try in the future is the following: create the same models using the same training and validation data as here, and then update these models (by retraining them) once every week during the test seasons, with each week's data added to the training and validation sets. This recreates what one might actually do in a betting scenario. Another alternative would be to use smaller training and validation sets, perhaps only the season or two before the test season. It is possible that changing trends in NBA game play over the 11 seasons of training and validation data make the data less amenable to predictive modeling. Restricting the training data to the season or two before the test data removes this issue. On the other hand, a smaller training and validation set brings its own difficulties in building a good predictive model.
At the outset of the project, I hoped to build a multilayer perceptron (MLP) model that would beat or at least be as good as the predictions made by the betting odds data. This clearly did not happen. In the future I hope to make a more robust MLP model by introducing more advanced statistics into the data. Another possible way to improve the MLP model is to incorporate a convolutional layer. There are many possible ways to do this, and one method is carried out in [HSZ].
Reference
[HSZ] O. Hubáček, G. Šourek, F. Železný, Exploiting sports-betting market using machine learning, International Journal of Forecasting, 35 (2019), 783-796.