In this post I compare how well different machine learning algorithms predict the outcomes of NBA games. The post is inspired by the paper Exploiting sports-betting market using machine learning by Hubáček, Šourek, and Železný ([HSZ]), in which the authors use logistic regression and neural network models to predict the outcomes of basketball games and then devise a betting strategy based on their models. In subsequent posts I will explore possible betting strategies using the models from this post.

The models used in this post are logistic regression, support vector machine, nearest neighbors, and a multilayer perceptron neural network. The dataset is game statistics from the 13 NBA seasons 2007-08 through 2019-20. Each model is run on player-only statistics, team-only statistics, and a combination of player and team statistics, and the predictions from the three statistics groups are then compared. As a baseline, naive predictions are made using the winning percentages and average points scored of the home and away teams. Las Vegas betting odds from the same time period are also analyzed, and the predictions made by the different models are compared to the betting odds.

The data and code for the project can be found on my GitHub page here.

The Data

NBA Games Data

The NBA games data was found on kaggle.com, here. Thanks to Nathan Lauga for posting it. The data contains team and player statistics from every NBA preseason, regular season, and postseason game from the 2004-05 season through February of the 2019-20 season.

The file, games.csv, has one row for each game, with columns giving game identifiers, home team statistics, and away team statistics. For example, here are the game identifiers and home team statistics from the first five rows:

GAME_ID   DATE        SEASON  HOME_TEAM_ID  PTS_home  FG_PCT_home  FT_PCT_home  FG3_PCT_home  AST_home  REB_home  HOME_TEAM_WINS
21900895  2020-03-01  2019    1610612766    85        0.354        0.9          0.229         22        47        0
21900896  2020-03-01  2019    1610612750    91        0.364        0.4          0.31          19        57        0
21900897  2020-03-01  2019    1610612746    136       0.592        0.805        0.542         25        37        1
21900898  2020-03-01  2019    1610612743    133       0.566        0.7          0.5           38        41        1
21900899  2020-03-01  2019    1610612758    106       0.407        0.885        0.257         18        51        1

The HOME_TEAM_WINS column, which indicates whether or not the home team won the game, is the target column to be predicted.

The file, games_details.csv, has the player statistics: one row for each (game, player) pair, with columns giving (game, player) identifiers and that player’s statistics for the game. An example of some of the identifiers and player statistics follows:

GAME_ID TEAM PLAYER_ID MIN FGM FGA PTS REB AST
21900895 MIL 202083 27:08 3 11 8 8 2
21900895 MIL 203507 34:55 17 28 41 20 6
21900895 MIL 201572 26:25 4 11 16 7 0
21900895 MIL 1628978 27:35 1 5 2 7 5
21900895 MIL 202339 22:17 2 8 4 1 2

To predict outcomes of games, this data was aggregated into each team’s and each player’s cumulative statistics over all previous games in the season. The new cumulative data has one row per game, with the average and total stats for each team and player going into that game. The script regular_season_stats_running_and_totals.py creates the cumulative statistics from games.csv and games_details.csv; a sketch of this kind of computation follows the tables below. The files teams_running.csv and players_running.csv contain the cumulative statistics alongside the original statistics from games.csv and games_details.csv. A slice of teams_running.csv with some of the home team statistics follows:

GAME_ID   HOME  HOME_WINS  HOME_LOSSES  HOME_WIN_PCT  HOME_PTS_AVE  HOME_AST_AVE  HOME_REB_AVE
21900895  CHA   21         38           0.355932      102.237       23.6949       43.0339
21900896  MIN   17         41           0.293103      113.224       23.7414       44.931
21900897  LAC   40         19           0.677966      115.881       24.0678       48.1186
21900898  DEN   40         19           0.677966      110.525       26.3729       44.4915
21900899  SAC   25         34           0.423729      108.339       23.3729       42.2542

Here is a slice of players_running.csv with some of the player statistics going into the game:

GAME_ID TEAM PLAYER_ID MIN_AVE PTS_AVE REB_AVE AST_AVE
21900895 MIL 202083 24.5375 7.58929 2.44643 1.5
21900895 MIL 203507 30.817 29.717 13.6792 5.81132
21900895 MIL 201572 26.5512 10.6964 4.41071 1.625
21900895 MIL 1628978 22.7954 9.05556 4.81481 2.16667
21900895 MIL 202339 27.248 15.6667 4.82353 5.47059
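
For illustration, here is a minimal pandas sketch of the kind of computation that produces these running averages. It is not the author’s script: the dataframe layout and column names (one row per (team, game) with SEASON, TEAM, DATE, PTS, AST, REB columns) are assumptions.

    import pandas as pd

    # Assumed layout: one row per (team, game), with columns
    # SEASON, TEAM, DATE, PTS, AST, REB
    games = games.sort_values("DATE")
    for col in ["PTS", "AST", "REB"]:
        # shift(1) drops the current game, so each row holds the
        # average over all *previous* games in the season
        games[col + "_AVE"] = (
            games.groupby(["SEASON", "TEAM"])[col]
                 .transform(lambda s: s.shift(1).expanding().mean())
        )

Cumulative totals going into a game can be computed the same way, with .sum() in place of .mean().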

The files teams_totals.csv and players_totals.csv contain the end-of-season statistics for each team and player. Some random rows from these datasets were cross-checked against the statistics at basketball-reference.com to ensure that no mistakes were made when calculating the cumulative statistics. To my knowledge, a team’s or player’s cumulative statistics as of a specific date during a season are not publicly available, so the cumulative statistics could not be checked directly.

In order to use all of these statistics conveniently, teams_running.csv and players_running.csv were merged into one dataset that contains, for each game, the cumulative team and player statistics. The player columns were numbered from player 1 to player 16, ordered by minutes played in the game the row represents. The script that merges teams_running.csv and players_running.csv is merge_player_team.py, which outputs the merged dataset as full_stats_running.csv.
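
Here is a rough sketch of that reshaping, again with assumed column names, and assuming the MIN column has already been converted to a numeric number of minutes:

    import pandas as pd

    players = pd.read_csv("players_running.csv")
    # Rank each team's players by minutes played in the game the row
    # represents (MIN is assumed numeric here)
    players["P_NUM"] = (
        players.groupby(["GAME_ID", "TEAM"])["MIN"]
               .rank(ascending=False, method="first")
               .astype(int)
    )
    players = players[players["P_NUM"] <= 16]
    # Pivot the numbered players into wide columns, one row per (game, team)
    wide = players.pivot_table(index=["GAME_ID", "TEAM"],
                               columns="P_NUM",
                               values=["MIN_AVE", "PTS_AVE", "REB_AVE", "AST_AVE"])
    wide.columns = [f"{stat}_PLAYER_{num}" for stat, num in wide.columns]
    wide = wide.reset_index()

The wide player table can then be merged onto teams_running.csv twice, once for the home team and once for the away team.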

Betting Odds Data

The betting odds data was found at the website sportsbookreviewsonline.com, here. It contains the betting odds for all regular season and postseason games from the 2007-08 season through March of the 2019-20 season. For each game, the odds data contains the closing over/under, spread, and money line odds. The data is organized as one csv file per season; the 2007-08 season csv is titled nba_odds_2007-08.csv, with the other seasons titled similarly. The first four games of the 2007-08 season are recorded as follows:

Date VH Team Final Close ML
1030 V Portland 97 189.5 900
1030 H SanAntonio 106 13 -1400
1030 V Utah 117 212 100
1030 H GoldenState 96 1 -120
1030 V Houston 95 5 -230
1030 H LALakers 93 199 190
1031 V Philadelphia 97 191 255
1031 H Toronto 106 6.5 -305

Each pair of rows corresponds to one game, with the away team in the first row and the home team in the second. In the Close column, the larger number of the pair is the over/under, and the spread appears in the favored team’s row. The ML column has the money line for each team.

The odds data was organized into a new csv with one row for each game, and with the date and team names labeled as in full_stats_running.csv. In the new csv, the odds are translated into predictions for the score of the game and the probability of each team winning. The script that creates the new csv is clean_odds_data.py. It outputs one file for each season. The file for the 2007-08 season is titled odds_2007.csv, with the others titled similarly. The new files look as follows (these are the rows that correspond to the same four games as above):

DATE        HOME  AWAY  HOME_PTS  AWAY_PTS  HOME_WINS  PRED_HOME_PTS  PRED_AWAY_PTS  PRED_HOME_WINS  HOME_WIN_PROB
2007-10-30  SAS   POR   106       97        1          101.25         88.25          1               0.933333
2007-10-30  GSW   UTA   96        117       0          106.5          105.5          1               0.545455
2007-10-30  LAL   HOU   93        95        0          97             102            0               0.344828
2007-10-31  TOR   PHI   106       97        1          98.75          92.25          1               0.753086

See this post for an explanation of how these betting odds work and how to translate them into predicted scores and win probabilities.
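
As a concrete illustration, here is a small Python sketch of the two conversions. This is my own summary of the standard formulas, not the author’s clean_odds_data.py:

    def moneyline_to_prob(ml):
        # Implied win probability from an American money line:
        # a negative line is the favorite, a positive line the underdog
        return -ml / (-ml + 100) if ml < 0 else 100 / (ml + 100)

    def predicted_scores(over_under, spread):
        # Split the predicted total so the favorite wins by the spread
        favorite = (over_under + spread) / 2
        underdog = (over_under - spread) / 2
        return favorite, underdog

    # First game above: San Antonio home at -1400, O/U 189.5, spread 13
    moneyline_to_prob(-1400)     # 0.9333..., matching HOME_WIN_PROB
    predicted_scores(189.5, 13)  # (101.25, 88.25), matching PRED_HOME_PTS/PRED_AWAY_PTS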

How good is Vegas at predicting NBA games?

The script odds_analysis.py analyzes how well Vegas predicted game outcomes during this time period. The data contains 15,211 games, and Vegas correctly predicts the winner in 68.25% of them.

The average error in a predicted score (home or away) is 8.41 points, with 66.62% of the score predictions within 10 points. In 308 games either the home or the away team’s score is predicted exactly, and twice both teams’ scores in the same game are predicted exactly. The average spread error is 6.98 points, with 35.26% of the spread predictions within 3 points. The average over/under error is 13.83 points, with 45.65% of the over/under predictions within 10 points.
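
These numbers are straightforward to reproduce from the cleaned odds files. A sketch (whether "within 10 points" is inclusive is my assumption):

    import pandas as pd

    # Seasons 2007-08 through 2019-20, files named as described above
    odds = pd.concat(pd.read_csv(f"odds_{year}.csv") for year in range(2007, 2020))

    winner_acc = (odds["PRED_HOME_WINS"] == odds["HOME_WINS"]).mean()
    home_err = (odds["PRED_HOME_PTS"] - odds["HOME_PTS"]).abs()
    away_err = (odds["PRED_AWAY_PTS"] - odds["AWAY_PTS"]).abs()
    score_err = pd.concat([home_err, away_err])

    print(f"winner accuracy:     {winner_acc:.2%}")
    print(f"average score error: {score_err.mean():.2f} points")
    print(f"within 10 points:    {(score_err <= 10).mean():.2%}")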

Models and Predictions

Pre-processing

Several pre-processing steps were applied to the data described above before it was fed to the machine learning algorithms.

The first 10 games of each season for each team were removed because the cumulative statistics early in a season are based on a small sample. A cutoff of 10 games seems reasonable: it eliminates games with minimal cumulative statistics without eliminating a substantial portion of the total games.

To make the betting odds data and the NBA games data match, seasons before 2007-08 were removed from the NBA games data, and preseason and postseason games were removed from both datasets.

For each game and each team, only the top 9 players by minutes played were kept, so that only the more relevant players’ statistics are considered. One may object to the convention of keeping players a priori based on their minutes played in the very game one wants to predict. The reason for this convention is to recreate the information a potential bettor has at the start of a game about which players will play. Cumulative statistics do not capture when a player misses a game due to injury or suspension; on the other hand, someone betting on the game, like Las Vegas when setting the closing odds, knows who is active. Knowing the active players, a basketball fan could predict with good accuracy which 9 players will play the most minutes.

The script pre_processing.py does all of the above to full_stats_running.csv, and then creates input and output dataframes for the machine learning models. The script creates one dataframe for player statistics, one for team statistics, and one for player and team statistics. It also creates the training, validation, and test splits, which are discussed in the following section.

Finally, the support vector machine and multi-layer perceptron models have an extra pre-processing step: the features were scaled using StandardScaler from sklearn.preprocessing so that each column has mean 0 and variance 1.
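
A minimal sketch of this step. Fitting the scaler on the training data only and reusing it for the other splits is standard practice; whether the original scripts do exactly this is my assumption:

    from sklearn.preprocessing import StandardScaler

    # Fit on the training data only, then apply the same transformation
    # to the validation and test sets to avoid leaking their statistics
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_val = scaler.transform(X_val)
    X_test = scaler.transform(X_test)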

Training and Validation

The naive models, described in the following section, were run on all games, since these models require no training or validation.

For the machine learning models, the 2018-19 and 2019-20 seasons were held out as a test set, and the 2007-08 through 2017-18 seasons were split 80%, 20% for training and validation respectively. In the training, validation, and test data there are 9,113 games, 2,279 games, and 1,790 games respectively.
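
In code, the split might look like the following sketch, where df is assumed to have a SEASON column with 2018 denoting the 2018-19 season:

    from sklearn.model_selection import train_test_split

    test = df[df["SEASON"] >= 2018]   # 2018-19 and 2019-20 seasons
    rest = df[df["SEASON"] < 2018]    # 2007-08 through 2017-18
    train, val = train_test_split(rest, test_size=0.2, random_state=0)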

For each of the machine learning models, the hyperparameters were tuned using the training and validation data to prevent overfitting and underfitting. Each model has its own script that does this, with obvious naming conventions. The script test.py runs the models with the chosen hyperparameters on the training, validation, and test data. For each model, the percent correct on the training, validation, and test data is recorded below.

Naive Models

The following three naive prediction models were used:

  • Points per game: predict the winner of the game by the team with the higher average points per game.
  • Point differential: predict the winner of the game by the team with the higher average point differential, where average point differential is average points scored per game minus average points allowed per game.
  • Winning percentage: predict the winner of the game by the team with the higher winning percentage.

The script that carries out the naive models is naive_predictions.py.
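
As an illustration, here is a sketch of the point differential rule. The column names HOME_PTS_ALLOWED_AVE and AWAY_PTS_ALLOWED_AVE are hypothetical, and breaking ties in favor of the home team is my choice:

    # Predict a home win when the home team's average point differential
    # going into the game is at least the away team's
    pred = (
        (df["HOME_PTS_AVE"] - df["HOME_PTS_ALLOWED_AVE"])
        >= (df["AWAY_PTS_AVE"] - df["AWAY_PTS_ALLOWED_AVE"])
    ).astype(int)
    accuracy = (pred == df["HOME_TEAM_WINS"]).mean()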

The results of these models as well as the predictions made by the betting odds are recorded in the following table:

Prediction Method Percent Correct
Vegas Odds 68.25%
Winning Percentage 65.54%
Point Differential 65.49%
Points Per Game 58.94%

Each of the naive models was also run using home and away splits for each team instead of the total average. This means for the home team, only their previous home game averages were used, and for the away team, only their previous away game averages. These are the results with the home and away splits:

Using home and away splits Percent Correct
Winning Percentage 66.12%
Point Differential 57.58%
Points Per Game 55.35%

Logistic Regression

LogisticRegression from sklearn.linear_model was used with default parameters.

Stats Group \ Percent Correct Training Data Validation Data Test Data
Player 70.57% 67.40% 65.25%
Team 67.68% 67.79% 65.64%
Player and Team 70.67% 68.19% 65.64%
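
For reference, a minimal version of this experiment looks as follows, where the arrays are assumed to be those produced by pre_processing.py for one of the three statistics groups:

    from sklearn.linear_model import LogisticRegression

    clf = LogisticRegression()  # default parameters, as stated above
    clf.fit(X_train, y_train)
    for name, X, y in [("training", X_train, y_train),
                       ("validation", X_val, y_val),
                       ("test", X_test, y_test)]:
        print(f"{name}: {clf.score(X, y):.2%}")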

Nearest Neighbors

KNeighborsClassifier from sklearn.neighbors was used with the number of neighbors parameter, n_neighbors=100.

Stats Group \ Percent Correct Training Data Validation Data Test Data
Player 67.92% 66.78% 64.30%
Team 68.32% 67.49% 65.08%
Player and Team 68.44% 67.53% 65.70%

Support Vector Machine

LinearSVC from sklearn.svm was used with regularization parameter C=1. Various nonlinear support vector machines were tried on the training and validation data, all performing similarly to or worse than the linear support vector machine.

Stats Group \ Percent Correct Training Data Validation Data Test Data
Player 70.59% 67.44% 65.25%
Team 67.85% 67.88% 65.92%
Player and Team 70.67% 68.32% 65.53%

Multi-layer Perceptron (MLP)

Model parameter details: The model has 5 dense layers with the first through fifth layer having 100, 100, 50, 25, and 10 neurons respectively. For these 5 dense layers the activation function tanh is used. The output layer has 1 neuron with activation function sigmoid, since binary classification is being done. For regularization, a dropout rate of 0.2 is used for each dense layer and an l2 regularization rate of 0.004 is used for every layer.

The model was built and run using Keras with the TensorFlow backend.
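
A Keras sketch consistent with the architecture described above; the optimizer, loss, and other training settings are assumptions, not taken from the post:

    from tensorflow import keras
    from tensorflow.keras import layers, regularizers

    def build_mlp(n_features):
        reg = regularizers.l2(0.004)
        model = keras.Sequential()
        model.add(keras.Input(shape=(n_features,)))
        # Five dense layers with tanh, dropout 0.2 after each,
        # and l2 regularization on every layer
        for units in (100, 100, 50, 25, 10):
            model.add(layers.Dense(units, activation="tanh",
                                   kernel_regularizer=reg))
            model.add(layers.Dropout(0.2))
        # Single sigmoid output for binary classification
        model.add(layers.Dense(1, activation="sigmoid", kernel_regularizer=reg))
        model.compile(optimizer="adam", loss="binary_crossentropy",
                      metrics=["accuracy"])
        return model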

Stats Group \ Percent Correct Training Data Validation Data Test Data
Player 69.68% 67.71% 65.64%
Team 67.43% 67.00% 65.42%
Player and Team 69.60% 66.74% 65.57%

Conclusions and Future Work

All the models performed about the same on the training, validation, and test data, with test numbers similar to the best naive model. None of the models beat the betting odds’ predictions on the test data. In general, the player-only and player-and-team statistics groups did better than the team-only statistics group. This is not surprising, because the player statistics are more detailed and numerous.

It is interesting that all the models performed similarly. This suggests that they are all finding the same patterns in the data.

The numbers here are consistent with the survey of related work on the subject in section 2 of [HSZ], as well as being consistent with the predictions made in [HSZ].

The 2018-19 and 2019-20 seasons are held out as the test set to recreate the following betting scenario: one builds a model using all available past data, and then applies it to the current season (or, in this case, the most recent two seasons). An alternative method to try in the future is to build the same models with the same training and validation data, and then retrain them once every week during the test seasons, adding each week's games to the training and validation sets. This recreates what one might actually do in a betting scenario. Another alternative is to use smaller training and validation sets, perhaps only the season or two before the test seasons. It is possible that changing trends in NBA game play over the 10-season training and validation period make older data less useful for prediction; training only on the season or two before the test data removes this issue. On the other hand, a smaller training and validation set brings its own difficulties in building a good predictive model.

At the outset of the project, I hoped to build a multilayer perceptron (MLP) model that would beat or at least be as good as the predictions made by the betting odds data. This clearly did not happen. In the future I hope to make a more robust MLP model by introducing more advanced statistics into the data. Another possible way to improve the MLP model is to incorporate a convolutional layer. There are many possible ways to do this, and one method is carried out in [HSZ].

Reference

[HSZ] O. Hubáček, G. Šourek, F. Železný, Exploiting sports-betting market using machine learning, International Journal of Forecasting, 35 (2019), 783–796.