This is the third article in a series on predicting NBA Games. Just FYI, this post will be the most technical to date, but I’m more than happy to answer any questions you may have.
Links:
Repository
Predicting NBA Games Part 1: Announcement
Predicting NBA Games Part 2: The How and the Why
Intro
In this post, we’ll overview applied linear regression. We’ll discuss the data set, model, model validation, and results. Consider this an abridged version of a stats 101 section on regression. It won’t cover every detail, but it will touch on the major points.
We will compare the model’s results to betting lines. The betting lines serve as alternative hypotheses we can compare our predictions to. I have not used the model to bet on any games, but you may if you’d like.
To see if we can beat Vegas, this project asks two questions: In a match-up between two NBA teams, which team will win and by how much? And, how does that prediction compare against a betting line? To finish, we’ll examine the 2019 NBA finals where the Raptors won the championship over the Golden State Warriors. Did the model beat Vegas? Read to the bottom to find out.
Model Outline
The regression uses a proven data set and simple model. The data set is derived from Dean Oliver’s four factors. Oliver’s four factors were among the first in basketball’s “moneyball” revolution, and they’ve been solid predictors of teams’ win totals for a season. I’ve not seen an example of the four factors used to predict individual games. So, it’s an opportunity to try a proven data set on a new problem.
The four factors are shooting, rebounding, turnovers, and free throws. Combined, they capture most outcomes of a possession in basketball. Exceptions, like when a team holds the ball at game’s end, are rare.
Each of the four factors is measured by an associated statistic:
| Factor | Statistic |
| --- | --- |
| Shooting | Effective Field Goal Percentage (eFG%) |
| Rebounding | Rebound Percentage (Offensive: ORB%, Defensive: DRB%) |
| Turnovers | Turnover Percentage (TOV%) |
| Free Throws | Free Throw Rate (FTR) |
Basketball Reference explains the factors and provides the formula for these statistics.
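For concreteness, the four statistics can be computed from raw box-score totals. This is a minimal sketch following the formulas Basketball Reference publishes; note that some sources define free throw rate as FTA/FGA rather than FT/FGA.

```python
# Four-factor statistics from raw box-score totals, following the
# formulas on Basketball Reference. FTR here uses FT/FGA; some
# sources use FTA/FGA instead.

def efg_pct(fg, fg3, fga):
    """Effective field goal %: weights made 3-pointers by 1.5."""
    return (fg + 0.5 * fg3) / fga

def tov_pct(tov, fga, fta):
    """Turnovers per 100 possession-ending plays (0.44 scales FTA to possessions)."""
    return 100 * tov / (fga + 0.44 * fta + tov)

def orb_pct(orb, opp_drb):
    """Share of available offensive rebounds the team grabbed."""
    return 100 * orb / (orb + opp_drb)

def ftr(ft, fga):
    """Free throws made per field goal attempt."""
    return ft / fga

# Example: 40 FG (10 of them threes) on 85 attempts
print(round(efg_pct(40, 10, 85), 3))  # 0.529
```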
Though we call these statistics the “Four Factors”, there are actually eight factors. The Warriors have a turnover percentage, but they also force a percentage of turnovers from their opponents. Thus, we have offensive and defensive turnover percentage factors. Shooting and free throws work the same way. Rebounding is divided into offensive and defensive rebound percentage. Each team then has eight factors which serve as variables in the model.
However, to predict a single game, we need to compare the eight factors from two teams. We then have 16 variables for each game. Yet, we are still missing a key variable: home court advantage. In the NBA, the home team wins 60% of the time, so home court should be in the model. To account for home court, I coded each team's factors as home (h) or away (a) for each game. With this inclusion, our explanatory variables, with OPP. short for opponent, are as follows:
| Home Variables | Away Variables |
| --- | --- |
| eFG% H | eFG% A |
| ORB% H | ORB% A |
| TOV% H | TOV% A |
| FTR H | FTR A |
| OPP. eFG% H | OPP. eFG% A |
| DRB% H | DRB% A |
| OPP. TOV% H | OPP. TOV% A |
| OPP. FTR H | OPP. FTR A |
We use these explanatory variables to determine the Margin of Victory (MOV) for the home team. The away team MOV is the inverse of home MOV. Finally, we have 1325 games from the 2019-2020 NBA season, so N = 1325. With the data set determined, here's our sample linear model:

Y = β₀ + β₁X₁ + β₂X₂ + ⋯ + β₁₆X₁₆ + ε

That formula means we assume the home margin of victory, Y, of each game is the linear combination of an intercept, β₀, the sixteen factors (eight per team), X₁ through X₁₆, and their coefficients, β₁ through β₁₆, plus error ε.
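As an illustration of fitting such a model, here is a minimal ordinary-least-squares sketch using numpy. The data is synthetic stand-in noise, not the real scraped games; the repository's actual fitting code may differ.

```python
import numpy as np

# Minimal OLS sketch of the model: home MOV regressed on 16 explanatory
# variables plus an intercept. The data here is synthetic -- the real
# project uses 1,325 games of four-factor statistics.
rng = np.random.default_rng(0)
n, p = 1325, 16
X = rng.normal(size=(n, p))                     # stand-in for the 16 factors
true_beta = rng.normal(size=p)
y = 2.5 + X @ true_beta + rng.normal(scale=13, size=n)  # MOV with noise

X_design = np.column_stack([np.ones(n), X])     # prepend intercept column
beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)

residuals = y - X_design @ beta_hat
print(beta_hat[0])         # estimated intercept, near the true value 2.5
print(residuals.std())     # residual spread, near the true noise sd of 13
```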
Results Overview
To evaluate the model, we follow several lines of inquiry. We overview the outputs, check the assumptions of regression, and evaluate the model’s prediction accuracy.
First, the P-value of the F-statistic is approximately 0. This indicates the model is a better estimate of MOV than the average MOV. We know our variables make for effective predictions of a team’s total wins during a season, so it would be strange if the F-statistic was not significant for individual games.
Next, the r-squared (R^2) value of the regression is 0.219 while the adjusted R^2 is 0.209. These stats are estimates of the proportion of variance in MOV captured by the regression. We can read adjusted R^2 as, "The regression captures 20.9% of the variation in home team margin of victory in the data set."
Adjusted R^2 is the preferred statistic. If we throw random variables into the regression, some will randomly help predict MOV. To account for this, adjusted R^2 penalizes the regression for having lots of variables.
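The penalty is easy to see in the formula. Using the reported R^2 of 0.219, n = 1325 games, and p = 16 explanatory variables, the standard adjustment recovers the reported value:

```python
# Adjusted R^2 penalizes the model for each additional predictor.
# With R^2 = 0.219, n = 1325 games, and p = 16 variables, the formula
# reproduces the reported adjusted value of 0.209.
def adjusted_r2(r2, n, p):
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(round(adjusted_r2(0.219, 1325, 16), 3))  # 0.209
```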
Let’s see how the variables affect our predictions. In the following table, we have summary information for each variable. The expectation column holds my expectation for the direction of each coefficient. For example, I expect home effective field goal percentage (eFG%_h) to increase the margin of victory. If it didn’t, we may need to further inspect the model. The coefficients define how a change in a variable would be expected to affect the MOV. In the final two columns, the P-value of each coefficient and its significance is shown.
| Variable | Expectation | Coefficient | P-Value | p<0.05 |
| --- | --- | --- | --- | --- |
| const | Null | -82.86 | 0.29 | ❌ |
| eFG%_h | ➕ | 182.93 | 0 | ✅ |
| TOV%_h | ➖ | -2.16 | 0 | ✅ |
| ORB%_h | ➕ | 0.77 | 0 | ✅ |
| FTR_h | ➕ | 81.74 | 0 | ✅ |
| opp_eFG%_h | ➖ | -113.22 | 0.004 | ✅ |
| opp_TOV%_h | ➕ | 1.78 | 0 | ✅ |
| DRB%_h | ➕ | 0.52 | 0.088 | ❌ |
| opp_FTR_h | ➖ | -53.73 | 0.101 | ❌ |
| eFG%_a | ➖ | -153.78 | 0 | ✅ |
| TOV%_a | ➕ | 1.20 | 0.046 | ✅ |
| ORB%_a | ➖ | -0.25 | 0.212 | ❌ |
| FTR_a | ➖ | 8.00 | 0.703 | ❌ |
| opp_eFG%_a | ➕ | 167.02 | 0 | ✅ |
| opp_TOV%_a | ➖ | -0.61 | 0.215 | ❌ |
| DRB%_a | ➖ | -0.24 | 0.433 | ❌ |
| opp_FTR_a | ➖ | -6.76 | 0.836 | ❌ |
Let’s start with the expectations. This is a sanity check to ensure each variable affects MOV how we expect it to. Fifteen out of sixteen variables meet our expectations. That’s a good sign. The exception is FTR away which is +8 versus our negative expectation. I’m not worried about this for three reasons:
- The 95% confidence interval for free throw rate is −33 to 48. The true effect of FTR could be negative as expected.
- I can create a counterfactual scenario where FTR is positive. The Warriors’ defense could be skilled at fouling teams when they would otherwise have a wide open layup. This would drop the expected points from the shot attempt and lead to a beneficial outcome from a foul. I cannot create a similar counterfactual for eFG%.
- Prior research indicates FTR is the least impactful of the four factors and the most likely to have an effect near zero.
Next, I’m inclined to say p-values don’t matter in this context, though I welcome alternate interpretations. We are not running an experiment to evaluate whether eFG% has a significant effect on MOV. Rather, we aim to predict the outcomes of a set of games. A variable’s significance does not guarantee it’s a better predictor than a less significant alternative.
So far, the model appears to be in good shape.
The Assumptions of Regression
The formula for regression makes four assumptions: Fixed X, Independence, Homoscedasticity, and Normality. To determine if a regression is the right fit for the data, we need to check each assumption.
We have no reason to expect the fixed X or independence assumptions to be violated. Thus, we’ll check them as a group. Then, we’ll handle homoscedasticity and normality. The homoscedasticity and normality assumptions play a key role in making accurate predictions, so we will explore them in more depth.
Fixed X and Independence
The fixed X assumption assumes our variables are measured accurately. If the Raptors’ eFG% is recorded as 52%, we believe it is actually 52%. We’d have to track this data manually to see if the data’s correct, so I’ll go ahead and trust Basketball Reference gets the stats right. Fixed X: ✅.
Independence assumes there’s no correlation between residuals. Independence is most important to check when the data is across time or space. Fortunately, our data is neither. But, we still evaluate the independence assumption to make sure it holds.
We’ve two methods for evaluating independence: the Durbin-Watson statistic and a residual independence plot. The Durbin-Watson statistic is a measure of correlation between consecutive residuals. It ranges from 0-4, and 2 indicates independence is met. In our model, the Durbin-Watson statistic equals 1.975. That number is close to 2, so we take that as one sign we’ve met the independence assumption.
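The statistic itself is simple to compute from the residuals. Here is a from-scratch sketch on synthetic data (independent noise, not the model's actual residuals), showing that uncorrelated residuals land near 2:

```python
import numpy as np

# The Durbin-Watson statistic from first principles: the scaled sum of
# squared differences between consecutive residuals. Independent residuals
# give a value near 2; strong positive autocorrelation pushes it toward 0,
# and strong negative autocorrelation toward 4.
def durbin_watson(resid):
    resid = np.asarray(resid, dtype=float)
    return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

rng = np.random.default_rng(1)
independent = rng.normal(size=5000)     # synthetic uncorrelated residuals
print(round(durbin_watson(independent), 2))   # close to 2

random_walk = np.cumsum(rng.normal(size=2000))  # heavily autocorrelated
print(round(durbin_watson(random_walk), 2))     # close to 0
```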
The residual independence plot visualizes the relationship between row number and residuals. Does the 121st game in a season correlate with the 122nd? We can see in the residual independence plot below that there’s no clear pattern between row numbers and residuals. That’s exactly what we’re looking for.

Both the Durbin-Watson statistic and the residual independence plot indicate we’ve met the independence assumption. Independence: ✅.
Homoscedasticity and Normality
Homoscedasticity and normality are related. Together, they make the last term in our linear model: ε ~ N(0, σ²).

Let’s break that down. ε represents our residuals. We presume that the residuals are approximately normally distributed, signified by N. In the parentheses, 0 is the mean of the normal distribution, and thus, of the residuals. The second term, σ², is the variance of the normal distribution.

It’s a property of the regression equation that the mean of the residual distribution is zero. However, the variance can be different for every dataset. The variance determines the width of the normal distribution. Typically, we discuss the width of the normal curve in terms of standard deviation rather than variance. To calculate standard deviation, take the square root of the variance. Thus: s = √σ², where s = standard deviation.
In Figure 2, we see how different standard deviations affect the width of normal curves.

In practice, normal curves are centered on the regression line. If our regression predicts the Raptors to beat the Warriors by 15, the normal curve is centered on 15. Here’s where betting lines come in. Let’s say the line is Warriors -5. That means Vegas expects the Warriors to win by 5 (here’s a more thorough explanation of betting lines). The line is 20 points lower than our prediction of Raptors +15. If we used the large standard deviation blue distribution, -20 would seem reasonable. However, if we use the skinny red distribution, -20 might surprise us.
As shown, we compare the distribution of our residuals to the betting line to estimate its likelihood. However, the comparison is only valid if the homoscedasticity and normality assumptions are met.
Whether the variance is large or small, the homoscedasticity assumption asserts that big predictions (+22), small predictions (-2), or any other prediction (-10) all have the same variance. Referencing Figure 2, the blue or red distribution may be valid, but only if the chosen distribution applies to all predictions.
Formally, the homoscedasticity assumption assumes the variance of Y is equal for all values of X. In this model, that means the variance in MOV is the same for every prediction of the model. Residuals are the difference between the true MOV, Y, and our prediction, so we look at the residuals to gauge homoscedasticity.
To evaluate homoscedasticity, we use a residual versus fitted values plot. We look for areas where our predictions are correlated with the error. For example, if every game we predicted as +5 had a real MOV of -6, the homoscedasticity assumption would be violated. The ideal residuals versus fitted plot tends to show an ellipse. And as shown in Figure 3, most of the residuals fall within an ellipse.

*The purple ellipse is shown for illustrative purposes only. It does not reflect any property of the residuals.
**In the residual distribution, the X axis for the histogram and normal curve are offset (technical issues: 😑).
Yet, some points fall outside of the ellipse. Are these problematic? The answer is subjective, but I think not. The residuals outside of the ellipse for fitted values less than zero are my only concern. Is there more variance for negative predictions? Maybe. But with just four points out of 1325, I don’t think there’s enough evidence to claim the homoscedasticity assumption has been violated. Thus, homoscedasticity: ✅.
You may have figured out the normality assumption already. But formally, normality assumes the errors are normally distributed for every X value.
We’ve already presented some evidence of normality in figure 3. The residual distribution shows the residuals appear to follow a normal distribution. If the residuals were skewed or bi-modal, we may suspect a violation of normality.
The normal QQ plot, as shown in figure 4, also tests normality. The QQ plot visualizes how closely the data follows a normal distribution. If the data is perfectly normal, it will lie on the red line.

Our dataset isn’t perfectly normal, but it’s darn close. In my estimation, the QQ and residual distribution plots are sufficient to prove normality of the residuals. Thus, normality: ✅.
We’ve validated the assumptions of regression, so we’re confident we have the right linear model for predicting MOV. With the homoscedasticity and normality assumptions, we are confident that our residuals are normal with equal variance. With these assumptions met, we can evaluate a betting line’s likelihood in comparison to our prediction.
Making Predictions
Now, we get to test how well our model predicts games. Our predictions are only valid if the games we predict have not been included in the dataset. In short, if we predict a game on March 22nd, we only use data from March 21st and prior. The system that manages this was not made at the start of the season, so we have to use the subset of games it was available for.
We have two types of predictions we want to evaluate. First, we want to see how many of the winners and losers we correctly predict. We have data for 252 of these games. And second, we want to check our performance against a betting line. We have 248 examples to check against betting lines.
The model predicted the correct winner in ~64% of games. That’s a lot better than a coin flip. But remember, if we just predicted the home team to win every game, we’d be correct ~60% of the time. Does the model just pick the home team? If we take a second look at the coefficient table, the away team variables are less significant than their home team counterparts. But we’ve not checked for multicollinearity, an important step when assessing coefficient significance. We’ll have to answer this question another day.
Our R^2 is relatively low at 0.219. That means roughly 78% of the variance in MOV is explained by factors outside the model.

In Figure 5, we see actual outcomes have a wide dispersion around our predictions as we’d expect given our low R^2. The wide dispersion comes from the residual standard deviation which is approximately 13. If we applied a method to reduce the variance (standard deviation) of the model, we could achieve more precise estimates. However, we often introduce bias into the model when we reduce variance.
We can work through an example to put standard deviation into practical terms. Let’s predict the Raptors to beat the Warriors by 13, and ask what’s the likelihood the actual result is outside of +/- one standard deviation? Since the standard deviation is 13, we have a range, 0-26, within one standard deviation of our prediction. If we calculate the probability of an outcome outside of that range, the result is ~30%. And by extension, there’s a ~15% chance a team projected to win by 13 will lose.
What about the likelihood the score is within +/- 5 of the predicted results? That’s only a ~30% chance. Any given prediction is not likely to get the score right or even be close. If we want more precise estimates, we need something else: a different dataset, more data, a different model, etc.
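These likelihoods can be computed directly from the normal distribution centered on the prediction. A minimal sketch with scipy, using the example prediction of Raptors by 13 and the residual standard deviation of 13:

```python
from scipy.stats import norm

# Worked example from the text: prediction = Raptors by 13,
# residual standard deviation = 13.
mov = norm(loc=13, scale=13)

outside_one_sd = mov.cdf(0) + mov.sf(26)   # outcome beyond +/- 1 sd of the prediction
prob_loss = mov.cdf(0)                     # actual MOV below zero: the favorite loses
within_five = mov.cdf(18) - mov.cdf(8)     # final margin within +/- 5 of the prediction

print(round(outside_one_sd, 2))  # ~0.32
print(round(prob_loss, 2))       # ~0.16
print(round(within_five, 2))     # ~0.30
```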
What about our performance against a spread, a type of betting line? We beat the spread 52.5% of the time. That’s unexpected. Across 248 games, the model performed better than Vegas!
Here’s how a win or loss versus Vegas is calculated. If a spread predicts the Raptors to win by 10 and the model predicts the Raptors to only win by 5, we bet on the Warriors. That means every outcome from Raptors by 9 to Warriors by infinity is in the model’s favor. All outcomes from Raptors by 11 to Raptors by infinity are in Vegas’s favor. A push, where the line equals the result, would occur when the Raptors win by 10. That means it’s a tie and no one wins. This happened four times, and we drop these games from the data.
Above, we calculated the likelihood that a game’s MOV would fall in a certain range. With similar methods, we can estimate the likelihood of a spread. We use either a cumulative density function (CDF) or a survival function (SF). CDF ‘s or SF’s return the probability of all outcomes below (CDF) or above (SF) a given point in the distribution.
If the spread is greater than the estimate, such as in Figure 6, we use a SF to calculate the likelihood that the MOV is in Vegas’s favor. If the spread is less than the estimate, we use a CDF to calculate the likelihood that the MOV is in Vegas’s favor.
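The rule can be captured in a few lines. This is a sketch of the logic described above, not the repository's actual implementation; the function name and the hard-coded residual standard deviation of 13 are assumptions for illustration.

```python
from scipy.stats import norm

# Sketch of the line-evaluation rule: given our predicted home MOV and
# the Vegas line (both in home-team terms), return the probability mass
# on Vegas's side of the line under the residual distribution
# N(prediction, 13^2). The sd of 13 comes from the model's residuals.
RESIDUAL_SD = 13

def prob_vegas_side(prediction, line):
    dist = norm(loc=prediction, scale=RESIDUAL_SD)
    if line > prediction:
        return dist.sf(line)    # SF: Vegas wins if the MOV lands above the line
    return dist.cdf(line)       # CDF: Vegas wins if the MOV lands below the line

# Model says home team by 15; Vegas says home team by -5 (i.e. home loses by 5):
print(round(prob_vegas_side(15, -5), 3))
```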

We use a CDF or SF for every game. It could be the case that a betting line far from the estimate will be more likely to result in a win than a line close to the estimate. If one used this model to bet on games, this would be one way to choose which games to bet on. To verify this claim, we could employ logistic regression.
In summary, for every prediction, we expect the actual result to reside in a normal curve centered on the prediction with a standard deviation of 13. Then, we can use a CDF or SF to estimate the likelihood of outcomes in favor of the model or in favor of Vegas. When predicting winners and losers, we perform fairly well. And versus the spread, the model takes the right side of the bet 52.5% of the time.
The Finals
Now that we’ve described the model, let’s look at how it performed in the finals. The following table contains the results for each game:
| Game | Home Team | Away Team | Line | Prediction | P(line) | H. MOV | Result |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | Raptors | Warriors | -1 | 0.5 | 0.45 | 9 | W |
| 2 | Raptors | Warriors | 2 | 0.6 | 0.46 | -5 | W |
| 3 | Warriors | Raptors | 5.5 | 3.2 | 0.43 | -14 | W |
| 4 | Warriors | Raptors | 4.5 | 3.0 | 0.45 | -13 | W |
| 5 | Raptors | Warriors | 3 | 0.5 | 0.42 | -1 | W |
| 6 | Warriors | Raptors | 2.5 | 2.8 | 0.49 | -4 | L |
The model beat the spread in 5 out of 6 games in the finals. That’s good! It’s also lucky. Given the stated probabilities, we’d only expect to get the first 5 games right 5% of the time. In time, the results would regress to the mean. But what is that mean?
For available games, the model beat the spread 52.5% of the time. Is that the mean? That’s a higher number than I’d expect. I’ve a few suspicions that 52.5% may be an invalid number. For one, it seems strange a simple model would beat Vegas, and all its resources, with such consistency. There’s a large array of available variables that the model doesn’t account for: back-to-back games, injuries, trades, or roster changes for a start. Vegas, I’m sure, applies this additional data in their models.
In addition, it’s possible that I predicted games with data that already incorporated the results of prior games. This has to do with database work, and I’ll discuss that in a future post. For now, the results are what they are.
Conclusion
This is the only regression I’ve run that worked straight out of the box. Usually, one of the assumptions of regression gets broken, and it requires some change in the data or the model to fix it. But I’m happy to use a simple model if it works.
The linear model provides an initial proof of concept: Can we scrape the data, make predictions, and store the results in a database? Those questions have been answered. We can now use this sandbox environment to explore other models and ask new questions.
What will future models look like? For starters, we can change the dataset. The current model uses team statistics, but overall team statistics cannot account for the presence or absence of specific players. We can build a player based model, where we build our explanatory dataset on each player’s performance, to account for this. One obvious advantage is we can account for players’ injuries.
In addition, we need to split our data into training, cross-validation, and test sets. Because of database issues, this was not possible when the linear model was built (again, to be discussed later). Our predictions serve as a test set of sorts. Each day’s games provide us between 1 and 15 tests against games not in the data set. But when the day changes, new data is scraped and the regression rerun. The next day’s games are predicted with a new model. With a true test set, we may train the model on the first 800 games of the season then check its performance on the following 200.
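A split like that is straightforward as long as the games stay in chronological order. A minimal sketch (the matrix here is a stand-in, and the 800/200 sizes come from the example above):

```python
import numpy as np

# Chronological train/test split: train on the first 800 games, test on
# the next 200. Games must stay in date order -- a random shuffle would
# leak future games into the training set.
def chronological_split(X, y, n_train=800, n_test=200):
    X_train, y_train = X[:n_train], y[:n_train]
    X_test = X[n_train:n_train + n_test]
    y_test = y[n_train:n_train + n_test]
    return X_train, y_train, X_test, y_test

X = np.arange(1325 * 16).reshape(1325, 16)   # stand-in feature matrix
y = np.zeros(1325)                           # stand-in MOV targets
X_tr, y_tr, X_te, y_te = chronological_split(X, y)
print(X_tr.shape, X_te.shape)   # (800, 16) (200, 16)
```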
Future methods will require cross-validation data. A cross validation data set is the data we use to decide between models. Once we’ve decided on a model via the cross-validation data, we test the chosen model on the test data. Here, we had one model and didn’t worry too much about if it was right or not. It just happened to be.
Those future models will be machine learning and neural network models. They will bring a new array of challenges. But for now, we’ve got the trusty steed of statistics — linear regression — ready to predict any NBA game we desire.





