Every talking head is going to be talking about how <insert team name here> will be sure to advance to the NCAA Men’s College Basketball Finals. I’m here to tell you their process for choosing that team is flawed. The most accurate predictions will be made in the sports betting markets with strong statistical reasoning. I will share with you an example of how a pro bettor might approach handicapping the two games this weekend.
I must preface the following by stating very clearly: THIS IS NOT BETTING ADVICE.
The prediction at the end of this post will not have real value. The data set used is too small, and the features selected are not nearly robust enough to produce a prediction that is more accurate than the market. My goal is only to illustrate the process. With that in mind, buckle in and let's get started!
This will be a three-phase approach:
1. Data collection. We will scrape data from sports-reference.com, a popular hub for box scores and statistics on many popular sports.
2. Exploratory analysis. We will explore the data collected and examine the relationships various features have to each other and to the ultimate result: wins and losses.
3. Prediction. We will make a prediction on both games.
A link to my GitHub with all the code for this exercise can be found here.
Data Collection
Let's get scraping! We navigate to Sports-Reference. **Note:** please be respectful of SR’s servers. Do not overload them with requests to gather all of their data, and do consider donating to help support their server costs.
We want to find the stats table with our web browser’s developer tools.
Next we will find the table headers to get a list of all the stat names we wish to collect. We will use Beautiful Soup to sort through everything. Check out the Jupyter Notebook on my GitHub for more specific details.
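As a rough illustration (not the exact code from the notebook), a header scrape with Beautiful Soup might look something like the sketch below. The URL pattern and table id are assumptions you would confirm with your dev tools.

```python
import requests
from bs4 import BeautifulSoup

# Assumed URL pattern for a team's game logs; confirm on the site itself.
url = "https://www.sports-reference.com/cbb/schools/duke/2022-gamelogs.html"
soup = BeautifulSoup(requests.get(url).text, "html.parser")

# Table id is an assumption; inspect the page with your browser's dev tools.
table = soup.find("table", {"id": "sgl-basic"})

# Sports Reference tags each header cell with a data-stat attribute,
# which gives us clean stat names to use as column labels.
stat_names = [th.get("data-stat") for th in table.find("thead").find_all("th")]

# One row per game in the table body.
games = [[cell.get_text() for cell in row.find_all(["th", "td"])]
         for row in table.find("tbody").find_all("tr")]
```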
We will grab all the box scores for the four teams still competing in the tournament: North Carolina, Duke, Villanova, and Kansas. This will give us data on how they have performed all season. This is a small subset of the data necessary to make a robust prediction. Ideally we would have data from multiple years and on every team in the conferences that the four target schools compete in: the Big East, Big 12, and ACC. Today, we will only be looking at these four teams and their box scores.
After collecting the data in the step above, it will need to be checked for inconsistencies and correct data types. A lot of data scraped from the web will arrive as ‘strings’ (text).
Dates will need to be converted to datetime.
Game location is currently categorized as <empty> for home games, ‘@’ for away games, and ‘N’ for neutral location.
The game result column also lists overtime results, which are less important for our given situation. Game result will also need to be binarized as it will be our ‘y’ or our target variable.
The remaining columns can now be converted to a numeric format. Since our data set is very small, choosing the optimal data type is less impactful. If we were planning on building a robust data set, identifying specific data types could be relevant at this step for time and space complexity.
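A minimal sketch of those cleaning steps, assuming the scraped rows have landed in a pandas DataFrame. The column names here are illustrative stand-ins for the scraped data-stat names.

```python
import pandas as pd

# Illustrative rows; the real frame comes from the scrape above.
df = pd.DataFrame({
    "date_game": ["2021-11-09", "2021-11-12", "2021-11-16"],
    "game_location": ["", "@", "N"],
    "game_result": ["W", "L", "W (OT)"],
    "pts": ["79", "67", "80"],
    "opp_pts": ["71", "72", "71"],
})

# Dates from string to datetime.
df["date_game"] = pd.to_datetime(df["date_game"])

# <empty> = home, '@' = away, 'N' = neutral.
df["game_location"] = df["game_location"].replace({"": "H", "@": "A", "N": "N"})

# Drop the overtime notation and binarize the result: W -> 1, L -> 0.
df["game_result"] = df["game_result"].str[0].map({"W": 1, "L": 0})

# Everything else becomes numeric.
stat_cols = df.columns.difference(["date_game", "game_location"])
df[stat_cols] = df[stat_cols].apply(pd.to_numeric, errors="coerce")
```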
Our data is not missing any values and is now “clean.” Time to move on to phase 2!
Exploratory Data Analysis
We have about 40 columns, but will only be looking at a subset of them today.
Time to make a heat map of the correlation matrix. A correlation matrix will allow us to view the relationships each feature has with all of the other features in our data set. Adding a heat map component makes this large mass of data easier to contextualize: we can pick out the more vibrant colors and start analyzing those relationships more closely. This is a lot to digest, so it is best to take it one row or column at a time.
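Assuming the cleaned box scores live in a DataFrame `df`, the heat map is only a couple of lines with seaborn:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Correlations between every pair of numeric features.
corr = df.select_dtypes("number").corr()

plt.figure(figsize=(14, 12))
sns.heatmap(corr, cmap="coolwarm", center=0)
plt.title("Correlation matrix of box score features")
plt.tight_layout()
plt.show()
```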
Combing through the data can take some time. I isolated a few relationships I think are worth exploring in more detail. A few obvious relationships like fg3 pct and fg pct are naturally directly correlated, so having some basic understanding of the data we are looking at is critical to selecting predictive features.
The features with potentially interesting relationships to each other or game result, in my opinion, are the following:
Prepare yourself for another wall of relational diagrams coming at you. We want our features to be reasonably normally distributed and for there to be some form of relationship to other features in our data set. Histograms will show the distribution of the data. Density plots will show how the metric behaves in a win vs a loss. Scatter plots with a trend line will hopefully give us some idea of the direction of the data. Brace yourself. A good bettor will really take their time to break this step down. Don’t rush it. This will be a critical step for feature selection. (There are other methods to accomplish this that I will not be getting into today.)
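One way to generate that wall of plots is a seaborn PairGrid, sketched below for a hypothetical subset of columns. The mapping mirrors the layout described above: histograms on the diagonal, win/loss densities in the upper triangle, scatter plots with trend lines in the lower triangle.

```python
import matplotlib.pyplot as plt
import seaborn as sns

features = ["pts", "ast", "trb", "fg_pct"]  # hypothetical subset

g = sns.PairGrid(df, vars=features, hue="game_result")
g.map_diag(sns.histplot)                          # distributions
g.map_upper(sns.kdeplot)                          # win vs. loss densities
g.map_lower(sns.regplot, scatter_kws={"s": 10})   # scatter + trend line
g.add_legend()
plt.show()
```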
The goal is to narrow in on some correlations that are going to be worth exploring in more detail. I always start by looking at the distributions. In a perfect world, they will all be normal-ish (symmetric). None of the histograms have overly heavy tails or a strong skew to the left or right, so it should be reasonably easy to find connections in the data.
Next, I like to move to the density plots on the top right side of the chart. I'm mostly looking for oblong shapes with either strong positive or negative correlations. I will list a few that stand out at a glance.
ast -- pts
opp_ast -- opp_pts
fg -- ast
opp_fg -- opp_ast
I'm omitting obvious correlations like pts and fg_pct.
Next I will look for connections where the orange (wins) and blue (losses) densities are separated or stacked in a clear way.
pts -- trb
opp_pts -- opp_trb
opp_pts -- stl
opp_pts -- trb
trb -- opp_trb
ast -- stl
opp_trb -- opp_fg_pct
Last, I will look over the scatter/trend lines in the bottom left section of the chart. Orange and blue lines with different slopes are of interest. I will skip any pairs that already appear in the lists above.
ast -- fg_pct
It should be fairly obvious that points, field goal percentage, and wins are directly related, but let's make a quick illustration to cement it.
Now on to rebounding!
Rebounds are an interesting connection for a few reasons. First, a rebound can only happen on a missed shot. This can be a little misleading: a pile of defensive rebounds means the opponent missed a lot of shots, but a high volume of opponent attempts usually means they are making plenty of shots too. Rebounds will need to be explored more. We will also want to separate offensive and defensive rebounds to help bring the picture into focus.
Fitting rebounds to a simple logistic regression model illustrates the clear relationship both offensive and defensive rebounds have to wins and losses. Offensive rebounds being the noisiest makes sense, as they are less frequent. However, if a team can capture north of 40% of its offensive rebound opportunities, it will have a good chance of winning the game.
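A sketch of that single-feature fit, with offensive rebound percentage computed from assumed column names (`orb`, `opp_drb`) in the cleaned `df`:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Share of available offensive rebounds captured (column names assumed).
df["orb_pct_calc"] = df["orb"] / (df["orb"] + df["opp_drb"])

X = df[["orb_pct_calc"]].values
y = df["game_result"].values

clf = LogisticRegression().fit(X, y)

# Estimated win probability at a 40% offensive rebound rate.
print(clf.predict_proba(np.array([[0.40]]))[0, 1])
```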
Next, let's look at how rebounding impacts points scored and shot attempts.
If we only look at total rebounds and field goals made, we don't see a clear relationship. The data is quite noisy. However, the picture comes into focus a little more if we compare rebounds to field goal attempts.
Naturally, field goals attempted will go up drastically as offensive rebounds increase. The shooting team will likely get another attempt quickly in this scenario, giving the opposing team less time to steal the ball or force a turnover. The slopes level out for defensive and total rebounds, but their trend lines still have a clear upward slope.
Now onto assists!
Assists have a strong correlation to points scored and, ultimately, winning. Intuitively, this makes sense. Great college teams tend to play good "team" ball, where teamwork outweighs individual talent. Ball movement is also assumed to lead to an increase in open shot opportunities. That is difficult to prove without x/y coordinate data on the players, but it passes the sniff test.
Steals are, in a sense, the inverse of assists. Steals often happen on poorly thrown passes (a would-be assist that failed), so the relationship between wins and assists/steals is likely real.
Assists appeared to show a connection to points. Let's see if we can drill into that a bit more by normalizing the relationship between assists and field goals. We can do this by dividing the number of assists by the number of field goals made. We cannot compare the number of assists to field goal attempts because we do not know how many missed field goals would have been assisted had the shot gone in.
If we only looked at the raw number of assists, then of course more assists would naturally mean more points, and more points means there were more opportunities for assists. Normalizing the data to be the proportion of made shots that were assisted will provide a more realistic picture of what is happening on the court. In each category, the numbers come back down to earth, but still show a clear correlation to points scored, field goal percentage, and wins.
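In pandas this normalization is a one-liner (column names assumed to follow the scraped data-stat names):

```python
# Proportion of made field goals that were assisted.
df["ast_per_fg"] = df["ast"] / df["fg"]

# How the assisted-shot rate differs between losses (0) and wins (1).
print(df.groupby("game_result")["ast_per_fg"].mean())
```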
Those familiar with basketball analytics are probably shouting at their screens: "FOUR FACTORS!!" And they would be correct. The four factors are, in fact, great measures of overall team performance and really set the foundation for more advanced basketball analytics. The fourth of the “four factors” is free throws, but I will not be going into it here.
To recap, we can see that field goal percentage is correlated to the number of points scored, and the number of points scored is correlated to wins. Nothing groundbreaking here.
Controlling the boards with a higher rebound percentage will lead to extended or additional possessions, leading to more field goal attempts. More field goal attempts means more points. More points, higher chance of winning.
Teams that pass the ball more often will have an increase in field goal percentage. Increase in field goal percentage, you guessed it, increase in wins.
Now let's check our desired metrics against each other. We will want to select metrics that are not highly positively or negatively correlated with each other, if possible.
Looking at field goals made and field goal percentage, a strong connection exists (.77). We will not want to select both of these. Field goal percentage has the highest correlation to game result, so it will likely be the best option.
If we look at our rebounding numbers, we can see there is a small negative correlation between field goal percentage and total & offensive rebounds, but almost no relationship to offensive and defensive rebound percentages. This is great! Offensive and defensive rebounding percentage are highly correlated to total rebounds and total rebound percentage (of course). Total rebound percentage is the strongest correlation we have to game result. Rebounding is clearly an important data point, and offensive and defensive rebound percentage could be the best option, given their independence from our other data points. Further testing will be necessary to decide whether the raw number is best or the percentage shows the most signal, but we know we are on to something. It is possible we see the best results from defensive rebounds and offensive rebound percentage.
Last, let's take a look at assists and their relationship to our various field goal and rebound categories. As we learned earlier, assists are going to be correlated to field goal percentage; this is unavoidable. But assist percentage is less correlated, so it will be a good option to explore further while testing models. Assist percentage, relative to raw assists, is less correlated to both rebounds and rebound percentages, but we also see a significant drop-off in correlation to game result. It will be worth testing both during the modeling stage.
It should be reasonably safe to say that field goal percentage and some variation of assists and rebounds provide signal about what happened in a game that led to a win. However, we will not know the box score of a game until after it is played. That doesn't do us any good if our goal is to predict what might happen in future games, so we will first want to check whether the past is a good predictor of the future. There are more complex ways to accomplish this with forecasting; ARIMA, SARIMA, RNN, and VAR are some letters you can Google if you are interested in learning more about forecasting methods.
I will be using some simple averages to test if we have reasonable predictors on our feature options.
Predicting
Time to make some baseline predictions for the feature options. We will only be looking at the raw numbers for our predictions, as the percentages are directly correlated with them and we cannot simply average per-game percentages together. Some games will have 80 FGA while others only 50, so shooting percentage does not carry equal weight game over game due to the field goal attempt discrepancy. We could weight them properly, but the goal is to keep things simple here.
Baseline will be a combined simple average for all our features for all teams.
Now we will check to see if the first half of the year can predict the second half with a lower root mean squared error than our baseline prediction. RMSE is calculated by subtracting the baseline prediction (our means) from the actual value; this difference is known as the “residual.” Each residual is squared, and the squared residuals are summed and multiplied by 1/n, where n is the number of observations. This is our mean squared error (MSE). To put the error term back onto the same scale as our data, we take the square root of the MSE. This is the root mean squared error, or RMSE.
The RMSE formula is listed below:
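In standard notation, with $y_i$ the actual value, $\hat{y}_i$ the baseline prediction (our mean), and $n$ the number of observations:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$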
Ideally, our residuals will be normally distributed. The data set is very small, so slight deviations from a normal distribution will be expected. Check out the Jupyter Notebook for visualizations of our residuals.
Our baseline residuals mostly look normally distributed. Offensive rebounds are a little wonky, so we might not get the best signal from them.
If our desired metric is predictive, we will have a lower RMSE in our prediction than in our baseline. The four schools have each played 37 or 38 games, so we will check to see if the first 19 games can predict the last 18 or 19 more accurately than the baseline model.
With some Python magic, the calculations are made for the second half predictions. Full code is in the notebook. Our RMSE for the second half predictions is compared to the baseline predictions.
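The gist of that Python magic, reduced to a single feature for a single team. The feature name and the per-team frame `team_df` are assumptions, and the baseline here is simplified to the season-long mean; the notebook loops over every feature and school.

```python
import numpy as np

def rmse(actual, predicted):
    return np.sqrt(np.mean((actual - predicted) ** 2))

feature = "fga"  # hypothetical feature
first_half = team_df[feature].iloc[:19]
second_half = team_df[feature].iloc[19:]

# Baseline: predict every second-half game with the season-long mean.
baseline_rmse = rmse(second_half, team_df[feature].mean())

# Candidate: predict every second-half game with the first-half mean.
half_rmse = rmse(second_half, first_half.mean())

print(f"baseline: {baseline_rmse:.2f}  first-half forecast: {half_rmse:.2f}")
```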
Given how small the sample size of our data set is, I'm mostly okay with these numbers. They're all very close to the baseline. No major outliers. They are not great predictors, but they are also not terrible. We used an overly simplistic forecasting method.
Check out the notebook for the visualization of the residuals.
Assists and total rebounds are a little shaky, but they look close enough. Again, it's a really small sample size, so I don't want to get too bogged down on this. The distributions looked normal-ish.
Now, let's do an expanding/rolling mean for all the games. In other words, we will take the mean of all the past games to predict the next game's feature. For example, for field goal attempts in game number 4, we will use the mean from games 1, 2, and 3 as the projection. The prediction for game 5 will be the mean of games 1, 2, 3, and 4. We will do this for all n games for all our features. Given how fresh this data is, it is likely to yield the best results of the 3 options attempted.
We summon some more Python magic, shift our data down and take the rolling averages.
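In pandas that looks roughly like this, again for one assumed feature on the per-team frame; the `shift(1)` keeps each game out of its own prediction.

```python
import numpy as np

# Expanding mean: each game is forecast by the mean of every game before it.
team_df["fga_forecast"] = team_df["fga"].expanding().mean().shift(1)

valid = team_df.dropna(subset=["fga_forecast"])  # game 1 has no history
err = valid["fga"] - valid["fga_forecast"]
print("expanding-mean RMSE:", np.sqrt(np.mean(err ** 2)))
```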
Again, residuals visualizations in the notebook. These distributions look better than the previous two! We're headed in the right direction.
Our rolling forecasted features have the lowest RMSE of the 3 we tested. This is ideal.
Modeling
Our data has a time component, so shuffling the data is not a viable option. These teams have a very high win percentage (they are in the Final Four, after all), so we need a training data set that includes both wins and losses, preferably a 50:50 split. In a traditional handicapping environment, this data would be gathered over multiple years and for many more than 4 teams. So we're painted into a bit of a corner here. But, we power on.
The goal is to produce a "power ranking," or a relative weight of performance, for our 4 remaining teams by using the features we have tracked to this point. In a nutshell: how has past performance in assists, field goal attempts, field goal percentage, and rebounding related to wins and losses? We will then compare the relative weights of our features for both teams in both matchups by using a logistic regression estimator.
There are better estimators and certainly a wider range of features that can and should be tested before a model can compete with the prediction that is the point spread or moneyline odds set by the casino/sports book.
The predictions that are soon to follow WILL NOT BE PROFITABLE AND SHOULD NOT BE WAGERED ON BASED ON THIS ANALYSIS.
My intention with this mini-project is to illustrate the process a professional sports bettor will go through and the data tools that are in their tool belt. To restate it for the record:
DO NOT BET ON THESE GAMES BASED ON THIS ANALYSIS.
Moving on..
The data is split in half for the reasons stated above and then normalized. It is important to scale the data after splitting it so the testing data does not influence the training data. We cannot use data from the "future" to make predictions on the future without a time machine.
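A sketch of the split-then-scale-then-fit flow, assuming the per-team forecast columns built earlier have been combined into one frame `df` (the forecast feature names are hypothetical placeholders):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

features = ["fg_pct_forecast", "trb_forecast", "ast_forecast", "fga_forecast"]

model_df = df.dropna(subset=features)  # first game has no forecast history

# Time-ordered split: first half trains, second half tests. No shuffling.
split = len(model_df) // 2
X_train, X_test = model_df[features].iloc[:split], model_df[features].iloc[split:]
y_train, y_test = model_df["game_result"].iloc[:split], model_df["game_result"].iloc[split:]

# Fit the scaler on the training half only so the "future" never leaks in.
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

model = LogisticRegression().fit(X_train_s, y_train)
print("held-out accuracy:", model.score(X_test_s, y_test))
```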
Logistic regression will return a probability. Since our model features did not include opponent data, only team-specific data, the prediction will be an estimate versus an "average" team. This probability will then be used as a weight. Think of it this way: if Team A has a 60% chance to win against an average team and Team B has a 50% chance to win against an average team, then we can compare the relative weights and come up with an adjusted win percentage for both teams. It should also be noted that our y test (wins) is significantly imbalanced (83.8% average win rate), and we do not have a large enough data set to perform bootstrapping techniques to restore balance. But we only have the data that we have, so let's make hay while the sun is shining.
We start by taking Team A's probability of victory against an average team (the number our model predicts) and dividing it by the sum of the probabilities of both teams.
The weights for all 4 teams, as predicted by our model, are as follows:
Here we have our relative weights or "power rankings". Had we started with a larger and more robust data set, these numbers would likely be much higher. But we plan to compare them to each other, so we're making an apples-to-apples comparison. Their specific value is not important in this instance.
Match Up Analysis
Duke is playing UNC.
Duke's weight is .357121. UNC's weight is 0.323860. We estimate Duke's odds of winning by dividing Duke's weight by the sum of both UNC & Duke’s weights.
Duke weight / (Duke weight + UNC weight)
.357121 / (.357121 + .323860) = .5244
The model predicts Duke will have a 52.4% chance to win against UNC. (Again, small data, do not bet based on this number!!)
We do the same for UNC and get an estimation of 47.6% or 1 - Duke's odds to win. Either way, we get to the same place.
Now on to Villanova & Kansas.
Villanova weight / (Villanova weight + Kansas weight)
.616097 / (.616097 + .486650) = .5587
The model predicts Villanova has an estimated win rate of 55.9%.
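The matchup math above boils down to a tiny helper, shown here reproducing both numbers from the model weights:

```python
def matchup_prob(weight_a, weight_b):
    """Probability that team A beats team B, given each team's model weight."""
    return weight_a / (weight_a + weight_b)

print(round(matchup_prob(0.357121, 0.323860), 4))  # Duke over UNC       -> 0.5244
print(round(matchup_prob(0.616097, 0.486650), 4))  # Villanova over Kansas -> 0.5587
```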
Market Comparison
Lastly, we compare these numbers to the market price. As of Friday morning (the day before the game), the market had the following odds for all 4 teams:
We can convert these to a percentage. I will write a blog on how to do this mathematically in the future, but for now, let's use SBR's odds converter. We will be going from "American Odds" to "Implied Probability."
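For the curious, the standard conversion the converter performs looks roughly like this; the +170 and -175 prices used below are the Villanova and Kansas lines referenced later in this post.

```python
def implied_probability(american_odds):
    """Convert American odds to the market's implied win probability."""
    if american_odds < 0:                       # favorite, e.g. -175
        return -american_odds / (-american_odds + 100)
    return 100 / (american_odds + 100)          # underdog, e.g. +170

print(round(implied_probability(170), 3))   # 0.37  (Villanova at +170)
print(round(implied_probability(-175), 3))  # 0.636 (Kansas at -175)
```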
The market's implied probability for the 4 teams are as follows:
Villanova -- 37.0%
Kansas -- 63.6%
UNC -- 37.0%
Duke -- 65.5%
Note, the percentages do not add up to 100%. This discrepancy is how the casino makes a profit. It is our job to "overcome" this number. Our models have to not only be more predictive than the market price, but they have to be more predictive than the market price + the "juice" or house "vigorish."
Let's restate our predictions:
Villanova -- 55.9%
Kansas -- 44.1%
UNC -- 47.6%
Duke -- 52.4%
To revisit expected value from my previous blog post: we take the amount won multiplied by the probability of winning and subtract the amount lost multiplied by the probability of losing.
Let's start with Villanova. $100 risked will return $170 (plus the original wager). A wager of $175 on Kansas will return $100.
(Amount to win * Odds Villanova wins) - (Amount to lose * Odds Kansas wins)
(170 * .559) - (100 * .441) = $50.93
Villanova is positive expected value!! (DON’T BET BASED ON THIS ANALYSIS, ILLUSTRATION PURPOSES ONLY)
Naturally, the Kansas side is the mirror image of Villanova's: a negative expected value wager, with the vigorish reducing the expected value even further.
Let's do the same for UNC.
(Amount to win * Odds UNC wins) - (Amount to lose * Odds Duke wins)
(170 * .476) - (100 * .524) = $28.52
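Both calculations follow the same expected value helper:

```python
def expected_value(amount_won, win_prob, amount_risked):
    """EV of a wager: winnings weighted by the win probability,
    minus the amount risked weighted by the loss probability."""
    return amount_won * win_prob - amount_risked * (1 - win_prob)

print(round(expected_value(170, 0.559, 100), 2))  # Villanova: 50.93
print(round(expected_value(170, 0.476, 100), 2))  # UNC:       28.52
```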
Again.... do not make a wager based on this number. Seriously. Don't do it!
Takeaways
Our overly simple model likes both underdogs! It likes Villanova almost twice as much as it likes UNC. The model would suggest there is more value backing Villanova than UNC, but both favorites are significantly overvalued. Professional bettors typically find themselves on the side of the underdog for many reasons that I will not get into in this piece.
It is unlikely that both underdogs will win, but a wager on both means only one has to win for us to walk away happy at the end of the day. A professional bettor does not judge success on one win or one loss, but rather on the entire body of work. Even if both underdogs lose, the bettor knows they took positive expected value positions on both games. Over a sample size of 100, 200, 1000+ wagers the professional bettor will come out profitable, assuming their predictions are better calibrated than the market's prediction.
This is not intended to be betting advice. It is purely an exercise in the processes taken to estimate relative strengths between teams.