Using the NBA API to create your own ML models and predict the best player transaction
Thanks to the Greek Freak, who recently reached the peak, I finally gave a chance to this project, which I had kept on hold for the past few months. NBA it is!
The main goal here is to present an end-to-end ML app development procedure, which involves a number of Supervised and Unsupervised algorithms, including Gaussian Mixture Models (GMM), K-Means, Principal Component Analysis (PCA), XGBoost, Random Forest, and Multinomial Logistic Regression Classifiers.
After successfully clustering Whiskey varieties to boost a Vendor’s sales, Data Corp accepted a new project: assist the Milwaukee Bucks in making the best next move during the 2020 transaction window. That is, to pre-assess the candidate players for the Shooting Guard (SG) position and buy the one who performs best. My lack of Basketball knowledge leads me to a tricky alternative:
How about requesting the NBA API, fetching player data from the past seasons’ games (e.g. assist-to-turnover ratio, assist % and so on), categorising them in a meaningful way for the General Manager (GM), and finally guiding him on whom to spend the transfer budget on?
To better communicate the outcomes, a couple of assumptions were made:
#1: We are at the end of the 2020 season (Oct). Bucks GM has prepared a list of 3 candidates for the SG position: Jrue Holiday, Danny Green, and Bogdan Bogdanovic.
#2: To accomplish the mission we have to uncover any insights from the data which may lead the Bucks to improve their attacking performance (max assists, min turnovers etc.), while preserving the rest of the stats (e.g. Weighted Field Goal %). That is, we should not simply advise the GM to buy the best passer or scorer, for this might compromise the remaining valuable statistics.
- Build the dataset; fetch the player-wise statistics per game (from now on ‘plays’).
- Perform EDA; build intuition on how to exploit the variables and draw early conclusions.
- Cluster ‘plays’ via K-Means & GMM; reveal underlying patterns and identify the most suitable cluster for the case.
- Using the now labeled dataset (clusters = labels), train a number of Multi-class Classifiers, incl. Multinomial Logistic Regression, Random Forest & XGBoost.
- Make Predictions on the candidate players’ latest ‘plays’ (2020 season) and benchmark them accordingly.
- Serve the trained models to the end-user, by building & serving an API (analysed in next post).
You can either run the notebooks for an explained workflow or the script files (.py) for an automated one.
The dataset is built in 2 steps: (a) starting from this Kaggle dataset, we query basketball.sqlite to extract the GAME_IDs for seasons 2017–2020; (b) we make requests to the NBA_api to fetch the player-wise data per game.
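Step (a) can be sketched roughly as follows. This is only an illustration under stated assumptions: the table name (Game) and the SEASON column are hypothetical, as the actual schema of basketball.sqlite is not shown here, and an in-memory database stands in for the real file so that the snippet runs offline.

```python
import sqlite3


def fetch_game_ids(conn, first_season, last_season):
    """Return the GAME_IDs of all games played between two seasons (inclusive)."""
    # NOTE: "Game" and "SEASON" are hypothetical names; inspect the real
    # schema of basketball.sqlite before reusing this query.
    query = """
        SELECT GAME_ID
        FROM Game
        WHERE SEASON BETWEEN ? AND ?
        ORDER BY GAME_ID
    """
    rows = conn.execute(query, (first_season, last_season)).fetchall()
    return [row[0] for row in rows]


# Tiny in-memory stand-in for basketball.sqlite (made-up rows).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Game (GAME_ID TEXT, SEASON INTEGER)")
conn.executemany(
    "INSERT INTO Game VALUES (?, ?)",
    [
        ("0021600001", 2016),
        ("0021700001", 2017),
        ("0021900001", 2019),
        ("0022000001", 2020),
    ],
)

game_ids = fetch_game_ids(conn, 2017, 2020)
print(game_ids)
```

Each returned GAME_ID would then be used for step (b), the per-game request to the NBA API, which is left out here because it needs network access.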
We use games from seasons 2017–2019 to train both clustering and classification models and keep 2020 data for testing purposes. Here is a sample of the dataset and an explanation of the variables:
plays_df dataset sample
[A thorough EDA is provided in the 00_EDA.ipynb]
We have to build intuition on what is really important when it comes to assessing an SG’s performance. In this context, we classify features from the most to the least important one, based on domain knowledge. This will also make it easier to take the final decision.
# classify features by domain importance (group_1 = most important)
group_1 = ["OFF_RATING", "AST_PCT", "AST_TOV", "TM_TOV_PCT", "EFG_PCT", "TS_PCT", "POSS"]
group_2 = ["MIN", "AST_RATIO", "DREB_PCT"]
group_3 = ["OREB_PCT", "REB_PCT", "USG_PCT", "PACE", "PACE_PER40", "PIE"]
group_4 = ["START_POSITION"]
group_5 = ["DEF_RATING"]
Explained — Classified Features
In brief, all features are of high quality in terms of null presence, duplicated samples, or low-variance, while their boundaries make sense (no suspicious cases of unreasonable extreme values).
However, many of them contain outliers on either side. This is to be expected, as we deal with real game plays and no one (not even the same player in different games) can always perform within a fixed performance ‘bracket’.
Concerning the crucial set of group_1 features, they are almost evenly split between left- and right-skewed. However, the dominant factor is the heavy presence of outliers beyond the upper boundary. Many players often perform well above expectations, and this is in line with our initial conclusion:
Induction #1: We have to study group_1 in depth, in a way that not only guarantees significant levels for the respective features, but also does not compromise the rest (or as few of them as possible).
With that in mind, we start with a naive approach: sort the dataset by a master feature (AST_PCT), take its upper segment (95th percentile) and evaluate the plays ‘horizontally’ (across all features).
plays_df Descriptive Stats (Population)
plays_df Descriptive Stats (95th Percentile)
The outcome is disappointing. Comparing the average features of the population with those of the 95th percentile, we see that maximising along AST_PCT makes many of the remaining features worse, thereby violating Assumption #2. Besides, we wouldn’t like to buy an SG with a great Assist ratio but poor Field Goal performance (
Therefore, it becomes clear that we cannot build the optimum SG profile based on plain exploratory techniques. Thus:
Induction #2: We have to build better intuition on the available data and use more advanced techniques, to effectively segment it and capture the underlying patterns, which may lead us to the best SG’s profile.
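The naive percentile comparison described above can be sketched as follows. The data is synthetic and only the column names come from the article; the negative link between AST_PCT and EFG_PCT is an assumption built into the demo, purely to show what such a trade-off looks like in the comparison table.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000

# Synthetic stand-in for plays_df; only the column names come from the article.
ast_pct = rng.uniform(0.0, 0.6, n)
plays_df = pd.DataFrame({
    "AST_PCT": ast_pct,
    # Demo assumption: shooting efficiency drops as AST_PCT rises.
    "EFG_PCT": 0.55 - 0.3 * ast_pct + rng.normal(0, 0.02, n),
    "TS_PCT": rng.uniform(0.4, 0.7, n),
})

# Keep only the plays at or above the 95th percentile of the master feature.
threshold = plays_df["AST_PCT"].quantile(0.95)
top_segment = plays_df[plays_df["AST_PCT"] >= threshold]

# Compare the average features 'horizontally': population vs. top segment.
comparison = pd.DataFrame({
    "population": plays_df.mean(),
    "top_5pct": top_segment.mean(),
})
comparison["delta"] = comparison["top_5pct"] - comparison["population"]
print(comparison.round(3))
```

A negative delta on a feature we care about is exactly the kind of side effect that makes single-feature sorting unsuitable here.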
Clustering picks up the torch…
[Refer to 01_clustering[kmeans_gmm].ipynb]
We begin with the popular K-Means algorithm, but first apply PCA in order to reduce the dataset’s dimensions while retaining most of the original features’ variance.
We opt for a 4-component solution, as it explains at least 80% of the population’s variance. Next, we find the optimum # of clusters (k), by using the Elbow Method and plotting the WCSS line:
The optimal # of clusters is 4 and we are ready to fit K-Means.
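The PCA, Elbow Method and K-Means steps above could look roughly like this in Scikit-Learn. It is a sketch on synthetic blobs, not the notebook's code: the data, the use of StandardScaler and the range of k are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the plays features.
X, _ = make_blobs(n_samples=600, n_features=10, centers=4, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

# Smallest number of components whose cumulative explained variance reaches 80%.
cumulative = np.cumsum(PCA().fit(X_scaled).explained_variance_ratio_)
n_components = int(np.searchsorted(cumulative, 0.80) + 1)
X_pca = PCA(n_components=n_components).fit_transform(X_scaled)

# Elbow method: WCSS (inertia) for a range of k; look for the bend in the curve.
wcss = []
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_pca)
    wcss.append(km.inertia_)

# Fit the final model with the k read off the elbow (4 in the article).
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42).fit(X_pca)
labels = kmeans.labels_
print(n_components, [round(w) for w in wcss])
```

Plotting wcss against k gives the WCSS line; the elbow is read off visually, so the final k remains a judgment call.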
The resulting clustering is decent; however, there are many overlapping points between the turquoise and blue (cluster_3) clusters. Seeking potential enhancement, we are going to examine another clustering algorithm, this time not a distance-based but a distribution-based one: Gaussian Mixture Models.
In general, GMM can handle a greater variety of shapes, without assuming the clusters to be circular (as K-Means does). Also, as a probabilistic algorithm, it assigns probabilities to the datapoints, expressing how strongly each one is associated with a specific cluster. Yet, there’s no free lunch; GMM may quickly converge to a local optimum, which degrades the results. To tackle this, we can initialise the model with K-Means, by setting the respective class parameter (init_params='kmeans' in Scikit-Learn's GaussianMixture).
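As a minimal sketch of the above, on synthetic data: GaussianMixture is initialised with K-Means (which is also Scikit-Learn's default for init_params) and returns both hard labels and per-cluster membership probabilities. The n_init and covariance_type values are illustrative choices, not necessarily the ones used in the notebook.

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Synthetic stand-in for the PCA-transformed plays.
X, _ = make_blobs(n_samples=500, centers=4, cluster_std=1.5, random_state=0)

gmm = GaussianMixture(
    n_components=4,
    covariance_type="full",  # each cluster gets its own elliptical shape
    init_params="kmeans",    # seed the EM algorithm with K-Means
    n_init=5,                # several restarts to reduce the local-optimum risk
    random_state=0,
).fit(X)

hard_labels = gmm.predict(X)        # most likely cluster per sample
soft_labels = gmm.predict_proba(X)  # membership probability per cluster
print(soft_labels[:3].round(3))
```

Rows of soft_labels close to a one-hot vector are confident assignments, whereas flatter rows flag plays sitting between clusters.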
In order to pick a suitable # of clusters, we can utilise the BayesianGaussianMixture class in Scikit-Learn, which weights the clusters and pushes the weights of the unnecessary ones to or near zero.
array([0.07, 0.19, 0.03, 0.14, 0.19, 0.09, 0.06, 0.18, 0.05, 0.01])
Obviously, only 4 clusters surpass the 0.1 threshold.
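A hedged sketch of this weighting trick, again on synthetic data: we deliberately ask for more components than needed and count how many keep a non-negligible weight. The 0.1 cut-off and the n_init / max_iter values are assumptions for the demo.

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import BayesianGaussianMixture

X, _ = make_blobs(n_samples=500, centers=4, cluster_std=1.0, random_state=0)

# Over-specify the number of components; the Bayesian prior shrinks the
# weights of the components the data does not support.
bgm = BayesianGaussianMixture(
    n_components=10, n_init=3, max_iter=500, random_state=0
).fit(X)
weights = bgm.weights_.round(2)
print(weights)

threshold = 0.1  # assumed cut-off for a cluster worth keeping
n_clusters = int((bgm.weights_ > threshold).sum())
print(n_clusters)
```

The resulting count is then used as n_components for the final GaussianMixture fit.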
cluster_3 (blue) is better separated this time, while cluster_2 (turquoise) is better contained, too.
To make the assessment of the clusters easier, we introduce a new variable which expresses the net score of the examined features. Each group is weighted to better express its impact on the final performance, and the algebraic sum of the weighted groups is calculated. I allocate the weights as follows:
NET_SCORE = 0.5*group_1 + 0.3*group_2 + 0.2*group_3 - 0.3*group_5
# group_4 (START_POSITION) shouldn't be scored (categorical feature)
# being a center ‘5’ doesn't mean to be ‘more’ of something a guard ‘1’ stands for!
# group_5 (DEF_RATING) is negative in nature
# it should be subtracted from the Net Score
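One possible reading of this formula, sketched on toy numbers: per cluster, average every feature, sum the averages within each group, and combine the group totals with the weights above. The aggregation (sum of per-cluster means) is my assumption; the groups are shortened and the values are made up.

```python
import pandas as pd

# Shortened feature groups; weights as in the formula above.
groups = {
    "group_1": ["AST_PCT", "EFG_PCT"],
    "group_2": ["AST_RATIO"],
    "group_3": ["PIE"],
    "group_5": ["DEF_RATING"],
}
weights = {"group_1": 0.5, "group_2": 0.3, "group_3": 0.2, "group_5": -0.3}

# Toy labelled plays (made-up values); in the article the labels come from the GMM.
plays = pd.DataFrame({
    "cluster":    [0, 0, 1, 1],
    "AST_PCT":    [10.0, 20.0, 30.0, 40.0],
    "EFG_PCT":    [50.0, 50.0, 60.0, 60.0],
    "AST_RATIO":  [10.0, 10.0, 20.0, 20.0],
    "PIE":        [5.0, 5.0, 10.0, 10.0],
    "DEF_RATING": [100.0, 100.0, 110.0, 110.0],
})


def net_score(cluster_means: pd.DataFrame) -> pd.Series:
    """Weighted algebraic sum of the per-group feature totals."""
    score = pd.Series(0.0, index=cluster_means.index)
    for name, cols in groups.items():
        score += weights[name] * cluster_means[cols].sum(axis=1)
    return score


cluster_means = plays.groupby("cluster").mean()
scores = net_score(cluster_means)
print(scores)
```

Because the features live on different scales, the absolute NET_SCORE has no meaning on its own; it is only useful for ranking clusters against each other.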
So, let’s score and evaluate the clusters.
GM Clusters scored by
cluster_3 outperforms the rest with a NET_SCORE of approx. 662.49, while cluster_1 comes next. But what is worth highlighting here is the quantified comparison between the 95th percentile and the newly introduced
It becomes visually clear that cluster_3 dominates the 95th percentile segment, with an increase of 146.5 NET_SCORE units! Consequently:
Cluster_3 encapsulates those ‘plays’ which derive from great SG performance, in a really balanced way: group_1 features reach high levels, while most of the rest keep a decent average. This analysis takes into account more features than the one initially attempted (ref. EDA), which relied on a single dominant one (AST_PCT). This proves the point that…
Induction #4: Clustering promotes a more comprehensive separation of data, drawing on signals from more components; along these lines, we managed to reveal a clearer indication of what performance to anticipate from a top-class SG.
Now, we are able to manipulate the labelled (with clusters) dataset and develop a way to predict the cluster a new sample (unlabelled ‘play’) belongs to.
[Refer to 02_classifying[logres_rf_xgboost].ipynb]
Our problem belongs to the category of Multi-Class Classification and the first step to take is choosing a validation strategy to tackle potential overfitting.
# check for the clusters' balance
The skewed dataset implies that Stratified K-fold cross-validation has to be chosen over a random one. This keeps the labels’ ratio constant in each fold, so whatever metric we choose to evaluate will give similar results across all folds. And speaking of metrics, the F1 score (harmonic mean of precision and recall) looks more appropriate than accuracy, since the targets are skewed.
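To illustrate why stratification matters, here is a small sketch with made-up, imbalanced labels: every validation fold ends up with the same label ratio as the full dataset.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Imbalanced toy labels: 70% / 20% / 10% across three clusters.
y = np.array([0] * 70 + [1] * 20 + [2] * 10)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder features

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_ratios = []
for train_idx, valid_idx in skf.split(X, y):
    # Share of each label inside the validation fold.
    ratios = np.bincount(y[valid_idx], minlength=3) / len(valid_idx)
    fold_ratios.append(ratios)
    print(ratios)
```

With a plain (non-stratified) K-fold split, a rare cluster could be under-represented or even missing in some folds, which makes per-fold F1 scores hard to compare.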
Next, we normalise the data in order to train our (baseline) Logistic Regression model. Be mindful here to fit the scaler on the training dataset first and then transform both training and testing data. This is crucial to avoid data leakage!
Mean F1 Score = 0.9959940207018171
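One leakage-safe way to combine scaling with cross-validation is a Scikit-Learn pipeline, sketched below on synthetic data. StandardScaler is an assumption here (the exact scaler used in the notebook is not shown), and the score printed by this snippet is not comparable to the figure above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced 4-class problem standing in for the labelled plays.
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=6,
    n_classes=4, weights=[0.4, 0.3, 0.2, 0.1], random_state=42,
)

# The pipeline re-fits the scaler on the training part of every fold,
# so the validation part never influences the scaling statistics.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring="f1_weighted")
print(f"Mean F1 Score = {scores.mean():.4f}")
```

Scaling the whole dataset before splitting would let validation statistics leak into training, which is precisely what the pipeline prevents.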
Such a high score from the very beginning is suspicious. Among the available ways to check the features’ importance (e.g. MDI), I choose Permutation Feature Importance, which is model-agnostic; hence we are able to apply any conclusions to all the models.
START_POSITION contributes with extremely high importance (on its own, it scores F1=0.865). If we check the pertinent descriptive statistics, we see that all group_1 features reach their minimum level when START_POSITION is 0 (i.e. NaN).
START_POSITION Descriptive Statistics
This reveals that those players didn’t start the game, so they are very likely to have played for less time than the others, hence they have worse stats! The same applies to the MIN variable: it expresses precisely the time a player spent on court. Therefore, both cause data leakage and we ignore them. Further to that, we distinguish the most significant features.
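The Permutation Feature Importance check used above can be sketched with Scikit-Learn's permutation_importance. The data is synthetic: only the first two columns carry signal, so they should surface at the top of the ranking, much like a leaking feature would.

```python
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# shuffle=False keeps the two informative features in columns 0 and 1.
X, y = make_classification(
    n_samples=500, n_features=6, n_informative=2, n_redundant=0,
    n_classes=2, n_clusters_per_class=1, class_sep=2.0,
    shuffle=False, random_state=0,
)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, stratify=y, random_state=0
)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Shuffle one column at a time on held-out data and record the drop in F1.
result = permutation_importance(
    model, X_valid, y_valid, scoring="f1_weighted", n_repeats=10, random_state=0
)
ranking = result.importances_mean.argsort()[::-1]  # most important first
for idx in ranking:
    print(f"feature_{idx}: {result.importances_mean[idx]:.3f}")
```

A single feature whose permutation wipes out most of the score is worth a second look: it may be a legitimately strong predictor, or, as with START_POSITION and MIN, a proxy that leaks the target.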
Additionally, we attempt to reduce the # of features by constructing a new, smaller set of variables which captures a significant portion of the original ones' information. We put PCA in the spotlight once again, this time trying 9 and 7 components. Be careful to only use the remaining normalised features (≠
Eventually, we end up with the following feature ‘buckets’:
all_feats = [all] - [START_POSITION,MIN]
sgnft_feats = [all_feats] - [OFF_RATING,AST_TOV,PACE,PACE_PER40,PIE]
pca_feats = [pca x 9]
pca_feats = [pca x 7]
After taking care of feature selection, we set out to optimise each model’s hyperparameters. GridSearch is quite effective, albeit time-consuming. The procedure is similar for all models; for the sake of simplicity, I only present the XGBoost case:
Best score: 0.7152999106187636
Best parameters set:
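For reference, a grid search of this kind can be sketched with GridSearchCV. To keep the snippet dependent on Scikit-Learn only, a RandomForestClassifier stands in for XGBoost (XGBClassifier follows the same estimator interface and could be swapped in); the parameter grid is hypothetical and deliberately tiny, since the grid actually used is not shown.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

X, y = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3, random_state=0
)

# Hypothetical, deliberately small grid (not the article's).
param_grid = {"n_estimators": [50, 100], "max_depth": [3, 6]}

search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_grid=param_grid,
    scoring="f1_weighted",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)
search.fit(X, y)

print(f"Best score: {search.best_score_:.4f}")
print("Best parameters set:", search.best_params_)
```

The cost grows with the product of all grid sizes times the number of folds, which is why GridSearch becomes time-consuming so quickly.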
Now, we declare the optimum hyperparameters per model in model_dipatcher.py, which dispatches the model we choose into train.py. The latter wraps up the whole training procedure, making it easier to train the tuned models with every feature ‘bucket’. We get:
## Logistic Regression ##
used         | num_feats | F1_weighted
============ | ========= | ===========
all_feats    | 16        | 0.7144
sgnft_feats  | 11        | 0.7152
pca_feats    | 9         | 0.7111  # sweet-spot
pca_feats    | 7         | 0.7076

## Random Forest ##
used         | num_feats | F1_weighted
============ | ========= | ===========
all_feats    | 16        | 0.7213
sgnft_feats  | 11        | 0.7145
pca_feats    | 9         | 0.7100
pca_feats    | 7         | 0.7049

## XGBoost ##
used         | num_feats | F1_weighted
============ | ========= | ===========
all_feats    | 16        | 0.7238  # best
sgnft_feats  | 11        | 0.7168
pca_feats    | 9         | 0.7104
pca_feats    | 7         | 0.7068
Note: Your results may vary due to either the model’s stochastic nature or the numerical precision.
A classic performance vs simplicity trade-off arises; I choose Logistic Regression with the pca_feats (x9) to proceed further.
Now, for the testing dataset’s plays, we predict their clusters by using the selected model.
For validation to happen, ground-truth labels are necessary. However, that is not our case, as the testing dataset (test_proc.csv) is not labelled. You may wonder why we don’t label it via clustering, but that would lead us to the very same procedure that Cross Validation has already performed 5 times: isolate a small portion of data and validate on it.
Instead, we are going to further evaluate the classifier by conducting qualitative checks. We can either manually review the labels of a portion of data to ensure they are good or compare the predicted to the training clusters and check that any dominant descriptive statistics still hold.
Predicted Clusters score by
cluster_3 again takes the lead, outperforming the rest with a NET_SCORE of 109.35 units, while reaching the highest level along most of the crucial features (
The last and most interesting part involves decision making. At first, we make predictions on the candidate players’ (Jrue Holiday, Danny Green, Bogdan Bogdanovic) first-half 2020 season ‘plays’ and label them with the respective cluster.
Then we check for their membership in the precious cluster_3, ranking them according to the respective ratio of total_plays. So, we run the predict.py script and get:
'Jrue Holiday': 0.86,
'Bogdan Bogdanovic': 0.38,
'Danny Green': 0.06
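The ranking logic behind these numbers can be sketched as follows, assuming the ratio is the share of a player's plays predicted to fall in cluster_3. This is not the actual predict.py; the players and predictions below are placeholders.

```python
import pandas as pd

# Toy predictions: one row per 'play', with the cluster predicted by the classifier.
# Player names are placeholders, not the article's candidates.
preds = pd.DataFrame({
    "player":  ["A", "A", "A", "A", "B", "B", "B", "B", "B"],
    "cluster": [3, 3, 3, 1, 3, 0, 1, 2, 0],
})

TARGET_CLUSTER = 3

# Share of each player's plays that fall in the target cluster.
ratios = (
    preds["cluster"].eq(TARGET_CLUSTER)
    .groupby(preds["player"])
    .mean()
    .sort_values(ascending=False)
    .round(2)
)
print(ratios.to_dict())  # {'A': 0.75, 'B': 0.2}
```

Using a ratio rather than a raw count keeps players with different numbers of games comparable.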
And guess what?
On November 24th of 2020, the Bucks officially announced Jrue Holiday’s transaction! Just as predicted: a validation straight out of real life…
We have come a long way so far… Starting from Kaggle & NBA API we built a vast dataset, clustered it and revealed insightful patterns on what it takes to be a really good Shooting Guard. We, then, trained various Classification models on the labelled dataset, predicting with decent accuracy the Cluster a new player entry may be registered in. By doing so, we managed to spotlight the next move Milwaukee Bucks should (and did!) take, to fill the SG position.
Similarly to the DJ vs Data Scientist case, it’s quasi-impossible to assertively answer the potential of Data Science in the Scouting Field. Yet, once again the signs of the times denote a favourable breeding ground for AI implementation in the decision-making field of the Sport Industry…
I dedicate this project to my good friend Panos — an ardent fan of Basketball, astronomy aficionado and IT expert.
 A. Thakur, Approaching (Almost) Any Machine Learning Problem, 1st edition (2020), ISBN-10: 9390274435