Can a Data Scientist Replace an NBA Scout? ML App Development for Best Transfer Suggestion

Using the NBA API to create your own ML models and predict the best player transaction

Owing to the Greek Freak who recently reached the peak, I finally gave this project a chance, after keeping it on hold for the past few months. NBA it is!

The main goal of this post is to present an end-to-end ML app development procedure, which combines a number of Supervised and Unsupervised algorithms, including Gaussian Mixture Models (GMM), K-Means, Principal Component Analysis (PCA), XGBoost, Random Forest, and Multinomial Logistic Regression classifiers.


After successfully clustering Whiskey varieties to boost a Vendor’s sales, Data Corp accepted a new project: assist the Milwaukee Bucks in making the best next move during the 2020 transaction window. That is, to pre-assess the candidate players for the Shooting Guard (SG) position and buy the one who performs best. Having little basketball knowledge myself leads me to a tricky alternative:

How about requesting the NBA API, fetching player data from the past seasons’ games (e.g. assist to turnovers, assist % and so on), categorising them in a meaningful way for the General Manager (GM), and finally guiding him on whom to spend the transfer budget on?

To better communicate the outcomes, a couple of assumptions were made:

#1: We are at the end of the 2020 season (Oct). The Bucks GM has prepared a list of 3 candidates for the SG position: Jrue Holiday, Danny Green, and Bogdan Bogdanovic.

#2: To accomplish the mission we have to uncover any insights from the data which may help the Bucks improve their attacking performance (max assists, min turnovers etc.), while preserving the rest of the stats (e.g. Weighted Field Goal % etc.). That is, we should not simply advise the GM to buy the best passer or scorer, for this might compromise the remaining valuable statistics.

Modus Operandi

  1. Build the dataset; fetch the player-wise statistics per game (from now on ‘plays’).
  2. Perform EDA; build intuition on how to exploit the variables and draw early conclusions.
  3. Cluster ‘plays’ via K-Means & GMM; reveal underlying patterns and identify the most suitable cluster for the case.
  4. Using the now labeled dataset (clusters = labels), train a number of Multi-class Classifiers, incl. Multinomial Logistic Regression, Random Forest & XGBoost.
  5. Make Predictions on the candidate players’ latest ‘plays’ (2020 season) and benchmark them accordingly.
  6. Serve the trained models to the end-user, by building & serving an API (analysed in next post).
Workflow Chart (Image by author)

You can either run the notebooks for an explained workflow or the script files (.py) for an automated one.

1. Dataset

The dataset is built in 2 steps: (a) starting from this Kaggle dataset we query the basketball.sqlite to extract GAME_IDs for seasons 2017–2020, (b) we make requests to the NBA_api to fetch the player-wise data per game.
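Step (a) could look roughly like the sketch below. It is not the project's actual script: the table and column names (`Game`, `GAME_ID`, `SEASON`) are my assumptions about the database schema, and a tiny in-memory database stands in for basketball.sqlite so that the snippet runs on its own.

```python
import sqlite3

# In-memory stand-in for basketball.sqlite; the schema below is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Game (GAME_ID TEXT, SEASON INTEGER)")
conn.executemany(
    "INSERT INTO Game VALUES (?, ?)",
    [("0021600001", 2016), ("0021700001", 2017), ("0021800001", 2018),
     ("0021900001", 2019), ("0022000001", 2020)],
)


def game_ids_for(conn, first_season, last_season):
    """Return the distinct GAME_IDs played between two seasons (inclusive)."""
    rows = conn.execute(
        "SELECT DISTINCT GAME_ID FROM Game "
        "WHERE SEASON BETWEEN ? AND ? ORDER BY GAME_ID",
        (first_season, last_season),
    ).fetchall()
    return [row[0] for row in rows]


train_ids = game_ids_for(conn, 2017, 2019)  # seasons used for training
test_ids = game_ids_for(conn, 2020, 2020)   # season held out for testing
print(train_ids, test_ids)
```

In step (b), each of these GAME_IDs would then be passed to a box-score request of the NBA API to collect the player-wise rows.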

The whole procedure is wrapped up in the which you may choose to run, or else use the already prepared datasets in the ‘data/raw’ directory.

We use games from seasons 2017–2019 to train both clustering and classification models and keep 2020 data for testing purposes. Here is a sample of the dataset and an adequate explanation of the variables:

Dataset sample

To reduce clutter, I do not delve into the data cleaning and pre-processing procedures; you may refer to 00_EDA.ipynb &, respectively.

2. EDA

[A thorough EDA is provided in the 00_EDA.ipynb]

We have to build intuition on what is really important when it comes to assessing an SG’s performance. In this context, we classify features from the least to the most important one, based on domain knowledge. This will also make it easier to take the final decision.

# classify features by domain importance (feature names are dataset columns)
group_2 = ['MIN', 'AST_RATIO', 'DREB_PCT']
group_4 = ['START_POSITION']
group_5 = ['DEF_RATING']

Explained — Classified Features

In brief, all features are of high quality in terms of null presence, duplicated samples, or low-variance, while their boundaries make sense (no suspicious cases of unreasonable extreme values).

Features’ Histograms

However, many of them contain outliers on either side. This is to be expected, as we deal with real game plays and no one (not even the same player in different games) can always perform within a fixed performance ‘bracket’.

Features’ Whisker Box Plots

Concerning the crucial set of group_1 features, they are almost evenly split between left- and right-skewed. However, the dominant factor is the strong presence of outliers beyond the pertinent upper boundary. There are many players who often perform well above expectations and this fact is in line with our initial conclusion:

Induction #1: We have to deeply study group_1, in a way that will not only guarantee significant levels for the respective features, but also won’t compromise (the greatest possible number of) the rest.

With that in mind, we start with a naive approach: sort the dataset by a master feature (AST_PCT), take its upper segment (95th percentile) and evaluate the plays ‘horizontally’ (across all features).

Descriptive Stats (Population)

Descriptive Stats (95th Percentile)

The outcome is disappointing. By comparing the feature averages of the population with those of the 95th percentile, we see that by maximising along AST_PCT many of the remaining features get worse, thereby violating Assumption #2. Besides, we wouldn’t like to buy an SG with a great Assist ratio but poor Field Goal performance (EFG_PCT)!
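The naive comparison can be reproduced with a few lines of pandas. This is only a sketch on synthetic numbers: the column names follow the article, but the values are random, so the deltas it prints say nothing about real players.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic 'plays'; values are made up purely for illustration.
plays = pd.DataFrame({
    "AST_PCT": rng.uniform(0.0, 0.6, 1000),
    "EFG_PCT": rng.uniform(0.2, 0.8, 1000),
    "AST_TOV": rng.uniform(0.0, 5.0, 1000),
})

# Keep only the plays at or above the 95th percentile of the master feature.
threshold = plays["AST_PCT"].quantile(0.95)
top_segment = plays[plays["AST_PCT"] >= threshold]

# Compare the feature averages 'horizontally': population vs. upper segment.
comparison = pd.DataFrame({
    "population_mean": plays.mean(),
    "top5pct_mean": top_segment.mean(),
})
comparison["delta"] = comparison["top5pct_mean"] - comparison["population_mean"]
print(comparison.round(3))
```

A negative delta on a feature we care about (e.g. EFG_PCT) is exactly the kind of side effect described above.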

Therefore, it becomes clear that we cannot accomplish our mission of building the optimum SG’s profile based on plain exploratory techniques. Thus:

Induction #2: We have to build better intuition on the available data and use more advanced techniques, to effectively segment it and capture the underlying patterns, which may lead us to the best SG’s profile.

Clustering picks up the torch…

3. Clustering

[Refer to 01_clustering[kmeans_gmm].ipynb]


We begin with the popular K-Means algorithm, but first apply PCA in order to reduce the dataset dimensions while retaining most of the original features’ variance [1].

PCA ~ Explained Variance

We opt for a 4-component solution, as it explains at least 80% of the population’s variance. Next, we find the optimum # of clusters (k), by using the Elbow Method and plotting the WCSS line:

WCSS ~ Clusters Plot

The optimal # clusters is 4 and we are ready to fit K-Means.
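For readers who want to try the two selection steps themselves, here is a minimal sketch on synthetic blobs (not the NBA data). The scaler, the 80% cut-off logic via `np.searchsorted` and the k range are my assumptions; the notebook may do this differently.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 'plays' matrix (4 latent groups, 10 features).
X, _ = make_blobs(n_samples=600, n_features=10, centers=4, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

# Smallest number of components whose cumulative explained variance reaches 80%.
pca_full = PCA().fit(X_scaled)
cum_var = np.cumsum(pca_full.explained_variance_ratio_)
n_components = int(np.searchsorted(cum_var, 0.80) + 1)
X_pca = PCA(n_components=n_components).fit_transform(X_scaled)

# WCSS (inertia) for k = 1..8; the 'elbow' is where the curve flattens out.
wcss = [
    KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_pca).inertia_
    for k in range(1, 9)
]
print(n_components, [round(w, 1) for w in wcss])
```

Plotting `wcss` against k gives the line chart used by the Elbow Method.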

K-Means Clusters

The resulting clustering is decent; however, there are many overlapping points between cluster_2 and cluster_3 (turquoise & blue, respectively). Looking for a potential improvement, we are going to examine another clustering algorithm, this time not a distance-based but a distribution-based one: Gaussian Mixture Models [2].


In general, GMM can handle a greater variety of shapes without assuming the clusters to be circular (as K-Means does). Also, as a probabilistic algorithm, it assigns probabilities to the datapoints, expressing how strongly each one is associated with a specific cluster. Yet, there’s no free lunch; GMM may converge quickly to a local optimum, which degrades the results. To tackle this, we can initialize it with K-Means, by setting the respective class parameter [3].

In order to pick a suitable # of clusters, we can use the BayesianGaussianMixture class in Scikit-Learn, which assigns a weight to each cluster and pushes the weights of the superfluous ones to or near zero.

# returns
array([0.07, 0.19, 0.03, 0.14, 0.19, 0.09, 0.06, 0.18, 0.05, 0.01])

Only 4 clusters surpass the 0.1 threshold (those weighted 0.19, 0.14, 0.19 and 0.18).
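A hedged sketch of this selection step follows, again on synthetic blobs. The upper bound of 10 components matches the length of the weight array above; the `WEIGHT_THRESHOLD` name, its 0.1 value and the remaining settings are illustrative assumptions rather than the notebook's exact configuration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import BayesianGaussianMixture, GaussianMixture

# Synthetic data with 4 latent groups, standing in for the PCA-reduced plays.
X, _ = make_blobs(n_samples=800, n_features=4, centers=4, random_state=0)

# Deliberately over-specify the components; unneeded ones get ~0 weight.
bgm = BayesianGaussianMixture(
    n_components=10, n_init=3, max_iter=500, random_state=0
).fit(X)
weights = np.round(bgm.weights_, 2)

WEIGHT_THRESHOLD = 0.1  # illustrative cut-off
n_clusters = int((weights > WEIGHT_THRESHOLD).sum())

# Fit the final GMM, initialised with K-Means to reduce the risk of a poor start.
gmm = GaussianMixture(
    n_components=n_clusters, init_params="kmeans", n_init=5, random_state=0
).fit(X)
labels = gmm.predict(X)        # hard cluster assignment per sample
probs = gmm.predict_proba(X)   # soft membership probabilities per sample
print(weights, n_clusters)
```

Note that `init_params="kmeans"` is already the default in Scikit-Learn's GaussianMixture; it is spelled out here only to make the choice visible.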

GMM Clusters

That’s it! cluster_3 (blue) is better separated this time, while cluster_2 (turquoise) is better contained, too.

Clusters Evaluation

To improve the clusters’ assessment, we introduce a new variable which depicts the net score of the examined features. Each group is weighted to better express the magnitude of its effect on the final performance, and the algebraic sum of the weighted groups is calculated. I allocate the weights as follows:

NET_SCORE = 0.5*group_1 + 0.3*group_2 + 0.2*group_3 - 0.3*group_5

# group_4 (START_POSITION) shouldn't be scored (categorical feature)
# being a center ‘5’ doesn't mean to be ‘more’ of something a guard ‘1’ stands for!

# group_5 (DEF_RATING) is negative in nature
# it should be subtracted from the Net Score
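To make the formula concrete, here is a small worked example. The weights come from the formula above, and group_2 and group_5 use the features listed in the EDA section; the members of group_1 and group_3, the toy values, and the choice to sum each group's features per play before averaging per cluster are all my assumptions.

```python
import pandas as pd

# group_1 and group_3 memberships are hypothetical placeholders.
groups = {
    "group_1": ["AST_PCT", "AST_TOV"],
    "group_2": ["MIN", "AST_RATIO", "DREB_PCT"],
    "group_3": ["EFG_PCT"],
    "group_5": ["DEF_RATING"],
}
# group_5 carries a negative sign, group_4 (categorical) is left out.
weights = {"group_1": 0.5, "group_2": 0.3, "group_3": 0.2, "group_5": -0.3}

# Four toy 'plays' already labelled with a cluster.
plays = pd.DataFrame({
    "cluster":    [0, 0, 1, 1],
    "AST_PCT":    [10, 20, 40, 30],
    "AST_TOV":    [2, 4, 6, 4],
    "MIN":        [30, 30, 35, 35],
    "AST_RATIO":  [20, 20, 30, 30],
    "DREB_PCT":   [10, 10, 15, 15],
    "EFG_PCT":    [50, 50, 60, 60],
    "DEF_RATING": [100, 100, 110, 110],
})


def net_score(df):
    """Weighted algebraic sum of the feature groups, computed per play."""
    score = pd.Series(0.0, index=df.index)
    for name, cols in groups.items():
        score += weights[name] * df[cols].sum(axis=1)
    return score


plays["NET_SCORE"] = net_score(plays)
cluster_scores = (
    plays.groupby("cluster")["NET_SCORE"].mean().sort_values(ascending=False)
)
print(cluster_scores)
```

With these toy numbers, cluster 1 averages 23.0 and cluster 0 averages 7.0, so cluster 1 would be the one to look at.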

So, let’s score and evaluate clusters.

GMM Clusters scored by NET_SCORE

Apparently, cluster_3 outperforms the rest with a NET_SCORE of approx. 662.49, while cluster_1 comes second. But what is worth highlighting here is the quantified comparison between the 95th percentile and the newly introduced cluster_3:

NET_SCORE Whisker Box Plots for 95th percentile & cluster_3

It becomes visually clear that cluster_3 dominates the 95th percentile segment, with an increase of 146.5 NET_SCORE units! Consequently:

Induction #3: cluster_3 encapsulates those ‘plays’ which derive from great SG performance, in a really balanced way: group_1 features reach high levels, while most of the rest keep a decent average. This analysis takes into account more features than the one initially attempted (ref. EDA), which leveraged a single dominant feature (AST_PCT). Which proves the point that…

Induction #4: Clustering promotes a more comprehensive separation of data, deriving from signals of more components and along these lines we managed to reveal a clearer indication of what performance to anticipate from a top-class SG.

Now, we are able to manipulate the labelled (with clusters) dataset and develop a way to predict the cluster a new sample (unlabelled ‘play’) belongs to.

4. Classifiers

[Refer to 02_classifying[logres_rf_xgboost].ipynb]

Our problem belongs to the category of Multi-Class Classification and the first step to take is choosing a validation strategy to tackle potential overfitting.

# check for the clusters' balance
0 27508
1 17886
3 11770
2 5729

The skewed dataset implies that a Stratified K-fold cross-validation has to be chosen over a random one. This will keep the labels’ ratio constant in each fold and whatever metric we choose to evaluate, it will give similar results across them all [4]. And speaking of metrics, the F1 score (harmonic mean of precision and recall) looks more appropriate than accuracy, since the targets are skewed [5].
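The effect of stratification is easy to verify. The sketch below rebuilds labels with the same imbalance as the counts above (scaled down by roughly 100) and prints the class ratios of each validation fold; the number of folds and the seed are arbitrary choices of mine.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Labels 0, 1, 3, 2 with roughly the proportions of the cluster counts above.
y = np.repeat([0, 1, 3, 2], [275, 179, 118, 57])
X = np.zeros((len(y), 1))  # features are irrelevant for the split itself

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_ratios = []
for _, val_idx in skf.split(X, y):
    counts = np.bincount(y[val_idx], minlength=4)
    fold_ratios.append(counts / counts.sum())

# One row per fold, one column per class: the rows should be nearly identical.
fold_ratios = np.array(fold_ratios)
print(fold_ratios.round(3))
```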

Next, we normalise the data in order to train our (baseline) Logistic Regression model. Be mindful here to fit the scaler on the training dataset first and only then transform both the training and the testing data. This is crucial to avoid data leakage [6]!

# returns
Mean F1 Score = 0.9959940207018171
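One convenient way to respect the fit-on-train-only rule during cross-validation is to wrap the scaler and the model in a pipeline, so that the scaler is re-fitted inside every training fold. The sketch below runs on synthetic data, and the choice of StandardScaler is an assumption on my part (the notebook may use a different scaler), so the printed score is unrelated to the one above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced 4-class problem standing in for the labelled plays.
X, y = make_classification(
    n_samples=1000, n_features=12, n_informative=6, n_classes=4,
    weights=[0.45, 0.28, 0.09, 0.18], random_state=0,
)

# The scaler is fitted on each training fold only, then applied to the
# corresponding validation fold: no statistics leak across the split.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="f1_weighted")
print(f"Mean F1 Score = {scores.mean():.4f}")
```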

Feature Importance

Such a tremendous score from the very beginning is suspicious. Among the available ways to check the features’ importance (e.g. MDI), I choose Permutation Feature Importance, which is model-agnostic, so its conclusions can be applied to all the models [7].

Permutation Feature Importance for: (a) all features, (b) all ≠ START_POSITION, (c) all ≠ START_POSITION, MIN

START_POSITION shows extremely high importance (on its own, it scores F1=0.865). If we check the pertinent descriptive statistics, we see that all group_1 features are at their minimum level when START_POSITION is 0 (i.e. NaN).

Descriptive Statistics

This reveals that those players didn’t start the game, so they have most likely played for less time than the others and hence have worse stats! The same applies to the MIN variable: it precisely expresses the time a player spent on court. Therefore both cause data leakage and we ignore them. Further to that, we distinguish the most significant features.
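For reference, this is roughly how Permutation Feature Importance can be computed with Scikit-Learn. The data is synthetic (the first three columns are informative and the last three are pure noise), and the model, split and number of repeats are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# shuffle=False keeps the informative features in columns 0-2, noise in 3-5.
X, y = make_classification(
    n_samples=800, n_features=6, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, shuffle=False, random_state=1,
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, stratify=y, random_state=1
)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Shuffle one column at a time on held-out data and measure the F1 drop.
result = permutation_importance(
    clf, X_val, y_val, scoring="f1_weighted", n_repeats=10, random_state=1
)
ranking = result.importances_mean.argsort()[::-1]
for i in ranking:
    print(f"feature_{i}: {result.importances_mean[i]:.3f} "
          f"+/- {result.importances_std[i]:.3f}")
```

A feature whose shuffling barely changes the score contributes little; one that dominates the ranking on its own, as START_POSITION does here, deserves a closer look for leakage.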

Feature Engineering

Additionally, we attempt to reduce the # of features by constructing a new, smaller set of variables which captures a significant portion of the original ones’ information. We put PCA in the spotlight once again, this time trying 9 and 7 components. Be careful to use only the remaining normalised features (≠ START_POSITION, MIN)!

Eventually, we end up with the following feature ‘buckets’:

all_feats   = [all] - [START_POSITION,MIN]
sgnft_feats = [all_feats] - [OFF_RATING,AST_TOV,PACE,PACE_PER40,PIE]
pca_feats = [pca x 9]
pca_feats = [pca x 7]

Hyperparameter Optimisation

After taking care of feature selection, we set out to optimise each model’s hyperparameters. GridSearch is quite effective, albeit time-consuming. The procedure is similar for all models; for the sake of simplicity I only show the XGBoost case:

# returns
Best score: 0.7152999106187636
Best parameters set:
colsample_bytree: 1.0
lambda: 0.1,
max_depth: 3,
n_estimators: 200
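The grid search itself follows the usual Scikit-Learn pattern. To keep the sketch dependency-free I swap in Scikit-Learn's GradientBoostingClassifier for XGBoost, so XGBoost-specific parameters such as lambda and colsample_bytree do not appear; the grid values and the synthetic data are likewise assumptions and will not reproduce the output above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Small synthetic 3-class problem so that the search finishes quickly.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)

# Stand-in for the XGBoost grid: every combination is cross-validated.
param_grid = {"max_depth": [2, 3], "n_estimators": [50, 100]}
search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid,
    scoring="f1_weighted",
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
)
search.fit(X, y)

print("Best score:", search.best_score_)
print("Best parameters set:", search.best_params_)
```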


Now, we declare the optimum hyperparameters per model in the which dispatches the model we choose into the The latter wraps up the whole training procedure, making it easier to train the tuned models with every feature ‘bucket’. We get:

## Logistic Regression ##
used        | num_feats | F1_weighted
=========== | ========= | ===========
all_feats   | 16        | 0.7144
sgnft_feats | 11        | 0.7152
pca_feats   | 9         | 0.7111  # sweet-spot
pca_feats   | 7         | 0.7076

## Random Forest ##
used        | num_feats | F1_weighted
=========== | ========= | ===========
all_feats   | 16        | 0.7213
sgnft_feats | 11        | 0.7145
pca_feats   | 9         | 0.7100
pca_feats   | 7         | 0.7049

## XGBoost ##
used        | num_feats | F1_weighted
=========== | ========= | ===========
all_feats   | 16        | 0.7238  # best
sgnft_feats | 11        | 0.7168
pca_feats   | 9         | 0.7104
pca_feats   | 7         | 0.7068

Note: Your results may vary due to either the model’s stochastic nature or the numerical precision.

A classical performance vs simplicity trade-off is introduced; I choose to proceed with Logistic Regression on the pca_feats (x9), which gives up very little F1 for a much simpler model.

5. Predictions

Now, for the testing dataset’s plays, we predict their clusters by using the selected model.


For validation to happen, ground-truth labels are necessary. However, that is not our case, as the testing dataset (test_proc.csv) is not labelled. You may wonder why we don’t label it via clustering, but that would lead us to the very same procedure that Cross Validation has already performed 5 times: isolate a small portion of data and validate on it.

Instead, we are going to further evaluate the classifier by conducting qualitative checks. We can either manually review the labels of a portion of data to ensure they are good or compare the predicted to the training clusters and check that any dominant descriptive statistics still hold.

Predicted Clusters scored by NET_SCORE

Indeed, cluster_3 takes the lead again by outperforming the rest with a NET_SCORE of 109.35 units, while reaching the highest level along most of the crucial features (OFF_RATING, AST_PCT, AST_TOV and POSS).


The last and most interesting part involves decision making. At first, we make predictions on the candidate players’ (Jrue Holiday, Danny Green, Bogdan Bogdanovic) first-half 2020 season ‘plays’ and label them with the respective cluster.

Then we check for their membership in the precious cluster_3, ranking them according to the respective ratio of cluster_3_plays / total_plays. So, we run the script and get:

# Results
'Jrue Holiday': 0.86,
'Bogdan Bogdanovic': 0.38,
'Danny Green': 0.06
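The ranking logic boils down to a group-by. Below is a minimal sketch with three hypothetical players and made-up predictions; the column names and the `TARGET_CLUSTER` constant are my own naming, not the script's.

```python
import pandas as pd

TARGET_CLUSTER = 3  # the cluster that encapsulates great SG performance

# Made-up predicted clusters for the plays of three hypothetical players.
predictions = pd.DataFrame({
    "PLAYER_NAME": ["Player A"] * 4 + ["Player B"] * 5 + ["Player C"] * 2,
    "cluster":     [3, 3, 3, 0,       3, 1, 0, 3, 2,       1, 0],
})

# Ratio of target-cluster plays to total plays, per player, best first.
ratios = (
    predictions
    .assign(in_target=predictions["cluster"].eq(TARGET_CLUSTER))
    .groupby("PLAYER_NAME")["in_target"]
    .mean()
    .sort_values(ascending=False)
)
print(ratios.round(2).to_dict())
```

Here Player A lands in the target cluster in 3 of 4 plays (0.75), Player B in 2 of 5 (0.40) and Player C in none, so Player A would top the list.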

And guess what?

On November 24th of 2020, the Bucks officially announced Jrue Holiday’s transaction! You guessed it: a validation straight out of reality…


We have come a long way so far… Starting from Kaggle & NBA API we built a vast dataset, clustered it and revealed insightful patterns on what it takes to be a really good Shooting Guard. We, then, trained various Classification models on the labelled dataset, predicting with decent accuracy the Cluster a new player entry may be registered in. By doing so, we managed to spotlight the next move Milwaukee Bucks should (and did!) take, to fill the SG position.

Similarly to the DJ vs Data Scientist case, it is almost impossible to give a definitive answer on the potential of Data Science in the Scouting field. Yet, once again the signs of the times point to a favourable breeding ground for AI implementation in the decision-making field of the Sport Industry…

Photo by Patrick Fore on Unsplash

I dedicate this project to my good friend Panos — an ardent fan of Basketball, astronomy aficionado and IT expert.

Thank you for reading & have a nice week! Should any question arise, feel free to leave a comment below or reach out to me on Twitter/LinkedIn. In any case…





[4] A. Thakur, Approaching (Almost) Any Machine Learning Problem, 1st edition (2020), ISBN-10: 9390274435




Aeronautical QA Engineer at HAF | Editorial Associate & Writer @TDataScience | Electronics Eng HAA | MBA HoU |