How the big picture gives us insights and a better understanding of ML by connecting the dots
There are so many machine learning algorithms out there, and we can find different kinds of overviews and cheat sheets. Why another overview?
I tried to build this different overview with these three main focuses in mind:
- a complete hierarchical relationship for all algorithms
- relationships between apparently different algorithms — for example, LDA vs Logistic regression, Gradient Boosting Machine vs Linear regression, SVM vs linear regression, etc.
- a structure to better answer common ML questions — for example, the effect of numerical scaling, feature importance, linear vs non-linear models.
When trying to connect the dots between the multitude of machine learning algorithms, I discovered that there are often several approaches to build or understand an algorithm. For example:
- The standard Linear regression can be seen as OLS (Ordinary Least Squares) Regression and is equivalent to the solution found by MLE (Maximum Likelihood Estimation) with a normal distribution for the residuals.
- Logistic regression can be defined with odds ratios, but it can also be seen as a smoothed version of linear regression.
- SVM is almost always defined as a (soft) margin maximizer, which gives the impression that the approach is very different from other algorithms, whereas it has the same model equation as linear regression, only with another loss function.
There is no single best way, and each approach allows us to understand one aspect. One approach can be easier than another depending on what you already know. My aim in this article is to find a meaningful way to connect them. This also gave me a lot of interesting insights that helped me better understand them. This overview also becomes a tool, a framework to go deeper into each algorithm.
Each discussion with fellow data scientists allows me to discover more. I will continue to improve it, and your comments are very welcome.
The Big Picture
The idea of this first overview is to map the most common algorithms. Globally, they are organized in a hierarchical structure, with three main categories. And there are also links between algorithms in different categories that we will discuss later.
- The first category contains Nearest Neighbors algorithms. The idea is to define a distance and find the observations most similar to a new observation in order to make predictions. There are mainly two ways to find neighbors: we can fix the number of neighbors (the most common way, called the K-Nearest-Neighbors algorithm), or we can fix a radius (the Radius Neighbors algorithm is less used, but it can be considered more intuitive, and on the internet we can even find pictures that illustrate radius neighbors while labeled as KNN).
- The second category contains Decision Tree based algorithms. A simple decision tree is usually not performant enough, so ensemble methods (in theory, they can also be applied to other categories of algorithms, but in practice they are very often used with decision trees) are used to aggregate many trees. Bagging consists of building trees on bootstrapped samples and aggregating them (by averaging the predictions of all the trees). Random Forest introduces additional randomness: when choosing variables to create splits, a random subset of predictors is considered. Boosting techniques consist of sequentially adding weak learners, and the common examples are AdaBoost and Gradient Boosting Machine.
- The third category contains algorithms made of Mathematical Functions, with linear regression as the base algorithm. From there, we can modify the loss function to get Lasso, Ridge, Elastic Net, SVM, and more. One shortcoming of linear regression is that it is … linear. So to handle non-linear data, it has to be enhanced. We will talk about feature space mapping, GAM, and neural networks.
To better explain this big picture, we will go through the following themes step by step:
- The intuition behind the three main categories of algorithms: why these three categories and why in this order
- How the models are trained and used for both regression and classification (binary and multiclass)
- How categorical variables and the scaling of numerical variables are handled by these algorithms
- The relationships between models and their enhancements: the basic algorithms are not performant enough, and different approaches can be used to overcome their shortcomings. I also discovered that the way of enhancing one category of algorithms can come from another category.
The intuition behind the algorithms
How to define supervised machine learning? To answer this question concretely, one usually comes up with a machine learning algorithm. But which one would you choose? Which one is the most intuitive to understand for beginners?
When I tried to decide on the order of the three main categories, I wanted it to reflect human intuition. For this, let’s mention some classic machine learning problems: house price prediction and Titanic survivor prediction. When asking people with no knowledge of machine learning (because the idea is to relate machine intelligence to human intelligence), the idea of neighbors comes most easily for house price prediction, and decision trees are the most intuitive for Titanic survivor prediction.
It happens that applying a math function is the least intuitive way. We will also see that simple math functions are not performant and complex functions are not that intuitive…
For house price prediction, the answer is often: let’s look at similar housing sold in the neighborhood!
The idea of distance is very intuitive because, to predict a value for a new observation, we can search for similar observations. And mathematically, similarity and neighborhood will be considered as the same thing. The distance that we intuitively perceive is the geometrical distance (in this case, between two houses). This distance can also be generalized for two vectors.
Another reason for which this algorithm is intuitive is that there is no need to train.
Although Nearest Neighbors based algorithms are not frequently used in reality for several reasons, the idea is interesting, and we will be able to relate to other algorithms.
Combination of Rules
When it comes to Titanic survival prediction, a decision tree that is a combination of rules (if-else) is a very intuitive way to explain machine learning. And we often say that for certain business problems, before creating complex algorithms, we can first apply business rules.
We often ask people which rule should be applied first when predicting the survival of a Titanic passenger. Many people come up with the correct idea of sex as the most important variable, and it is indeed what we find if we build a decision tree. We can see that when people have domain expertise (ladies first!), their intuition usually matches what the algorithms predict.
In practice, the least intuitive way is to apply a math function. Unless you are an aficionado of linear regression.
All the algorithms in this category are different mathematical functions (so the input must be numerical; we will come back to this later), and then a loss function must be defined.
In this case, it is particularly important to distinguish between the model’s function and the loss function. For example, the equation y = ax + b is the same for linear regression, Ridge, Lasso, Elastic Net, and SVM. The difference is the loss function that is used for optimization.
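To make this concrete, here is a minimal sketch (pure NumPy, toy data of my own) that fits the same function y = ax + b under two different losses: plain squared error (OLS) and squared error plus an L2 penalty on the slope (a simplified ridge). Same equation, different coefficients because the loss differs.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)
y = 2 * x + 1 + rng.normal(0, 0.5, 50)       # y ≈ 2x + 1 with noise
X = np.column_stack([x, np.ones_like(x)])    # add an intercept column

# OLS: minimize the sum of squared errors
a_ols, b_ols = np.linalg.lstsq(X, y, rcond=None)[0]

# Ridge-like loss: same function y = ax + b, but the loss adds alpha * a^2
alpha = 10.0
P = np.diag([alpha, 0.0])                    # do not penalize the intercept
a_ridge, b_ridge = np.linalg.solve(X.T @ X + P, X.T @ y)

print(a_ols, b_ols)       # close to 2 and 1
print(a_ridge, b_ridge)   # the slope is shrunk toward 0 by the penalty
```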
Model training and usage
Let’s first define some keywords:
- models: each algorithm produces a model that is used for predictions (with new observations)
- training algorithms: how the models are obtained, for some fixed hyperparameters
- hyperparameter tuning: the process of finding optimal hyperparameters to avoid overfitting or underfitting
Parameters and hyperparameters
- Nearest neighbors: there is no training process, so no parameters are calculated; this is also called instance-based learning. Although there is no training algorithm involved, it is still interesting to mention that there are different ways to find the neighbors (ball tree or k-d tree). In practice, we can say that the whole dataset is the model, so to use the model we have to store the whole dataset, which can be huge. There are some hyperparameters to optimize: the number k for KNN and the radius for Radius Neighbors.
- Decision trees: the training process consists of finding the splits that build the tree, so the parameters are the rules that create the splits. The hyperparameters for a single tree are the split criterion (Gini or entropy) and the pruning parameters (the depth of the tree, the number of observations in the leaves, the number of observations required to create splits, etc.). For aggregated trees, the number of trees is one hyperparameter, and depending on the ensemble method, there are specific ones — for example, the learning rate for Gradient Boosted Trees.
- Mathematical functions: the model consists of two main elements, the mathematical function (usually called the architecture in the case of deep learning models) and the coefficients for each predictor. The training algorithm consists of finding the coefficients; in practice, this is done by minimizing a cost function, often with Gradient Descent. The hyperparameters, when available, are often used to regularize the coefficients or to specify the characteristics of the feature space mapping (we will discuss this later; we can mention the different kernels we can choose for kernel SVM). The exact definition of a hyperparameter can be blurry. Are Lasso, Ridge, and Elastic Net regressions different models from linear regression, or are they only linear regressions with hyperparameters that regularize the parameters? The answer may not be very important. It is interesting to know that in scikit-learn, logistic regression directly has these arguments (penalty and C). The SGDClassifier and SGDRegressor models are quite interesting to consider because all these arguments are centralized (loss function, penalty, alpha, l1_ratio). We can go further: is an SVM model with a particular kernel a different model, or a general SVM with the chosen kernel as a hyperparameter? In the case of deep learning, we can mention some more hyperparameters: the number of layers, the drop-out rate, the activation function, etc. In terms of size, the model is often very small because only a few coefficients are stored, except in deep learning, where recent models can contain billions or even trillions of coefficients.
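As a hedged illustration of "training = finding the coefficients", here is a bare-bones gradient descent on a toy linear model (synthetic data of my own, not from the article): the model is the function plus its coefficients w, and training just searches for the w that minimizes the cost.

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.column_stack([rng.uniform(-1, 1, 100), np.ones(100)])
true_w = np.array([3.0, -0.5])
y = X @ true_w + rng.normal(0, 0.1, 100)

# Gradient descent on the mean squared error cost
w = np.zeros(2)
lr = 0.1
for _ in range(500):
    grad = 2 / len(y) * X.T @ (X @ w - y)   # gradient of the MSE
    w -= lr * grad

print(w)  # close to [3.0, -0.5]
```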
When creating the model, we are focused on the training process, but it is also important to understand how the model will be implemented. To better understand, we can imagine that you have to implement them in Excel.
- Nearest neighbors: we have to store the whole dataset. For one new observation, we calculate its distance to all stored observations and look for the nearest neighbors by sorting by distance.
- Decision trees: we have to store all the rules, and it is usually difficult to do by hand in Excel, even for a single tree, let alone for many trees.
- Mathematical functions: except for large deep learning models, the model can easily be stored in Excel: the coefficients are stored in cells, and we can write functions that take new observations as input to calculate the predictions.
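The thought experiment above can be sketched in plain Python (a hypothetical three-row dataset): the nearest neighbors "model" is the stored data plus a sort, while the mathematical-function "model" is just two stored coefficients.

```python
import math

# Three rows of hypothetical data: (features, target)
data = [((1.0, 1.0), 10.0), ((2.0, 2.0), 20.0), ((8.0, 8.0), 80.0)]

def knn_predict(x, k=2):
    # "Model" = the stored dataset; prediction = sort by distance, average
    neighbors = sorted(data, key=lambda row: math.dist(row[0], x))[:k]
    return sum(target for _, target in neighbors) / k

# For a mathematical function, the "model" is just the stored coefficients
a, b = 10.0, 0.0
def linear_predict(x):
    return a * x[0] + b

print(knn_predict((1.5, 1.5)))     # 15.0 — average of the two nearest targets
print(linear_predict((1.5, 1.5)))  # 15.0
```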
Regression and classification
When talking about supervised learning, in many overviews, we often see two sub-categories: regression and classification. As a reminder, a regression problem is when the target variable is continuous whereas a classification task is when the target variable is categorical.
I chose not to present them as two subcategories of supervised learning, because the same algorithm can often work for both. Let’s also see how the target variable is used in the training process.
For nearest neighbors algorithms, the target variable is not used in the training process, because only the predictors are used to find the neighbors, so the nature of the target variable has no impact on training. Once the neighbors are found: if the target variable is numerical, the average value of the neighbors is used for the prediction; if it is categorical (with two or more categories), the proportions of the classes are used, and these proportions can be interpreted as probabilities. There is no difference between binary and multiclass classification.
For decision tree-based models, the target variable is used to create the splits, or rules, that make it homogeneous in the leaves. The structure of the model is the same for both tasks. To make a prediction, we first determine which leaf the new observation falls into. Then, for regression, we predict the average value of the observations in the leaf; for classification, we calculate the proportions of all classes. Again, there is no difference between binary and multiclass classification.
For mathematical function models, the target variable is used in a cost function that is minimized to find the coefficients of each predictor (for categorical features, one-hot encoding is used). Since a mathematical function takes only numerical values as input (we will see this later) and also outputs only numerical values, we could say a priori that these models only work for regression tasks. For a regression problem, there is indeed no difficulty. What happens for classification tasks? There are three main solutions, and they generalize differently from binary to multiclass classification.
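A minimal sketch of leaf-based prediction, using a hypothetical one-split tree with illustrative Titanic-style counts (the numbers are for the example only): find the leaf, then use the class proportions as probabilities.

```python
# Illustrative counts of training observations in each leaf
# (the tree has a single split on sex)
leaf_counts = {
    "female": {"survived": 233, "died": 81},
    "male":   {"survived": 109, "died": 468},
}

def predict_proba(sex):
    # Find the leaf, then return the class proportions as probabilities
    counts = leaf_counts[sex]
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()}

p = predict_proba("female")
print(p["survived"])  # 233 / 314 ≈ 0.742
```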
- Hyperplane by cutting the predicted target value in the middle: for binary classification, the target value is 0 or 1, and we can still fit a linear regression. Then, if the output is higher than 0.5, the prediction is 1, and if it is lower than 0.5, the prediction is 0. It works the same way for SVM (Support Vector Machine). Sometimes the target values are chosen to be -1 and 1, but that doesn’t change the results. The final equation defines a hyperplane that cuts the space into two parts. For multiclass classification (let m be the number of classes), this method doesn’t generalize well, because we try to bring the problem back to binary classification. We can use the one-versus-rest method, which creates m models, each considering one class as 1 and all the others as 0. We also have the one-versus-one method, which creates m*(m-1)/2 models, one for each pair of the m classes.
- Logistic/softmax regression: another approach to solving the binary classification task is the GLM (Generalized Linear Model) approach, which uses a link function to limit the output of the function between 0 and 1 with a monotonic and smooth transition. In general, different functions have this property; we can mention the logistic function (the inverse of the logit) and the inverse of the probit. In practice, the most common is the logistic function, which also has very nice mathematical properties in terms of differentiation. It is worth mentioning that to get the final class prediction, we also cut in the middle, so logistic regression is also a linear classifier. What is different is that the output is a probability that can be used directly. Hyperplanes built with linear regression or SVM do not natively provide probabilities; one approach is to apply the logistic function to obtain them. (One interesting discussion here and here.) For multiclass classification, the generalization is done elegantly with the softmax function. For neural networks, the final layer is often a softmax regression/classification for multiclass classification.
- GDA (Gaussian Discriminant Analysis): in this family of algorithms, we find LDA (Linear Discriminant Analysis), QDA (Quadratic Discriminant Analysis), Naive Bayes classification, etc. They are also called generative models, as opposed to discriminative models (even though the keyword Discriminant appears in their names! Here is the Wikipedia article). And a priori, they are only used for classification tasks.
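The second solution above can be sketched directly: the logistic function squashes the linear score z = ax + b into (0, 1), cutting the probability at 0.5 is the same as cutting the score at 0, and softmax is the multiclass generalization. A minimal pure-Python sketch:

```python
import math

def sigmoid(z):
    # logistic link: squashes the linear score z into (0, 1)
    return 1 / (1 + math.exp(-z))

def softmax(zs):
    # multiclass generalization: one linear score per class, normalized
    exps = [math.exp(z) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

# Cutting the probability at 0.5 is the same as cutting the score at 0,
# which is why logistic regression is still a linear classifier
print(sigmoid(0.0))             # 0.5 — the decision boundary z = 0
probs = softmax([2.0, 1.0, 0.1])
print(probs)                    # three probabilities summing to 1
```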
Feature variable handling
After considering how the target variable is used in these algorithms, we can now consider the feature variables.
Numerical variable and categorical variable
- Nearest neighbors: only numerical values can be taken into account to calculate the distance between two observations, so categorical variables are not handled natively. One can think of one-hot encoding, but the resulting distances are not very meaningful, and label encoding would be wrong because it imposes an arbitrary order.
- Decision trees: since trees are built with rules that successively split observations into two groups, variables are always treated as categorical in a certain way. Numerical variables are of course handled, but they are transformed into categories by the splits. It is worth mentioning that scikit-learn only handles numerical variables, so categorical variables are often one-hot encoded.
- Mathematical functions: only numerical values can be taken into account. So the categorical variables are often one-hot encoded.
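A minimal sketch of the one-hot encoding mentioned in all three cases (the `one_hot` helper is hypothetical, not a library function): each category becomes its own 0/1 column, so distance-based and function-based models receive only numbers.

```python
def one_hot(values):
    # hypothetical helper: one 0/1 column per category, in sorted order
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]

rows = one_hot(["red", "blue", "red"])
print(rows)  # [[0, 1], [1, 0], [0, 1]] — columns are (blue, red)
```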
Effect of numerical variable scaling
- Nearest neighbors: all numerical variables are taken into account in the same way (unless weights are assigned to different variables). Changing the values (be it scaling or another transformation) will change the calculated distances, so the results will be different.
- Decision trees: numerical variables will be cut (successively) into two parts, so scaling will not change the results of the model.
- Mathematical functions: for linear regression, scaling has no impact on the predictions (you can prove it mathematically), but for other models, scaling will have an impact (if you have a counterexample, please share). The regularized versions of linear regression (Lasso, Ridge, Elastic Net) take into account both the quadratic error and the (L1, L2, or both) norm of the coefficients, so the coefficients will be impacted by the scaling of the numerical variables.
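We can check these claims numerically with a small NumPy sketch (toy data; the "ridge" here penalizes only the slope, as a simplified stand-in for the regularized models): rescaling the feature leaves plain OLS predictions unchanged but changes the regularized predictions.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 1, 40)
y = 5 * x + rng.normal(0, 0.1, 40)

def fit_predict(x, y, alpha, x_new):
    # Linear model y = ax + b; alpha penalizes the slope only (ridge-like)
    X = np.column_stack([x, np.ones_like(x)])
    P = np.diag([alpha, 0.0])
    a, b = np.linalg.solve(X.T @ X + P, X.T @ y)
    return a * x_new + b

pred_ols  = fit_predict(x,       y, 0.0, 0.9)
pred_ols2 = fit_predict(x * 100, y, 0.0, 90.0)   # same point, feature rescaled
pred_rdg  = fit_predict(x,       y, 1.0, 0.9)
pred_rdg2 = fit_predict(x * 100, y, 1.0, 90.0)

print(np.isclose(pred_ols, pred_ols2))   # True: plain OLS is scale-invariant
print(np.isclose(pred_rdg, pred_rdg2))   # False: the regularized fit is not
```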
Missing value handling
- Nearest neighbors: since only numerical values are taken into account, missing values cannot be handled natively, and various imputation methods can be used.
- Decision trees: missing values can be handled as a category of their own, without any particular imputation.
- Mathematical functions: for numerical variables, as for nearest neighbors, imputations have to be done, and the missing values should be estimated in a meaningful way. Setting them to 0 would usually not be suitable, since it can create discontinuities and anomalies; imputation by the mean value is often used instead. For categorical variables, the “most frequent category” method can be used, but it may not always be the most suitable. For example, if we think the values are missing for some particular reason, a dedicated category can be created. Since the variables are one-hot encoded, there is no problem with discontinuity.
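A minimal sketch of the two imputation strategies above, on hypothetical data: mean value for a numerical feature, a dedicated category for a categorical one.

```python
# Numerical feature: fill missing values with the mean of the known values
ages = [22.0, None, 35.0, None, 58.0]
known = [a for a in ages if a is not None]
mean_age = sum(known) / len(known)
ages_filled = [a if a is not None else mean_age for a in ages]

# Categorical feature: a dedicated "MISSING" category, later one-hot encoded
ports = ["S", None, "C", "Q", None]
ports_filled = [p if p is not None else "MISSING" for p in ports]

print(ages_filled)
print(ports_filled)
```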
Feature importance
Feature importance can be model-agnostic; here we only consider how each algorithm can directly give some insights about how the features are used, and therefore their importance in the prediction process.
- Nearest neighbors: since all variables are used in the same way in the distance computation, the notion of feature importance is not really relevant here. However, if we know that some features should matter more (according to business considerations), we can assign weights to the predictors.
- Decision trees: there are different ways of estimating feature importance. Possible indicators are how often a variable is used in the splits and how impactful each split is (in terms of Gini or entropy variation). There can be a difference between continuous and categorical variables: since continuous variables are cut into categories to create splits, a continuous variable has more chances to be used because it can be cut several times. For a categorical variable, we can consider its importance for each of its categories or for all categories together.
- Mathematical functions: for each predictor, there is usually one coefficient (there are more for neural networks), and the coefficients can be used to assess feature importance. However, a higher coefficient does not necessarily mean higher importance; we also have to consider the scale of the features (this article explains some pitfalls of interpretation). In traditional linear regression modeling, we often consider the p-value (from null hypothesis tests on the coefficients) as the significance of the coefficients. (I have not checked yet: is considering the p-values equivalent to considering the coefficients when the features are standardized?)
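The scale pitfall for coefficients can be shown in a few lines (synthetic data; the two columns are the same quantity measured in different units): the raw coefficient changes with the unit, but after standardizing the feature, the unit disappears.

```python
import numpy as np

rng = np.random.default_rng(3)
x_m  = rng.normal(0, 1, 200)      # a feature measured in meters
x_mm = x_m * 1000                 # the same feature in millimeters
y = 2 * x_m + rng.normal(0, 0.1, 200)

coef_m  = np.polyfit(x_m, y, 1)[0]    # slope ≈ 2.0
coef_mm = np.polyfit(x_mm, y, 1)[0]   # slope ≈ 0.002 — looks "less important"

# After standardizing, both versions give the same coefficient
std_m  = np.polyfit(x_m / x_m.std(), y, 1)[0]
std_mm = np.polyfit(x_mm / x_mm.std(), y, 1)[0]
print(std_m, std_mm)  # the same value
```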
Model enhancement and relationships with other algorithms
After all the previous analysis, we now have some idea of the advantages and drawbacks of the basic algorithms in each of the three categories. Since the shortcomings are different, the ways of improving the models are different, and that is also where we can relate algorithms across the three categories.
Nearest neighbors models
We can improve the nearest neighbors in the following ways:
- one major drawback is that it is not really a model, since it stores the whole dataset. One extreme simplification is to compute the centroid of each class and then use the centroids, rather than all the data points, to calculate distances. This approach is interesting to consider because it can be related to k-means (yes, an unsupervised learning algorithm) and GDA.
- another way to enhance nearest neighbors is to improve the notion of distance. Euclidean distance is usually used, and we can use its generalized version (Minkowski distance) with different power parameters. We can also use other types of mathematical distance: for example, the Mahalanobis distance will lead us to LDA. The idea is that data points at different distances can be taken into account differently. For nearest neighbors, a data point is either “a neighbor” or “not a neighbor”, whereas for GDA or logistic regression, the distance is modeled in a parametric way.
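The centroid simplification from the first point above can be sketched in plain Python (hypothetical two-class toy data): one centroid per class replaces the whole dataset, and prediction becomes "nearest centroid".

```python
import math

points = {"A": [(0, 0), (1, 1), (0, 1)], "B": [(8, 8), (9, 9), (8, 9)]}

# One centroid per class replaces the stored dataset
centroids = {
    cls: tuple(sum(coord) / len(pts) for coord in zip(*pts))
    for cls, pts in points.items()
}

def predict(x):
    # Nearest-centroid rule: compare against 2 centroids instead of 6 points
    return min(centroids, key=lambda cls: math.dist(centroids[cls], x))

print(predict((1, 2)))  # 'A'
print(predict((7, 7)))  # 'B'
```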
What else? Is there a relationship between nearest neighbors and decision trees? We can say that the leaves of decision trees contain neighbors. No distance is used; the neighbors are found in another way, with rules. So you don’t have to store the actual neighbors to make predictions: you only need to store the rules that find them, and once the prediction of each leaf is computed, even the neighbors themselves are no longer needed.
Decision tree-based models
For decision trees, the approach to improving the model is mainly aggregating trees. And aggregation/addition is a mathematical operation, so we can relate Boosting to GAM (Generalized Additive Model). It is a very powerful idea, because to optimize the aggregation of trees in Gradient Boosting Machine, Gradient Descent is used. (Here is an interesting discussion.) Loss functions can also be adapted to optimize for different tasks.
It is worth noting that Boosting can be used with all sorts of base learners, but for linear regression, adding linear regressions still yields a linear model (here and here we can find some interesting discussions), whereas one interesting characteristic of the decision tree is that it is non-linear by default (here).
Another idea is to introduce functions for the observations in the leaves. By default, for a regression task, the average value of the leaf is computed; the idea is to introduce more complex relationships inside each leaf, for example a linear one. We can find some references here.
The approach can also be used in a reversed way for linear regression: instead of fitting a single linear relationship over the whole range of the predictors, one can cut the feature variables into different regions and fit a linear regression in each region.
Linear models enhancement
Let’s mention some cases where linear regression fails and how to improve it:
- when using linear regression for classification, it is very sensitive to outliers; that’s why logistic regression or SVM is more suitable. Logistic regression reduces the influence of outliers (those far from the frontier between the two classes), and SVM simply ignores them.
- when the target variable has some specific property (only positive values, count data, binary values), it is better to reflect this by choosing a link function (GLM) or an adapted loss function.
- one major drawback of linear regression is that it is linear and the real world is full of non-linearities
To create non-linear models, there are different approaches:
- Feature engineering: one could argue that feature engineering is not an algorithm and should be done in the preprocessing phase. I mention it here for two reasons. 1) The boundary between feature engineering and predictive modeling can be blurred, especially for deep learning models. There is also an interesting question about whether linear regression can be overfitted: we usually take the example of polynomial regression to illustrate overfitting of linear regression. Some may argue that polynomial regression is not a simple linear regression but another, more complex regression algorithm with the degree of the polynomial as a hyperparameter. But we can also consider that the polynomial features exist in the real world, and then linear regression is simply applied to them (here is an interesting discussion). 2) Feature engineering techniques can be related to other algorithms: binning uses decision trees (or expert-based rules) to cut continuous variables into categories, feature interaction is like feature space mapping, and a log transformation is like applying a link function.
- Local regression: we already discussed its relationship with nearest neighbors models.
- Feature space mapping: the idea is to project the non-linear data into a feature space where the data become linear. One simple example is polynomial regression, even if it is not usually presented this way. Here is a lecture note that is very helpful to unify the approach. With regularization, it becomes kernel ridge regression. The approach is mostly used with SVM because of its computational performance. It is worth noting the difference between feature space mapping and the kernel trick. (Here is an interesting discussion.)
- Combination of functions: GAM uses the additive approach to combine functions, whereas neural networks use layers of activation functions inside a globally linear function (or a softmax regression). The intermediate layers transform the non-linear data into a final space where the data can be modeled linearly. You can find an article I wrote below to visualize the transformation of the data by a neural network.
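The feature space mapping idea from the list above, in its simplest polynomial form (toy data of my own): a plain linear fit in the mapped space (x, x²) captures a relationship that is non-linear in the original x.

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(-2, 2, 100)
y = x**2 + rng.normal(0, 0.1, 100)          # clearly non-linear in x

# Map x into the feature space (x, x^2, 1) and do a plain linear fit there
Phi = np.column_stack([x, x**2, np.ones_like(x)])
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)

print(np.round(w, 2))  # ≈ [0, 1, 0]: linear in the mapped space fits x^2
```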
For classification algorithms in particular, we can mention a few more specificities.
For the relationship between Linear regression and LDA, there is an interesting discussion here.
For LDA and logistic regression, I wrote an article to show that they are in fact very similar: they are the same model (with the same mathematical function; only the coefficients of the variables are different).
For more details, you can read this article that I wrote about the transformation of linear classifiers into non-linear classifiers, with some visualizations.
When a general question is asked about all machine learning algorithms, it is important to bear in mind that the answer can differ depending on the algorithm. I hope that this overview can be useful to answer such questions thoroughly. Please let me know if you have any questions; each discussion helps me to better understand all these algorithms.
Some may notice that there are algorithms mentioned in the first overview that I didn’t address in this article: ARIMA, RNN, CNN, the attention mechanism, etc. I am writing similar articles about unsupervised learning algorithms, computer vision techniques, NLP techniques, and deep learning algorithms. Please subscribe if you are interested!