Over the past year, I have taken more and more interest in Bayesian statistics and probabilistic modeling. Along this journey, I have encountered the latent probabilistic models.
We will start by explaining the concept of a latent variable. And to properly understand its benefits we need to make sure that you are familiar with the Bayesian framework for linear regression and this is what this first article is about.
As a quick note, I am no expert in Bayesian statistics. But I wanted to share the current state of my knowledge because I know it can help some people (myself included). Let’s dive in, shall we!
This post was inspired by this excellent course on Coursera: Bayesian Methods for machine learning. If you are into machine learning I definitely recommend this course.
The Bayesian framework for linear regression
In a typical regression problem, a response variable y is modeled as a weighted linear combination of predictors and can be expressed for example with 7 predictors using the following formula:
Or in plain matrix notation:
In the frequentist paradigm we find the vectors of weights w with maximum likelihood estimation (MLE) by computing the total sum of squares and resolving the following optimization problem:
There exist a closed-form solution to this problem which can be expressed as:
I don’t want to get into the details of how this formula can be found neither some of the assumptions that this model has to fulfill because it is not the subject of this post.
On the contrary, what we would like to point out is that what we get from the resolution of the maximum likelihood estimation is a vector of parameters w that is to be regarded as a point estimate in the parameter space.
But as we are adopting the Bayesian point of view today, we would like to express our regression problem using probability distributions.
In the Bayesian perspective, the linear regression problem is expressed using the language of probabilities. In order to do that, what we can do is start by drawing our random variables and connect them when we think that one has a direct impact on the other (this is called a Bayesian network).
We can now ask ourselves what is the probability of observing such as situation, ie. what is the joint probability of this model? Given the definition of conditional probabilities, we know that:
But in fact, we are not really interested in modeling the joint probability. What we are really interested in is modeling P(y,w|X). We want to model the probability of observing a specific response with a specific set of parameters given the data. This will furtherly allow us to perform inference about the parameter vector w.
Now we know that given both the definition of joint probabilities and conditional probabilities :
Does this look familiar? Yet it does, it is the Bayes formula:
In the Bayesian viewpoint, finding the probability distribution of the parameters w is regarded as solving the inversion problem. We want to find the posterior distribution over the parameters given the response and the data.
But there is no direct causation (as one can see in our model) between the data and the weights. Those random variables are independent. So we can safely write that :
In order to compute the posterior, we first need to define the probabilities of the likelihood and the prior. The most common way to do that is to say that the response variable y is sampled from the following normal distribution:
This bit is common with the frequentist view that we find in the ordinary least square regression. But in the Bayesian perspective, you also come with a prior belief about the parameters. This prior can simply be the standard normal distribution:
Now we know how to compute the likelihood and the prior (using the analytical formula for the normal distribution). But in order to compute the posterior, we also need to compute the evidence. In reality, the evidence acts as a normalization constant. It allows to posterior to be a proper probability distribution integrating to 1 by taking into account all the other possible alternatives. And with the following simple derivation…
…we see more clearly that computing the evidence is about being able to compute the probability of the given data.
On the computation of the evidence
For the sake of this article, I said that in order to compute the posterior you need to compute the evidence. But this is not entirely right. In fact, what you can do is simply compute the maximum a posteriori by solving the following maximization problem:
As we can see, the evidence P(y|X) does not depend on the parameters w. So we don’t need to compute it to find the MAP estimate (we only need a numerical optimization method like gradient descent). But there are a few issues with using the MAP estimate and the major one is that you will only get a point estimate. We want to get credible regions along with our parameter values.
Another way to avoid computing the evidence is to use conjugate priors. The prior is said to be conjugate to the likelihood if the prior and the posterior come from the same family of distributions. For example, if the prior and the likelihood come from a normal distribution, and because the product of two normal distributions is also a normal distribution than you know that the posterior will also come from a normal distribution. Furthermore, you can compute the posterior distribution using a closed-form expression. The Gaussian family is conjugate to itself. In fact, all members of the exponential family have conjugate priors.
So let’s make a pause right now and think for a moment of what we have here. We have defined a way to compute the posterior given a prior and a likelihood (up to a normalization constant). In the Bayesian viewpoint, you come up with a certain belief (the prior) and in light of the data, you compute a likelihood that will allow you to update this belief and get the unnormalized posterior.
We now know how to compute a posterior and we have ways to make it numerically tractable (by using conjugate priors). But what if our prior does not come from the exponential family? Well, we can not use conjugate priors and we need to compute the evidence. What else can we do? We will give part of the answer in the next section.
What is a latent variable model?
A latent variable also called a hidden variable is a random variable that explains the process you are trying to model in some way but for which you don’t have any data because the concept being represented is generally not directly measurable.
An example is worth a thousand words so we will give a fake but realistic example.
Framing the problem
Let’s say you and your wife are planning to move abroad. You want to try the experience but you don’t know yet the country you want to go to.
So you decide to collect some data and predict your settlement satisfaction on a continuous scale from 0.0 to 100.0. Thanks to Google and their new data sources search engine you manage to get your hands on a dataset with the following properties on the people living in the countries of your
list: wealth (x1), employment (x2), environment (x3), physical and mental health (x4), education (x5), recreation and leisure time (x6), and social belonging (x7). Also, because you are quite a resourceful person, you managed to find a big list of immigrants for certain countries on your list. You contacted each of them and asked them to score their respective settlement satisfaction (y). But this takes a lot of time so you were not able to collect data for every country. You are now faced with a standard regression problem. You have training data regarding the properties that can affect the settlement satisfaction of immigrants and you want to make predictions for the countries you don’t have satisfaction scores.
All of these properties have an impact on one another. For example, having a good education will have an impact on the employment rate, and having a good employment rate will allow people to send their kids to good schools, etc… So if we were to represent the model using a bayesian network with the nodes being the random variables and the edges the direct causation between them, we would end up with something like this :
Everything is connected to everything. Now a Bayesian network does not give you the analytical formula to compute the probability of observing a specific response with a specific set of predictors. What it tells you is how your random variables can influence each other.
In fact, it is up to you, the modeler, to define the probability model. So, as with any regression problem, we start by assigning a weight to each of our variables. Now what we could simply say is that the response variable y is a weighted linear combination of the predictors. So the probability of obtaining a specific response with a specific set of predictors, or in other words, the joint probability of the response and the data is a linear combination of the weights and the data divided by a normalization constant :
The normalization constant Z ensures that we get back a proper probability distribution (the sum of probabilities integrates to 1). And this constant is equal to the sum of all possible weight configurations :
So far so good. As you can see, because we defined the probability of the response variable given the data as a simple linear combination of the predictors, the computation of the normalization constant is tractable (because we can compute the integrals independently).
Now, what if we decide to use a non-linear function to define the probability model, like so:
Then the normalization constant to compute the evidence becomes:
The computation is now way more complex, making it impractical. So how can we simplify this problem to make the computation of the evidence tractable? Well, we are going to introduce a latent variable called “Quality of life” like so:
Now, what does this mean? We are not modeling the links between the factors anymore but they still influence one another through this central notion of “Quality of life”. This has simplified our problem by a huge margin because we can now compute the probability of having each factor independently given a specific level of “Quality of life”.
This means that we can marginalize out the quality of life :
This trick has reduced by a huge margin the number of computations we need to perform because we can now compute the probabilities independently instead of for every possible combination.
Also, one notable property of this model is that by introducing the latent variable we have found a way to capture the notion of “Quality of life”. Or at least we have tried our best to do so. And this can be in its own right the very purpose of using a latent variable. Introducing them can indeed be a way to measure concepts that are not directly measurable.
In this article, we have reviewed the foundation of Bayesian statistics and how we express a regression problem using the language of probabilities.
We have seen that solving the inversion problem can be extremely difficult and we have introduced ways to avoid that complexity.
Among those ways, we find the usage of latent variable models.
As we will see in the following article, latent variable models are not only used to reduce complexity but can also be used to frame a specific problem in order to come up with an algorithm. For example, in order to produce a soft clustering model like the Gaussian Mixture model.