 # Bayesian Linear Regression in Machine Learning

Autor:in
Lesezeit
20 ​​min

This article takes a look at the mathematics of Machine Learning with focus on Bayesian linear regression. In the course of the article, we will work our way from linear regression to Bayesian linear regression, including the most important theoretical knowledge and code examples.

The content of this blog post was originally published as a Jupyter notebook here. You can run the code interactively in your browser using Binder.

## What is Bayesian linear regression (BLR)?

Bayesian linear regression is the Bayesian interpretation of linear regression. What does that mean? To answer this question we first have to understand the Bayesian approach. In most of the algorithms we have looked at so far, we computed point estimates of our parameters. For example, in linear regression we chose values for the weights and bias that minimized our mean squared error cost function. In the Bayesian approach we do not work with exact values but with probabilities. This allows us to model the uncertainty in our parameter estimates. Why is this important?

In nearly all real-world situations, our data and knowledge about the world is incomplete, indirect and noisy. Hence, uncertainty must be a fundamental part of our decision-making process. This is exactly what the Bayesian approach is about. It provides a formal and consistent way to reason in the presence of uncertainty. Bayesian methods have been around for a long time and are widely-used in many areas of science (e.g. astronomy). Although Bayesian methods have been applied to machine learning problems too, they are usually less well known to beginners. The major reason is that they require a good understanding of probability theory.

## Recap linear regression

• In linear regression, we want to find a function $$f$$ that maps inputs $$x \in \mathbb{R}^D$$ to corresponding function values $$f(x) \in \mathbb{R}$$.
• We are given an input dataset $$D = \big \{ \mathbf{x}_n, y_n \big \}_{n=1}^N$$, where $$y_n$$ is a noisy observation value: $$y_n = f(x_n) + \epsilon$$, with $$\epsilon$$ being an i.i.d. random variable that describes measurement/observation noise.
• Our goal is to infer the underlying function $$f$$ that generated the data such that we can predict function values at new input locations.
• In linear regression, we model the underlying function $$f$$ using a linear combination of the input features:

$$\begin{split} y &= \theta_0 + \theta_1 x_1 + \theta_2 x_2 + … + \theta_d x_d \\ &= \pmb{x}^T \pmb{\theta} \end{split}$$

• For more details take a look at the notebook on linear regression.
• ## Fundamental concepts

• One fundamental tool in Bayesian learning is Bayes‘ theorem.
• Bayes‘ theorem looks as follows:
$$\begin{equation} p(\pmb{\theta} | \mathbf{x}, y) = \frac{p(y | \pmb{x}, \pmb{\theta})p(\pmb{\theta})}{p(\pmb{x}, y)} \end{equation}$$
• $$p(y | \pmb{x}, \pmb{\theta})$$ is the likelihood. It describes the probability of the target values given the data and parameters.
• $$p(\pmb{\theta})$$ is the prior. It describes our initial knowledge about which parameter values are likely and unlikely.
• $$p(\pmb{x}, y)$$ is the evidence. It describes the joint probability of the data and targets.
• $$p(\pmb{\theta} | \pmb{x}, y)$$ is the posterior. It describes the probability of the parameters given the observed data and targets.
• Another important tool you need to know about is the Gaussian distribution. If you are not familiar with it, I suggest you pause for a minute and understand its main properties before reading on.

The general Bayesian inference approach is as follows:

2. We observe some data, gaining new evidence $$e$$.
3. We update our belief using Bayes‘ theorem, resulting.

Finally, we use Bayes‘ theorem
$$p(h|e) = \frac{p(e |h)p(h)}{p(e)}$$
to incorporate the new evidence yielding a refined posterior belief.

## Linear regression from a probabilistic perspective

In order to pave the way for Bayesian linear regression, we will take a probabilistic spin on linear regression. Let’s start by explicitly modelling the observation noise $$\epsilon$$. For simplicity, we assume that $$\epsilon$$ is normally distributed with mean $$0$$ and some known variance $$\sigma^2$$: $$\epsilon \sim \mathcal{N}(0, \sigma^2)$$.

As mentioned in the beginning, a simple linear regression model assumes that the target function $$f(x)$$ is given by a linear combination of the input features:
$$\begin{split} y = f(\pmb{x}) + \epsilon \\ = \pmb{x}^T \pmb{\theta} + \epsilon \end{split}$$

This corresponds to the following likelihood function:
$$p(y | \pmb{x}, \pmb{\theta}) = \mathcal{N}(\pmb{x}^T \pmb{\theta}, \sigma^2)$$

Our goal is to find the parameters $$\pmb{\theta} = \{\theta_1, …, \theta_D\}$$ that model the given data best.
In standard linear regression we can find the best parameters using a least-squares, maximum likelihood (ML) or maximum a posteriori (MAP) approach. If you want to know more about these solutions, take a look at the notebook on linear regression or at chapter 9.2 of the book Mathematics for Machine Learning.

## Linear regression with basis functions

The simple linear regression model above is linear not only with respect to the parameters $$\pmb{\theta}$$ but also with respect to the inputs $$\pmb{x}$$. When $$\pmb{x}$$ is not a vector but a single value (that is, the dataset is one-dimensional), the model $$y_i = x_i \cdot \theta$$ describes straight lines with $$\theta$$ being the slope of the line.

The plot below shows example lines produced with the model $$y = x \cdot \theta$$, using different values for the slope $$\theta$$ and intercept 0. Having a model which is linear both with respect to the parameters and inputs limits the functions it can learn significantly. We can make our model more powerful by making it nonlinear with respect to the inputs. After all, linear regression refers to models which are linear in the parameters, not necessarily in the inputs (linear in the parameters means that the model describes a function by a linear combination of input features).

Making the model nonlinear with respect to the inputs is easy. We can adapt it by using a nonlinear transformation of the input features $$\phi(\pmb{x})$$. With this adaptation our model looks as follows:
$$\begin{split} y &= \pmb{\phi}^T(\pmb{x}) \pmb{\theta} + \epsilon \\ &= \sum_{k=0}^{K-1} \theta_k \phi_k(\pmb{x}) + \epsilon \end{split}$$

Where $$\pmb{\phi}: \mathbf{R}^D \rightarrow \mathbf{R}^K$$ is a (non)linear transformation of the inputs $$\pmb{x}$$ and $$\phi_k: \mathbf{R}^D \rightarrow \mathbf{R}$$ is the $$k-$$th component of the feature vector $$\pmb{\phi}$$:

$$\pmb{\phi}(\pmb{x})=\left[\begin{array}{c} \phi_{0}(\pmb{x}) \\ \phi_{1}(\pmb{x}) \\ \vdots \\ \phi_{K-1}(\pmb{x}) \end{array} \right] \in \mathbb{R}^{K}$$

With our new nonlinear transformation the likelihood function is given by

$$p(y | \pmb{x}, \pmb{\theta}) = \mathcal{N}(\pmb{\phi}^T(\pmb{x}) \pmb{\theta},\, \sigma^2)$$

### Example basis functions

#### Linear regression

The easiest example for a basis function (for one-dimensional data) would be simple linear regression, that is, no nonlinear transformation at all. In this case, we would choose $$\phi_0(x) = 1$$ and $$\phi_i(x) = x$$. This would result in the following vector $$\pmb{\phi}(x)$$:

$$\pmb{\phi}(x)= \left[ \begin{array}{c} \phi_{0}(x) \\ \phi_{1}(x) \\ \vdots \\ \phi_{K-1}(x) \end{array} \right] = \left[ \begin{array}{c} 1 \\ x \\ \vdots \\ x \end{array} \right] \in \mathbb{R}^{K}$$

#### Polynomial regression

Another common choice of basis function for the one-dimensional case is polynomial regression. For this we would set $$\phi_i(x) = x^i$$ for $$i=0, …, K-1$$. The corresponding feature vector $$\pmb{\phi}(x)$$ would look as follows:

$$\pmb{\phi}(x)= \left[ \begin{array}{c} \phi_{0}(x) \\ \phi_{1}(x) \\ \vdots \\ \phi_{K-1}(x) \end{array} \right] = \left[ \begin{array}{c} 1 \\ x \\ x^2 \\ x^3 \\ \vdots \\ x^{K-1} \end{array} \right] \in \mathbb{R}^{K}$$

With this transformation we can lift our original one-dimensional input into a $$K$$-dimensional feature space. Our function $$f$$ can be any polynomial with degree $$\le K-1$$: $$f(x) = \sum_{k=0}^{K-1} \theta_k x^k$$

### The design matrix

To make it easier to work with the transformations $$\pmb{\phi}(\pmb{x})$$ for the different input vectors $$\pmb{x}$$, we typically create a so-called design matrix (also called feature matrix). Given our dataset $$D = \big \{ \mathbf{x}_n, y_n \big \}_{n=1}^N$$, we define the design matrix as follows:

$$\boldsymbol{\Phi}:=\left[\begin{array}{c} \boldsymbol{\phi}^{\top}\left(\boldsymbol{x}_{1}\right) \\ \vdots \\ \boldsymbol{\phi}^{\top}\left(\boldsymbol{x}_{N}\right) \end{array}\right]=\left[\begin{array}{ccc} \phi_{0}\left(\boldsymbol{x}_{1}\right) & \cdots & \phi_{K-1}\left(\boldsymbol{x}_{1}\right) \\ \phi_{0}\left(\boldsymbol{x}_{2}\right) & \cdots & \phi_{K-1}\left(\boldsymbol{x}_{2}\right) \\ \vdots & & \vdots \\ \phi_{0}\left(\boldsymbol{x}_{N}\right) & \cdots & \phi_{K-1}\left(\boldsymbol{x}_{N}\right) \end{array}\right] \in \mathbb{R}^{N \times K}$$

Note that the design matrix is of shape $$N \times K$$. $$N$$ is the number of input examples and $$K$$ is the output dimension of the nonlinear transformation $$\pmb{\phi}(\pmb{x})$$.

## Bayesian linear regression

What changes when we consider a Bayesian interpretation of linear regression? Our data stays the same as before: $$D = \big \{ \mathbf{x}_n, y_n \big \}_{n=1}^N$$. Given the data $$D$$, we can define the set of all inputs as $$\mathcal{X} := \{\pmb{x}_1, …, \pmb{x}_n\}$$ and the set of all targets as $$\mathcal{Y} := \{y_1, …, y_n \}$$.

In simple linear regression, we compute point estimates of our parameters (e.g. using a maximum likelihood approach) and use these estimates to make predictions. Different to this, Bayesian linear regression estimates distributions over the parameters and predictions. This allows us to model the uncertainty in our predictions.

To perform Bayesian linear regression, we follow three steps:

1. We set up a probabilistic model that describes our assumptions how the data and parameters are generated.
2. We perform inference for the parameters $$\pmb{\theta}$$, that is, we compute the posterior probability distribution over the parameters.
3. With this posterior, we can perform inference for new, unseen inputs $$y_*$$. In this step, we do not compute point estimates of the outputs. Instead, we compute the parameters of the posterior distribution over the outputs.

### Step 1: Probabilistic model

We start by setting up a probabilistic model that describes our assumptions how the data and parameters are generated. For this, we place a prior $$p(\pmb{\theta})$$ over our parameters which encodes what parameter values are plausible (before we have seen any data). Example: With a single parameter $$\theta$$, a Gaussian prior $$p(\theta) = \mathcal{N}(0, 1)$$ says that parameter values are normally distributed with mean 0 and standard deviation 1. In other words: the parameter values are most likely to fall into the interval $$[−2,2]$$ which is two standard deviations around the mean value.

To keep things simple, we will assume a Gaussian prior over the parameters: $$p(\pmb{\theta}) = \mathcal{N}(\pmb{m}_0, \pmb{S}_0)$$. Let’s further assume that the likelihood function is Gaussian, too: $$p(y \mid \pmb{x}, \pmb{\theta})=\mathcal{N}\left(y \mid \pmb{\phi}^{\top}(\pmb{x}) \pmb{\theta}, \sigma^{2}\right)$$.

Note: When considering the set of all targets $$\mathcal{Y} := \{y_1, …, y_n \}$$, the likelihood function becomes a multivariate Gaussian distribution: $$p(\mathcal{Y} \mid \mathcal{X}, \pmb{\theta})=\mathcal{N}\left(\pmb{y} \mid \pmb{\Phi} \pmb{\theta}, \sigma^{2} \pmb{I}\right)$$

The nice thing about choosing a Gaussian distribution for our prior is that the posterior distributions will be Gaussian, too (keyword conjugate prior)!

We will start our BayesianLinearRegression class with the knowledge we have so far – our probabilistic model. As mentioned in the beginning, we assume that the variance $$\sigma^2$$ of the noise $$\epsilon$$ is known. Furthermore, to allow plotting the data later on we will assume that it’s two dimensional (d=2).

### Generating a dataset

Before going any further, we need a dataset to test our implementation. Remember that we assume that our targets were generated by a function of the form $$y = \pmb{\phi}^T(\pmb{x}) \pmb{\theta} + \epsilon$$ where $$\epsilon$$ is normally distributed with mean $$0$$ and some known variance $$\sigma^2$$: $$\epsilon \sim \mathcal{N}(0, \sigma^2)$$.

To keep things simple, we will work with one-dimensional data and simple linear regression (that is, no nonlinear transformation of the inputs). Consequently, our data generating function will be of the form
$$y = \theta_0 + \theta_1 \, x + \epsilon$$

Note that we added a parameter $$\theta_0$$ which corresponds to the intercept of the linear function. Until know we assumed $$\theta_0 = 0$$. As mentioned earlier, $$\theta_1$$ represents the slope of the linear function. ### Step 2: Posterior over the parameters

We finished setting up our probabilistic model. Next, we want to use this model and our dataset $$\mathcal{X, Y}$$ to estimate the parameter posterior $$p(\pmb{\theta} | \mathcal{X, Y})$$. Keep in mind that we do not compute point estimates of the parameters. Instead, we determine the mean and variance of the (Gaussian) posterior distribution and use this entire distribution when making predictions.

We can estimate the parameter posterior using Bayes theorem:
$$p(\pmb{\theta} \mid \mathcal{X}, \mathcal{Y})=\frac{p(\mathcal{Y} \mid \mathcal{X}, \pmb{\theta}) p(\pmb{\theta})}{p(\mathcal{Y} \mid \mathcal{X})}$$

• $$p(\mathcal{Y} \mid \mathcal{X}, \pmb{\theta})$$ is the likelihood function, $$p(\mathcal{Y} \mid \mathcal{X}, \pmb{\theta})=\mathcal{N}\left(\pmb{y} \mid \pmb{\Phi} \pmb{\theta}, \sigma^{2} \pmb{I}\right)$$
• $$p(\pmb{\theta})$$ is the prior distribution, $$p(\pmb{\theta})=\mathcal{N}\left(\pmb{\theta} \mid \pmb{m}_{0}, \pmb{S}_{0}\right)$$
• $$p(\mathcal{Y} \mid \mathcal{X})=\int p(\mathcal{Y} \mid \mathcal{X}, \pmb{\theta}) p(\pmb{\theta}) \mathrm{d} \pmb{\theta}$$ is the evidence which ensures that the posterior is normalized (that is, that it integrates to 1).

The parameter posterior can be estimated in closed form (for proof see theorem 9.1 in the book Mathematics for Machine Learning):
\begin{aligned} p(\pmb{\theta} \mid \mathcal{X}, \mathcal{Y}) &=\mathcal{N}\left(\pmb{\theta} \mid \pmb{m}_{N}, \pmb{S}_{N}\right) \\ \pmb{S}_{N} &=\left(\pmb{S}_{0}^{-1}+\sigma^{-2} \pmb{\Phi}^{\top} \pmb{\Phi}\right)^{-1} \\ \pmb{m}_{N} &=\pmb{S}_{N}\left(\pmb{S}_{0}^{-1} \pmb{m}_{0}+\sigma^{-2} \pmb{\Phi}^{\top} \pmb{y}\right) \end{aligned}

Coming back to our BayesianLinearRegression class, we need to add a method which allows us to update the posterior distribution given a dataset.

### Visualizing the parameter posterior

To ensure that our implementation is correct, we can visualize how the posterior over the parameters changes as the model sees more data. We will visualize the distribution using a contour plot – a method for visualizing three-dimensional functions. In our case we want to visualize the density of our bi-variate Gaussian for each point (that is, each slope/intercept combination). The plot below shows an example which illustrates how the lines and colors of a contour plot correspond to a Gaussian distribution: As we can see, the density is highest in the yellow regions decreasing when moving further out into the green and blue parts. This should give you a better understanding of contour plots.

To analyze our Bayesian linear regression class, we will start by initializing a new model. We can visualize its prior distribution over the parameters before the model has seen any real data. The plot above illustrates both the prior parameter distribution and the true parameter values that we want to find. If our model works correctly, the posterior distribution should become more narrow and move closer to the true parameter values as the model sees more datapoints. This can be visualized with contour plots, too! Below we update the posterior distribution iteratively as the model sees more and more data. The contour plots for each step show how the parameter posterior develops and converges close to the true values in the end.        ### Step 3: Posterior predictive distribution

Given the posterior distribution over the parameters, we can determine the predictive distribution (= posterior over the outputs) for a new input $$(\pmb{x}_*, y_*)$$. This is the distribution we are really interested in. A trained model is not particularly useful when we cannot use it to make predictions, right?

The posterior predictive distribution looks as follows:

\begin{aligned} p\left(y_{*} \mid \mathcal{X}, \mathcal{Y}, \pmb{x}_{*}\right) &=\int p\left(y_{*} \mid \pmb{x}_{*}, \pmb{\theta}\right) p(\pmb{\theta} \mid \mathcal{X}, \mathcal{Y}) \mathrm{d} \pmb{\theta} \\ &=\int \mathcal{N}\left(y_{*} \mid \pmb{\phi}^{\top}\left(\pmb{x}_{*}\right) \pmb{\theta}, \sigma^{2}\right) \mathcal{N}\left(\pmb{\theta} \mid \pmb{m}_{N}, \pmb{S}_{N}\right) \mathrm{d} \pmb{\theta} \\ &=\mathcal{N}\left(y_{*} \mid \pmb{\phi}^{\top}\left(\pmb{x}_{*}\right) \pmb{m}_{N}, \pmb{\phi}^{\top}\left(\pmb{x}_{*}\right) \pmb{S}_{N} \pmb{\phi}\left(\pmb{x}_{*}\right)+\sigma^{2}\right) \end{aligned}

First of all: note that the predictive posterior for a new input $$\pmb{x}_{*}$$ is a univariate Gaussian distribution. We can see that the mean of the distribution is given by the product of the design matrix for the new example ($$\pmb{\phi}^{\top}\left(\pmb{x}_{*}\right)$$) and the mean of the parameter posterior ($$\pmb{m}_{N}$$). The variance $$(\pmb{\phi}^{\top}\left(\pmb{x}_{*}\right) \pmb{S}_{N} \pmb{\phi}\left(\pmb{x}_{*}\right)+\sigma^{2}$$) of the predictive posterior has two parts:

1. $$\sigma^{2}$$: The variance of the noise
2. $$\pmb{\phi}^{\top}\left(\pmb{x}_{*}\right) \pmb{S}_{N} \pmb{\phi}\left(\pmb{x}_{*}\right)$$: The posterior uncertainty associated with the parameters $$\pmb{\theta}$$

Let’s add a predict method to our BayesianLinearRegression class which computes the predictive posterior for a new input (you will find the method in the class definition above):

### Visualizing the predictive posterior

Our original dataset follows a simple linear function. After training the model it should be able to predict labels for new datapoints, even if they lie beyond the range from [-1.5, 1.5]. But how can we get from the predictive distribution that our model computes to actual labels? That is easy: we sample from the predictive posterior.

To make sure that we are all on the same page: given a new input example our Bayesian linear regression model predicts not a single label but a distribution over possible labels. This distribution is Gaussian. We can get actual labels by sampling from this distribution.

The code below implements and visualizes this:

• We create some test features for which we want predictions.
• Each feature is given to the trained BLR model which returns a univariate Gaussian distribution over possible labels (pred_posterior = blr.predict(np.array([feat]))).
• We sample from this distribution (sample_predicted_labels = pred_posterior.rvs(size=sample_size)).
• The predicted labels are saved in a format that makes it easy to plot them.
• Finally, we plot each input feature, its true label and the sampled predictions. Remember: the samples are generated from the predictive posterior returned by the predict method. Think of a Gaussian distribution plotted along the y-axis for each feature. We visualize this with a histogram: more likely values close to the mean will be sampled more often than less likely values. 