Multivariate normal distribution

This post will introduce you to the multivariate normal (multivariate Gaussian) distribution. It will show you how to represent, visualize, sample, and compute conditionals from this distribution.

This post assumes a basic understanding of probability theory, probability distributions and linear algebra.

In [1]:

Univariate normal distribution

The normal distribution , also known as the Gaussian distribution, is so called because its based on the Gaussian function . This distribution is defined by two parameters: the mean $\mu$, which is the expected value of the distribution, and the standard deviation $\sigma$, which corresponds to the expected deviation from the mean. The square of the standard deviation is typically referred to as the variance $\sigma^2$. We denote this distribution as:

$$ \mathcal{N}(\mu, \sigma^2) $$

Given this mean and variance we can calculate the probility densitiy function (pdf) of the normal distribution with the normalised Gaussian function. For a value $x$ the density is:

$$ p(x \mid \mu, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{\textstyle \left( -\frac{(x - \mu)^2}{2\sigma^2}\right)} $$

We call this distribution the univariate normal because it consists of only one random normal variable. Some examples of univariate normal distributions with different mean and variance are plotted in the next figure:

In [2]:
def univariate_normal(x, mean, variance):
    """pdf of the univariate normal distribution."""
    return (1. / np.sqrt(2 * np.pi * variance)) * np.exp(
        -(x - mean)**2 / (2 * variance))
In [3]:

Multivariate normal distribution

The multivariate normal distribution is a multidimensional generalisation of the one-dimensional normal distribution . It represents the distribution of a multivariate random variable that is made up of multiple random variables that can be correlated with eachother.

Like the normal distribution, the multivariate normal is defined by sets of parameters: the mean vector $\mathbf{\mu}$, which is the expected value of the distribution; and the covariance matrix $\Sigma$, which measures how dependend two random variables are and how they change together. We denote the covariance between variable $X$ and $Y$ as $C(X,Y)$.

The multivariate normal with dimensionality $d$ has a joint probability density given by:

$$ p(\mathbf{x} \mid \mathbf{\mu}, \Sigma) = \frac{1}{\sqrt{(2\pi)^d \lvert\Sigma\rvert}} e^{\textstyle \left( -\frac{1}{2}(\mathbf{x} - \mathbf{\mu})^T \Sigma^{-1} (\mathbf{x} - \mathbf{\mu}) \right)} $$

Where $\mathbf{x}$ a random vector of size $d$, $\mathbf{\mu}$ is the mean vector, $\Sigma$ is the ( symmetric , positive definite ) covariance matrix (of size $d \times d$), and $\lvert\Sigma\rvert$ its determinant . We denote this multivariate normal distribution as:

$$ \mathcal{N}(\mathbf{\mu}, \Sigma) $$

In [4]:
def multivariate_normal(x, d, mean, covariance):
    """pdf of the multivariate normal distribution."""
    x_m = x - mean
    return 1. / (np.sqrt((2 * np.pi)**d * np.linalg.det(
        covariance))) * np.exp(-(np.linalg.solve(
        covariance, x_m) / 2)

Examples of two bivariate normal distributions are plotted below.

The figure on the left is a bivariate distribution with the covariance between $x_1$ and $x_2$ set to $0$ so that these 2 variables are independent:

$$ \mathcal{N}\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} 1 & 0 \\ 0 & 2 \end{bmatrix}\right) $$

The figure on the right is a bivariate distribution with the covariance between $x_1$ and $x_2$ set to be different than $0$ so that both variables are correlated. Increasing $x_1$ will increase the probability that $x_2$ will also increase:

$$ \mathcal{N}\left( \begin{bmatrix} 0 \\ 1 \end{bmatrix}, \begin{bmatrix} 1 & 0.8 \\ 0.8 & 2 \end{bmatrix}\right) $$

In [5]:

Mean and Variance of affine transformations of the multivariate normal

It is possible to transform a multivariate normal distribution into a new normal distribution with an affine transformation . More specificaly if $X$ is normally distributed and $Y = LX + u$ with $L$ a linear transformation and $u$ a vector then $Y$ is also normally distributed with mean $\mu_{Y} = u + L\mu_{X}$ and covariance matrix $\Sigma_{Y} = L\Sigma_{X}L^T$.

$$Y \sim \mathcal{N}(\mu_{Y}, \Sigma_{Y}) \quad\quad X \sim \mathcal{N}(\mu_{X}, \Sigma_{X}) \\ \mathcal{N}(\mu_{Y}, \Sigma_{Y}) = \mathcal{N}(u + L\mu_{X}, L\Sigma_{X}L^T) = L\mathcal{N}(\mu_{X}, \Sigma_{X}) + u$$

This can be proven als follows:

$$\mu_{Y} = \mathbb{E}[Y] = \mathbb{E}[LX + u] = \mathbb{E}[LX] + u = L\mu_{X} + u$$

$$\begin{split} \Sigma_{Y} & = \mathbb{E}[(Y-\mu_{Y})(Y-\mu_{Y})^\top] \\ & = \mathbb{E}[(LX+u - L\mu_{X}-u)(LX+u - L\mu_{X}-u)^\top] \\ & = \mathbb{E}[(L(X-\mu_{X})) (L(X-\mu_{X}))^\top] \\ & = \mathbb{E}[L(X-\mu_{X}) (X-\mu_{X})^\top L^\top] \\ & = L\mathbb{E}[(X-\mu_{X})(X-\mu_{X})^\top]L^\top \\ & = L\Sigma_{X}L^\top \end{split}$$

Sampling from a multivariate normal

The previous formula helps us to sample from any multivariate Guassian .
To do this sampling we can sample $X$ from the standard normal distribution $X \sim \mathcal{N}(0, I_d)$, where the mean is the vector $\mu=0$ and the covariance is the identity matrix $\Sigma=I_d$. Sampling from this distribution is easy because each variable in $X$ is independent from all other variables, we can just sample each variable separately.

It is then possible to sample $Y$ from $\mathcal{N}(\mu_{Y}, \Sigma_{Y})$ by sampling $X$ and applying the affine transform on the samples. This tranform is $Y = LX + u$ where we know from the previous section that the covariance of $Y$ will be $\Sigma_{Y} = L\Sigma_{X}L^\top$. Since $\Sigma_{X}=I_d$ we can write that $\Sigma_{Y} = L I_d L^\top = L L^\top$. $L$ can now be found by a technique called the Cholesky decompostion which does exactly the decomposition we need. The vector $u$ is then $\mu_{Y}$ since $\mu_{X}=0$ ($u = \mu_{Y} - L\mu_{X}$).

Lets try this out by sampling 50 samples from:

$$ Y \sim \mathcal{N}\left( \begin{bmatrix} 0 \\ 1 \end{bmatrix}, \begin{bmatrix} 1 & 0.8 \\ 0.8 & 2 \end{bmatrix}\right) $$

The sampling is done by the following code and the samples are plotted as red dots on the probability density surface below.

In [6]:
# Sample from:
d = 2 # Number of dimensions
mean = np.matrix([[0.], [1.]])
covariance = np.matrix([
    [1, 0.8], 
    [0.8, 2]])
n = 50  # Samples to draw

# Create L
L = np.linalg.cholesky(covariance)
# Sample X from standard normal
X = np.random.normal(size=(d,n))
# Apply the transformation
Y = + mean
In [7]:

Marginal and Conditional normal distributions

If both $\mathbf{x}$ and $\mathbf{y}$ are jointly normal random vectors defined as: $$ \begin{bmatrix} \mathbf{x} \\ \mathbf{y} \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} \mu_{\mathbf{x}} \\ \mu_{\mathbf{y}} \end{bmatrix}, \begin{bmatrix} A & C \\ C^T & B \end{bmatrix} \right) = \mathcal{N}(\mu, \Sigma) , \qquad \Sigma^{-1} = \Lambda = \begin{bmatrix} \tilde{A} & \tilde{C} \\ \tilde{C}^T & \tilde{B} \end{bmatrix} $$

Marginal distribution

The marginal distributions are the univariate distributions of each component $\mathbf{x}$ and $\mathbf{y}$ seperately, and are defined as:

$$ \begin{split} p(\mathbf{x}) & = \mathcal{N}(\mu_{\mathbf{x}}, A) \\ p(\mathbf{y}) & = \mathcal{N}(\mu_{\mathbf{y}}, B) \end{split} $$

Conditional distribution

The conditional distribution of $\mathbf{x}$ given $\mathbf{y}$ is defined as:

$$ p(\mathbf{x} \mid \mathbf{y}) = \mathcal{N}(\mu_{x|y}, \Sigma_{x|y}) $$

With: $$\begin{split} \Sigma_{x|y} & = A - CB^{-1}C^\top = \tilde{A}^{-1} \\ \mu_{x|y} & = \mu_x + CB^{-1}(\mathbf{y}-\mu_y) \end{split}$$

With $A - CB^{-1}C^\top = \tilde{A}^{-1}$ the Schur complement of B in $\Sigma$. (You can find the full proof at R. Wang's notes ).
The computation of the conditional covariance matrix $\Sigma_{x|y}$ can be viewed as inverting the covariance matrix $\Sigma^{-1} = \Lambda$, dropping the rows and columns corresponding to the variables $\mathbf{y}$ that are being conditioned upon (Selecting $\tilde{A}$), and inverting back to get the conditional covariance matrix $\Sigma_{x|y}=\tilde{A}^{-1}$.

The shift of the mean can be seen as getting the residual of the variable conditioned upon $(\mathbf{y}-\mu_y)$, normalising this with the covariance $B$ of the variable conditioned upon, and transforming it to the space of $\mathbf{x}$ by the covariances between $\mathbf{x}$ and $\mathbf{y}$ $(C)$.

The conditional distributions of the example used above are calculated and plotted in the following code:

In [8]:
d = 2  # dimensions
mean = np.matrix([[0.], [1.]])
cov = np.matrix([
    [1, 0.8], 
    [0.8, 2]])

# Get the mean values from the vector
mean_x1 = mean[0,0]
mean_x2 = mean[1,0]
# Get the blocks (single values in this case) from 
#  the covariance matrix
A = cov[0, 0]
B = cov[1, 1]
C = cov[0, 1]  # = C transpose in this case

# Calculate x1|x2
x2_condition = 1.  # To condition on x2

mean_1given2 = mean_x1 + (C * (1/B) * (x2_condition - mean_x2))
cov_1given2 = A - C * (1/B) * C

# Calculate x2|x1
x1_condition = 1.5  # To condition on x1
mean_2given1 = mean_x2 + (C * (1/A) * (x1_condition - mean_x1))
cov_2given1 = B - (C * (1/A) * C)
In [9]: