Regression quattro stagioni
Linear regression implemented four different ways
Linear regression is one of the simplest machine learning models out there. According to some, it shouldn't even be classified as "machine learning" because it's too simple. Whatever your opinion, I believe its simplicity makes it a great stepping stone to more complex models and algorithms. Inspired by the classical pizza dish, we will explore the linear regression model with four different flavours of estimating its parameters. I hope this helps build our understanding of this basic model. The four different "seasons" we'll explore are:
- Simple linear regression
- Ordinary least squares regression (OLS)
- Gradient descent
- MCMC parameter estimation with Metropolis-Hastings
We'll start off by describing the problem of linear regression and the toy data we'll use, followed by implementations of all four methods.
# Imports
%matplotlib inline
%config InlineBackend.figure_formats = ['svg']
import warnings
import numpy as np
from numpy import log, exp, sqrt, pi, e
import matplotlib
import matplotlib.pyplot as plt
from matplotlib import cm # Colormaps
import seaborn as sns
sns.set_style('darkgrid')
warnings.simplefilter(action='ignore', category=FutureWarning)
#
Linear regression problem description
Simple linear regression will try to fit a linear relationship (straight line) between an independent variable $x$ and a dependent variable $y$. For each sample $i$ the model is defined as:

$$y_i = \theta_0 + \theta_1 x_i + \epsilon_i$$

With:
- $x_i$ the independent (input) variable of sample $i$, with $i = 1 \ldots n$.
- $y_i$ the dependent (output) variable of sample $i$, with $i = 1 \ldots n$.
- $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$ Gaussian noise affecting the output $y_i$.
- $\theta = (\theta_0, \theta_1)$ the set of the parameters we want to optimize:
  - $\theta_0$ the intercept of the model (bias)
  - $\theta_1$ the coefficient of variable $x$ (slope)

We can use this description to define our toy data sample:
# Define the data
np.random.seed(4)
# Generate random data
n = 50 # Number of samples
# Underlying linear relation
m = 3 # slope
b = 2 # bias
# Noise
e_std = 0.5 # Standard deviation of the noise
err = e_std * np.random.randn(n) # Noise
# Features and output
x = np.random.uniform(0, 1, n) # Independent variable x
y = x * m + b + err # Dependent variable
# Stack X with ones to be fitted by vectorized methods
# such as OLS and gradient descent
X = np.vstack((np.ones_like(x), x)).T
# Show data
plt.figure(figsize=(6, 4))
plt.scatter(x, y, label='data: $(x,y)$')
plt.plot(
[0, 1], [b, m+b], 'b-',
label=f'$y = {b:.0f} + {m:.0f} x$')
plt.xlim((0, 1))
plt.xlabel('$x$')
plt.ylabel('$y$')
plt.title('Noisy data samples from the underlying linear relation')
plt.legend()
plt.show()
#
Probabilistic description
An alternative description is the probabilistic description of the linear regression problem, where we treat the dependent variable $y$ as a random variable distributed according to a normal distribution centered around the linear relationship:

$$y \sim \mathcal{N}(\theta_0 + \theta_1 x,\, \sigma^2)$$

Note that the mean of this distribution is the linear model $\theta_0 + \theta_1 x$, while the variance $\sigma^2$ corresponds to the Gaussian noise $\epsilon$.

Given the observed data $(x, y)$ we can use Bayes' rule to write down the posterior distribution of the parameters $\theta$:

$$p(\theta \mid y, x) = \frac{p(y \mid x, \theta)\, p(\theta)}{p(y \mid x)}$$

With:
- $p(\theta \mid y, x)$ the posterior density, which is our belief of the parameters $\theta$ after taking the observed data into account.
- $p(y \mid x, \theta)$ the likelihood function, which is the probability density of the normal distribution described above.
- $p(\theta)$ the prior density of $\theta$, which captures our belief about the parameters $\theta$ before we observed any data. Note that we assume that $\theta$ is independent of $x$ and that $p(\theta \mid x) = p(\theta)$.
- $p(y \mid x)$ the marginal likelihood of the data, where $\theta$ has been marginalized out according to $p(y \mid x) = \int p(y \mid x, \theta)\, p(\theta)\, d\theta$.

Note that the error $\epsilon$ does not appear explicitly here; it is captured by the variance $\sigma^2$ of the likelihood.
Maximum a posteriori estimation (MAP)
We can find a point estimate $\hat{\theta}$ of the parameters by selecting the parameters that maximize the posterior density:

$$\hat{\theta} = \underset{\theta}{\text{argmax}}\; p(\theta \mid y, x)$$

This can be simplified since the marginal likelihood $p(y \mid x)$ does not depend on $\theta$ and can thus be ignored in the maximization:

$$\hat{\theta} = \underset{\theta}{\text{argmax}}\; p(y \mid x, \theta)\, p(\theta)$$

This is also known as the maximum a posteriori (MAP) estimate. In practice most methods to fit the parameters of this model will return a point estimate like this rather than the full posterior.
Maximum likelihood estimation (MLE)
Deviating from the Bayesian perspective, some of the models will ignore the prior $p(\theta)$, which is the same as assuming a flat (uniform) prior that does not favour any parameter values. The resulting optimization is what is known as maximum likelihood estimation (MLE), which will focus only on the likelihood function:

$$\hat{\theta} = \underset{\theta}{\text{argmax}}\; p(y \mid x, \theta)$$

In our regression example, if we assume that all samples $(x_i, y_i)$ are independent and identically distributed, the likelihood function can be written as the product of the densities of the individual samples:

$$p(y \mid x, \theta) = \prod_{i=1}^{n} \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\!\left( -\frac{\left(y_i - (\theta_0 + \theta_1 x_i)\right)^2}{2 \sigma^2} \right)$$
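As a quick sanity check we can evaluate this Gaussian log-likelihood numerically. Below is a minimal sketch (the helper function is just for illustration) that uses the toy data and the true parameters defined above; the true parameters should score higher than an arbitrary guess.

# Sketch: evaluate the Gaussian log-likelihood for given parameters
def gaussian_log_likelihood(y, x, theta_0, theta_1, sigma):
    """Sum of Gaussian log-densities of the residuals."""
    residuals = y - (theta_0 + theta_1 * x)
    return np.sum(
        -0.5 * log(2 * pi * sigma**2) - residuals**2 / (2 * sigma**2))

# The true parameters should give a higher log-likelihood
print(gaussian_log_likelihood(y, x, b, m, e_std))  # true intercept, slope, noise std
print(gaussian_log_likelihood(y, x, 0., 1., 1.))   # arbitrary guess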
In what follows we will describe 4 different algorithms that can be used to fit a linear regression model.
1. Simple linear regression
Given the density function above we can find the estimated parameters $\hat{\theta}$ by maximizing the likelihood function $p(y \mid x, \theta)$.
Because the exponents of the density function are difficult to work with and can be numerically unstable, we will transform the function with a logarithmic function, which is strictly increasing and will thus not affect the maxima of the original likelihood function. We call the resulting function the log-likelihood:

$$\log p(y \mid x, \theta) = \sum_{i=1}^{n} \left[ -\log\!\left(\sqrt{2 \pi \sigma^2}\right) - \frac{\left(y_i - (\theta_0 + \theta_1 x_i)\right)^2}{2 \sigma^2} \right]$$

We can then find the maximum of this function by finding where its derivative with respect to the parameters becomes 0. If we solve this for our parameters we find that:

$$\hat{\theta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad \hat{\theta}_0 = \bar{y} - \hat{\theta}_1 \bar{x}$$

with $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ and $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$ the sample means of $x$ and $y$.
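For completeness, here is a brief sketch of how these estimators follow: the terms with $\sigma$ do not depend on $\theta$, so maximizing the log-likelihood reduces to minimizing the sum of squared residuals. Setting its partial derivatives to zero gives:

$$\frac{\partial}{\partial \theta_0} \sum_{i=1}^{n} \left(y_i - \theta_0 - \theta_1 x_i\right)^2 = -2 \sum_{i=1}^{n} \left(y_i - \theta_0 - \theta_1 x_i\right) = 0 \;\Rightarrow\; \hat{\theta}_0 = \bar{y} - \hat{\theta}_1 \bar{x}$$

$$\frac{\partial}{\partial \theta_1} \sum_{i=1}^{n} \left(y_i - \theta_0 - \theta_1 x_i\right)^2 = -2 \sum_{i=1}^{n} x_i \left(y_i - \theta_0 - \theta_1 x_i\right) = 0$$

Substituting $\hat{\theta}_0 = \bar{y} - \hat{\theta}_1 \bar{x}$ into the second equation, and using that $\sum_i \bar{x}(y_i - \bar{y}) = 0$ and $\sum_i \bar{x}(x_i - \bar{x}) = 0$, recovers the expression for $\hat{\theta}_1$ above.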
# Simple linear regression
# Compute mean
x_mean = np.mean(x)
y_mean = np.mean(y)
# Compute parameters according to simple linear regression
theta_1 = (
(x-x_mean) * (y-y_mean)).sum() / ((x-x_mean)**2).sum()
theta_0 = y_mean - theta_1 * x_mean
# Function representing fitted line
f = lambda x: theta_0 + theta_1 * x
# Show simple linear regression fitted line
plt.figure(figsize=(6, 4))
plt.plot(
[0, 1], [f(0), f(1)], 'r-',
label=f'$y = {theta_0:.2f} + {theta_1:.2f} x$')
plt.scatter(x, y, label='data $(x,y)$')
plt.legend()
plt.title('Simple linear regression fit')
plt.xlabel('$x$')
plt.ylabel('$y$')
plt.xlim((0, 1))
plt.show()
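As an optional cross-check, the same closed-form fit is also available from scipy.stats.linregress. A minimal sketch, assuming SciPy is installed (it is not used elsewhere in this notebook); the slope and intercept should match theta_1 and theta_0 above:

# Optional cross-check with SciPy
from scipy.stats import linregress

result = linregress(x, y)
print(result.intercept, result.slope)  # should match theta_0, theta_1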
#
2. Ordinary least squares regression
Note that in the simple linear regression formulation above we end up maximising the log-likelihood, which contains the term $-\left(y_i - (\theta_0 + \theta_1 x_i)\right)^2$; maximizing this is the same as minimizing the sum of squared residuals. If we collect the inputs in a matrix $X$ (with a column of ones for the bias, as in the code above) and write the parameters as a vector $\Theta = (\theta_0, \theta_1)^T$, our optimization problem then becomes:

$$\hat{\Theta} = \underset{\Theta}{\text{argmin}} \left\Vert Y - X\Theta \right\Vert^2$$

This is what is typically known as ordinary least squares (OLS) since we are finding the solution with the least deviance in the square of the residuals (error), which is also known as least squares. Note that in our example minimizing the least squares is the same as maximizing the likelihood (MLE).

We can solve this by setting the Jacobian (derivative) of $\left\Vert Y - X\Theta \right\Vert^2$ with respect to $\Theta$ to zero:

$$\frac{\partial \left\Vert Y - X\Theta \right\Vert^2}{\partial \Theta} = -2 X^T (Y - X\Theta) = 0$$

When we drop the $-2$ and solve for $\Theta$ we get the normal equations and their closed-form solution:

$$X^T X \hat{\Theta} = X^T Y \quad\Longrightarrow\quad \hat{\Theta} = (X^T X)^{-1} X^T Y$$

With $(X^T X)^{-1} X^T$ the Moore-Penrose pseudo-inverse of $X$ (when $X$ has full column rank). The OLS method leads to more advanced regression methods such as polynomial regression. The OLS fitting example on our linear toy data is implemented below:
# Ordinary least squares
# Fit parameters with OLS
Theta = np.linalg.inv(X.T @ X) @ X.T @ y
# Function representing fitted line
f = lambda x: Theta[0] + Theta[1] * x
# Show OLS fitted line
plt.figure(figsize=(6, 4))
plt.plot(
[0, 1], [f(0), f(1)], 'r-',
label=f'$y = {Theta[0]:.2f} + {Theta[1]:.2f} x$')
plt.scatter(x, y, label='data $(x,y)$')
plt.legend()
plt.title('Ordinary least squares regression fit')
plt.xlabel('$x$')
plt.ylabel('$y$')
plt.xlim((0, 1))
plt.show()
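As a side note, explicitly inverting $X^T X$ works fine for this tiny example, but np.linalg.lstsq solves the same least-squares problem without forming the inverse, which is more numerically stable. A quick sketch of this cross-check; it should return the same parameters:

# Cross-check: solve the same least-squares problem without an explicit inverse
Theta_lstsq, residuals, rank, sv = np.linalg.lstsq(X, y, rcond=None)
print(Theta_lstsq)  # should match Theta from the normal equations above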
#
3. Gradient descent optimization
OLS assumes a unique solution to the least squares minimisation problem and will not work for more complex non-linear regression methods such as neural networks, where our optimization surface is not convex anymore. In this case we can resort to (stochastic) gradient descent to optimize the loss function.
Next we'll show how to optimize our parameters with regular gradient descent, which can be made stochastic by breaking the data down into random batches of size $b < n$ (a mini-batch variant is sketched after the full-batch implementation below).

We're optimizing the same loss function as in OLS, which means we're maximizing the log-likelihood:

$$L(\Theta) = \left\Vert Y - X\Theta \right\Vert^2$$

Now, because we want to minimize this loss function, we can just take small steps towards the minimum of this function, which lies in the opposite direction of the gradient. We saw that the gradient (Jacobian, or derivative) of this loss function was:

$$\frac{\partial L(\Theta)}{\partial \Theta} = -2 X^T (Y - X\Theta)$$

This means that if we start out with random parameters $\Theta(0)$ we can iteratively update the parameters by taking small steps in the opposite direction of this gradient:

$$\Theta(k+1) = \Theta(k) - \eta\, \frac{\partial L(\Theta(k))}{\partial \Theta}$$

The learning rate $\eta$ controls the size of each step: if it is too large the updates may overshoot the minimum and diverge, if it is too small convergence will be slow.
Note that in this case the gradient descent optimizer acts as a maximum likelihood estimator (MLE). Gradient descent and its variants can be used to optimize complex neural network functions [1, 2]. A simple implementation of gradient descent is provided below.
# Gradient descent update
def update(Theta, X, y, learning_rate):
    """Update the parameters with one iteration."""
    grad = - 2 * X.T @ (y - (X @ Theta))  # Compute gradient of the squared loss
    Theta -= grad * learning_rate  # Update parameters
    return Theta
# Implementation of gradient descent
# Random initial parameters
np.random.seed(13)
Theta = np.random.randn(2)
learning_rate = 0.01 # Learning rate
losses = [] # Will hold the loss (cost) per iteration
params = [] # Will hold the Theta params per iteration
# Run gradient descent for a number of steps
for i in range(50):
    Theta = update(Theta, X, y, learning_rate)
    # Store iterations for visualisation
    losses.append(np.mean((y-(X@Theta))**2))
    params.append(Theta.copy())
# Function representing fitted line
f = lambda x: Theta[0] + Theta[1] * x
# Show fitted line
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9.5, 4))
ax1.plot(
[0, 1], [f(0), f(1)], 'r-',
label=f'$y = {Theta[0]:.2f} + {Theta[1]:.2f} x$')
ax1.scatter(x, y, label='data $(x, y)$')
ax1.legend()
ax1.set_title('Gradient descent fit')
ax1.set_xlabel('$x$')
ax1.set_ylabel('$y$')
ax1.set_xlim((0, 1))
# Show loss surface
nb_of_ws = 50 # compute the loss nb_of_ws times in each dimension
wsa = np.linspace(0, 4, num=nb_of_ws) # theta_0
wsb = np.linspace(1, 5, num=nb_of_ws) # theta_1
ws_x, ws_y = np.meshgrid(wsa, wsb) # generate grid
loss_ws = np.zeros((nb_of_ws, nb_of_ws)) # initialize loss matrix
# Fill the loss matrix for each combination of weights
for i in range(nb_of_ws):
    for j in range(nb_of_ws):
        theta = np.asarray([ws_x[i,j], ws_y[i,j]])
        loss_ws[i,j] = np.mean((y - (X @ theta))**2)
# Plot the loss function surface
cont = ax2.contourf(ws_x, ws_y, loss_ws, 20, cmap=cm.viridis)
# Plot the parameter updates
params_arr = np.vstack(params)
# Plot the parameter values visited during the updates
ax2.scatter(params_arr[:,0], params_arr[:,1], color='w', s=2)
ax2.plot(params_arr[:,0], params_arr[:,1], 'w-', label='gradient descent steps')
ax2.scatter(
[Theta[0]], [Theta[1]], label='Final parameter $\\hat{\\Theta}$',
color='w', marker='*', s=150, edgecolors='k')
cbar = plt.colorbar(cont)
cbar.ax.set_ylabel(
'$\\left\\Vert Y-X\\Theta \\right\\Vert^2$', fontsize=12)
ax2.set_title((
'$\\Theta(k)$ updates on loss surface '
'$\\left\\Vert Y-X\\Theta \\right\\Vert^2$'))
ax2.set_xlabel('$\\Theta_0$')
ax2.set_ylabel('$\\Theta_1$')
ax2.legend()
plt.show()
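As mentioned above, the same update can be made stochastic by feeding it random mini-batches instead of the full data. A minimal sketch of that variant, reusing the update function defined above (the batch size, seed, and number of iterations are arbitrary choices):

# Sketch: stochastic (mini-batch) gradient descent reusing `update`
np.random.seed(42)
Theta_sgd = np.random.randn(2)
batch_size = 10  # arbitrary mini-batch size
for _ in range(200):
    idx = np.random.choice(n, size=batch_size, replace=False)  # random batch
    Theta_sgd = update(Theta_sgd, X[idx], y[idx], learning_rate)
print(Theta_sgd)  # should end up close to the full-batch / OLS solution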
#
4. Markov Chain Monte Carlo parameter estimation
An alternative to optimizing the likelihood function is to approximate the full posterior by sampling, based on Markov Chain Monte Carlo (MCMC) methods. One such sampling algorithm is the Metropolis-Hastings algorithm, which allows us to sample from the posterior $p(\theta \mid y, x)$ even when we only know it up to a normalizing constant.

Remember that the posterior density of $\theta$ is proportional to the product of the likelihood and the prior:

$$p(\theta \mid y, x) \propto p(y \mid x, \theta)\, p(\theta)$$
Sampling with Metropolis-Hastings
The Metropolis-Hastings algorithm defines a way to create a Markov chain which has as stationary distribution the posterior $p(\theta \mid y, x)$. This means that, after convergence, samples drawn from this chain behave like samples from the posterior.

Given an initial guess for the parameters $\theta(0)$, each sampling step $k$ works as follows:
- Propose a new parameter value $\theta'$ by adding random noise to the current parameters: $\theta' = \theta(k) + \Delta\theta$, with $\Delta\theta$ drawn from a small Gaussian.
- Find out how much more likely the new proposed parameter $\theta'$ is with respect to the previous parameter $\theta(k)$ by calculating the ratio:
$$\rho = \frac{p(y \mid x, \theta')\, p(\theta')}{p(y \mid x, \theta(k))\, p(\theta(k))}$$
- Then accept the proposal $\theta'$ as the new parameter $\theta(k+1)$ if the ratio $\rho \geq 1$, or if $\rho < 1$ accept $\theta'$ as the new parameter with probability $\rho$. If not accepted keep $\theta(k+1) = \theta(k)$.
- To keep sampling, repeat from step 1.

For simplicity we will be ignoring the prior $p(\theta)$ in the implementation below, which is the same as assuming a flat prior, so the ratio reduces to a ratio of likelihoods. If we want to add a specific prior this should be relatively simple: we would just multiply in an extra term for the prior when calculating the ratio $\rho$.

To avoid numerical underflow when probabilities are small we will be performing the computations in log space, where the multiplications become additions.
Parameter estimation
After sampling we can find a point estimate of the parameters by Bayesian estimation: taking the expected value (mean) of the sampled parameters. In practice we only compute the mean over the last samples (here the last 2,000 of 3,000), because MCMC needs the first steps, the so-called burn-in period, to converge towards the stationary distribution. A simple implementation on our toy data is provided below.
MCMC might be overkill for our simple toy problem, however, it can be used to approximate the posterior of complex distributions such as multilevel models. Powerful software exists to sample from these complex models more efficiently [3].
def metropolis_hastings_step(y, x, theta_0, theta_1, sigma):
    """Single step of the Metropolis-Hastings algorithm."""
    # Run until a new proposed sample has been accepted
    while True:
        # Propose new parameters by adding Gaussian noise (std=0.1)
        theta_0_p = theta_0 + np.random.randn(1) * 0.1
        theta_1_p = theta_1 + np.random.randn(1) * 0.1
        sigma_p = sigma + np.random.randn(1) * 0.1
        # Log of the likelihood ratio between proposal and current parameters
        log_ratio = (
            log_likelihood(y, x, theta_0_p, theta_1_p, sigma_p) -
            log_likelihood(y, x, theta_0, theta_1, sigma))
        if log_ratio >= 0:
            # Ratio >= 1: accept proposal
            return theta_0_p, theta_1_p, sigma_p
        elif log(np.random.uniform()) < log_ratio:
            # Ratio < 1: accept with probability == ratio
            return theta_0_p, theta_1_p, sigma_p
def log_likelihood(y, x, theta_0, theta_1, sigma):
    """Log-likelihood function."""
    sigma_sq = sigma**2
    # Sum of squared errors
    sse = sum((y - (theta_0+theta_1*x)) ** 2)
    return - len(y) * log(sqrt(2*pi*sigma_sq)) \
        - (sse/(2*sigma_sq))
# Implementation of Metropolis-Hastings MCMC
np.random.seed(1)
# Initial parameters
theta_0 = 1.
theta_1 = 1.
sigma = 1.
# Run Metropolis-Hastings MCMC
samples = [] # List of all samples
for _ in range(3000):
    # Generate new parameter proposals
    theta_0, theta_1, sigma = metropolis_hastings_step(
        y, x, theta_0, theta_1, sigma)
    samples.append((theta_0, theta_1, sigma))
th_0_samples, th_1_samples, sigma_samples = zip(*samples)
# Compute the expected value of the last samples
# The first 1000 samples are ignored because the MCMC has not
# yet converged
theta_0 = np.mean(th_0_samples[1000:])
theta_1 = np.mean(th_1_samples[1000:])
# Function representing fitted line
f = lambda x: theta_0 + theta_1 * x
# Show fitted line
%config InlineBackend.figure_formats = ['retina']
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9,4), dpi=100)
ax1.plot(
[0, 1], [f(0), f(1)], 'r-',
label=f'$y = {theta_0:.2f} + {theta_1:.2f} x$')
ax1.scatter(x, y, label='data $(x,y)$')
ax1.legend()
ax1.set_title('MCMC mean fit')
ax1.set_xlabel('x')
ax1.set_ylabel('y')
ax1.set_xlim((0, 1))
ax2.scatter(
th_0_samples[1000:], th_1_samples[1000:],
alpha=0.3, s=5, label='MCMC samples')
ax2.scatter(
[theta_0], [theta_1], label='Expected value $\\hat{\\theta}$',
s=150, marker='*', edgecolors='k')
ax2.set_title('Samples from $p(\\theta_0, \\theta_1 \\mid x,y)$')
ax2.set_xlabel('$\\theta_0$')
ax2.set_ylabel('$\\theta_1$')
ax2.legend()
plt.show()
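As a small extra check, the noise level can be estimated from the sigma samples as well. A minimal sketch: since the likelihood above only depends on $\sigma^2$, the sign of sigma is not identified, so we take the absolute value before averaging.

# Estimate the noise standard deviation from the MCMC samples
sigma_hat = np.mean(np.abs(np.array(sigma_samples[1000:])))
print(sigma_hat)  # should be close to the true noise std e_std = 0.5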
#
Please note that the Metropolis-Hastings sampling algorithm used here can be quite inefficient: a lot of proposed samples will be rejected to get to our desired sample size. More efficient sampling methods such as Hamiltonian Monte Carlo exist and are readily available in tools such as Stan, PyMC3, and NumPyro.
Further readings
- Gradient descent optimization in neural networks.
- Popular tools that can automatically give you the gradients to use in your optimization are TensorFlow, PyTorch, and JAX.
- Popular probabilistic programming tools supporting powerful sampling methods are Stan, PyMC3, and NumPyro.
# Python package versions used
%load_ext watermark
%watermark --python
%watermark --iversions
#
This post at peterroelants.github.io is generated from an IPython notebook file. Link to the full IPython notebook file