A Bayesian Approach to \(L^1\) and \(L^2\) Regularization in Machine Learning

Colin Carroll

Spiceworks

Linear Regression

Setup


$$y(\mathbf{x}, \mathbf{w}) = w_0 + w_1 x_1 + \cdots + w_D x_D$$

Advanced Setup

$$y(\mathbf{x}, \mathbf{w}) = w_0 + \sum_{j=1}^n w_j \phi_j(\mathbf{x})$$
$$ y = \phi(\mathbf{x}) \cdot \mathbf{w}$$
$$ \phi(x) = x^2 $$
$$ \phi(x) = \frac{1}{1 + e^{-5x}} $$
$$ \phi(x) = e^{-\frac{x^2}{0.5^2}} $$
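As a minimal sketch, the three example basis functions above in numpy (the function names are mine, and the Gaussian bump assumes the negative exponent):

        import numpy as np

        # Sketch of the three example basis functions; names are illustrative.
        def quadratic_basis(x):
            return x ** 2                           # phi(x) = x^2

        def logistic_basis(x):
            return 1.0 / (1.0 + np.exp(-5.0 * x))   # phi(x) = 1 / (1 + e^{-5x})

        def gaussian_basis(x):
            return np.exp(-x ** 2 / 0.5 ** 2)       # phi(x) = e^{-x^2 / 0.5^2}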

Loss Function

Given data \(\{(y_1, \mathbf{x}_1),\ldots,(y_N, \mathbf{x}_N)\}\), find weights \(\mathbf{w} = (w_1, \ldots, w_n)\) to minimize $$ loss(\mathbf{w}) = \sum_{j=1}^N (y_j - \mathbf{w} \cdot \phi(\mathbf{x}_j))^2 $$
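The same loss as a numpy sketch, assuming Phi is the \(N \times n\) design matrix whose \(j\)-th row is \(\phi(\mathbf{x}_j)\) and y is the vector of targets:

        import numpy as np

        # Sum-of-squares loss; Phi and y are assumed to be the design matrix
        # and target vector described above.
        def sum_of_squares_loss(w, Phi, y):
            residuals = y - Phi @ w
            return np.sum(residuals ** 2)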

Linear algebra to the rescue

$$ X = (\phi(\mathbf{x}_1), \ldots, \phi(\mathbf{x}_N))^T, \quad \mathbf{y} = (y_1, \ldots, y_N)^T $$ Then $$\mathbf{w} = (X^TX)^{-1}X^T\mathbf{y}$$ minimizes $$\|X\mathbf{w} - \mathbf{y}\|_2^2$$
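A small numpy sketch of the closed form on made-up data (not data from the talk), using a linear solve rather than an explicit matrix inverse:

        import numpy as np

        # Closed-form least squares on a toy design matrix (made-up data).
        rng = np.random.default_rng(0)
        Phi = rng.normal(size=(50, 3))                 # N = 50 rows of basis values
        w_true = np.array([1.0, -2.0, 0.5])
        y = Phi @ w_true + 0.1 * rng.normal(size=50)

        # w = (Phi^T Phi)^{-1} Phi^T y, computed via a linear solve
        w_hat = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)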

Calculus to the rescue

$$ \nabla_{\mathbf{w}}loss = 2\sum_{j=1}^N (\mathbf{w} \cdot \phi(\mathbf{x}_j) - y_j)\, \phi(\mathbf{x}_j) $$ So iterate the following step, with learning rate \(\alpha\), until \(\mathbf{w}\) converges: $$\mathbf{w} = \mathbf{w} - \alpha \nabla_{\mathbf{w}}loss$$
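The same fit by gradient descent, as a sketch (the learning rate and step count are arbitrary and would need tuning):

        import numpy as np

        # Batch gradient descent on the sum-of-squares loss (sketch).
        def fit_gradient_descent(Phi, y, alpha=1e-3, n_steps=1000):
            w = np.zeros(Phi.shape[1])
            for _ in range(n_steps):
                grad = 2 * Phi.T @ (Phi @ w - y)   # gradient of the loss above
                w = w - alpha * grad
            return w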

A Naive Approach

[Interactive demo: choose a polynomial degree and fit to the training data; the mean squared training error is displayed.]

A Less Naive Approach

[Interactive demo: choose a polynomial degree; the mean squared error is displayed on both the training and held-out test data.]

A Typical Approach



        # Grid search over penalty type and regularization strength (pseudocode)
        errors = []
        for penalty in (None, 'l1', 'l2'):
            for constant in (0.001, 0.03, 0.1, 0.3, 1):
                model = linear_regression.fit(training_data, penalty, constant)
                # score on held-out data with the sum of squared residuals
                errors.append(
                    sum(
                        (model.predict(testing_data.features) - testing_data.labels) ** 2
                    )
                )

Motivating Loss Function


Assume each observation is the model's prediction plus Gaussian noise:
$$y(\mathbf{x}, \mathbf{w}) = \mathbf{w} \cdot \phi(\mathbf{x}) + \epsilon, \qquad \epsilon \sim \mathscr{N}(0, \sigma^2)$$
The likelihood is given by

$$p(y | \mathbf{x}, \mathbf{w}) \propto \exp{\left(-\frac{(y - \mathbf{w} \cdot \phi(\mathbf{x}))^2}{2 \sigma^2}\right)}$$
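As a sketch, the likelihood of a single observation can be evaluated with scipy's normal density (the numbers below are made up for illustration):

        import numpy as np
        from scipy.stats import norm

        # Likelihood of one observation under the Gaussian noise model.
        w = np.array([0.5, -1.0])        # illustrative weights
        phi_x = np.array([1.0, 2.0])     # illustrative basis-function values
        sigma = 0.3
        y_obs = -1.4

        likelihood = norm.pdf(y_obs, loc=w @ phi_x, scale=sigma)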

Normal Distribution

Why the normal distribution?

The central limit theorem: noise built up from many small, independent effects is approximately normal. The normal distribution is also analytically tractable, and it leads directly to the least squares loss.
[Interactive demo: training points drawn from a normal distribution.]

Motivating Loss Function

$$\mathscr{D} = \{(y_1, \mathbf{x}_1), \ldots, (y_N, \mathbf{x}_N)\}$$
$$ \begin{align} p(\mathscr{D} | \mathbf{w}) &= \prod_{j=1}^N p(y_j | \mathbf{x}_j, \mathbf{w}) \\ & \propto \exp{\left(-\sum_{j=1}^N \frac{(y_j - \mathbf{w} \cdot \phi(\mathbf{x}_j))^2}{2 \sigma^2}\right)} \end{align} $$

Motivating Loss Function

Maximizing $$\exp{\left(-\sum_{j=1}^N \frac{(y_j - \mathbf{w} \cdot \phi(\mathbf{x}_j))^2}{2 \sigma^2}\right)}$$ is equivalent to minimizing $$ \sum_{j=1}^N (y_j - \mathbf{w} \cdot \phi(\mathbf{x}_j))^2, $$ since the exponential is monotone and the positive constant \(1/(2\sigma^2)\) does not change the minimizer.

More priors

What if we had some prior expectations about the weights of the model?

Laplace Distribution

More priors

Using Bayes' theorem, $$ p(\mathbf{w} | \mathscr{D}, \tau) \propto p(\mathscr{D} | \mathbf{w}) p(\mathbf{w}| \tau) $$
$$ \begin{align} -\log{\left(p(\mathbf{w} | \mathscr{D}, \tau)\right)} &= -\log{\left(p(\mathscr{D} | \mathbf{w})\right)} - \log{\left(p(\mathbf{w} | \tau)\right)} + \text{const} \\ &= \frac{\sum_j (y_j - \mathbf{w} \cdot \phi(\mathbf{x}_j))^2}{2 \sigma^2} - \log{\left(p(\mathbf{w} | \tau)\right)} + \text{const} \end{align} $$
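In code, the negative log posterior is just the squared-error term plus whatever penalty the prior contributes; a sketch, where log_prior is a placeholder for the priors on the next slides:

        import numpy as np

        # Negative log posterior, up to an additive constant.
        def neg_log_posterior(w, Phi, y, sigma, log_prior):
            sse = np.sum((y - Phi @ w) ** 2)
            return sse / (2 * sigma ** 2) - log_prior(w)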

Ridge Regression

$$ p(\mathbf{w} | \tau) \propto \exp{\left(-\frac{\|\mathbf{w}\|_2^2}{2\tau^2}\right)} $$ $$ loss(\mathbf{w}) = \|\mathbf{y} - X\mathbf{w}\|_2^2 + \left(\frac{\sigma}{\tau}\right)^2\|\mathbf{w}\|_2^2 $$
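Setting the gradient of this loss to zero gives the usual ridge closed form; a numpy sketch, with \(\sigma\) and \(\tau\) as illustrative defaults:

        import numpy as np

        # Ridge regression via its closed form, with penalty lambda = (sigma/tau)^2.
        def fit_ridge(Phi, y, sigma=1.0, tau=1.0):
            lam = (sigma / tau) ** 2
            n = Phi.shape[1]
            return np.linalg.solve(Phi.T @ Phi + lam * np.eye(n), Phi.T @ y)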

Lasso Regression

$$ p(\mathbf{w} | \tau) \propto \exp{\left(-\frac{\|\mathbf{w}\|_1}{\tau}\right)} $$ $$ loss(\mathbf{w}) = \|\mathbf{y} - X\mathbf{w}\|_2^2 + \frac{2\sigma^2}{\tau}\|\mathbf{w}\|_1 $$
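The lasso loss has no closed form, but proximal gradient descent (ISTA) minimizes it directly; a sketch, where the step size and iteration count are arbitrary choices:

        import numpy as np

        # Lasso via proximal gradient descent (ISTA) on the loss above,
        # with penalty lambda = 2 * sigma^2 / tau.
        def soft_threshold(v, t):
            return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

        def fit_lasso(Phi, y, sigma=1.0, tau=1.0, n_steps=5000):
            lam = 2 * sigma ** 2 / tau
            step = 1.0 / (2 * np.linalg.norm(Phi, 2) ** 2)  # 1 / Lipschitz constant
            w = np.zeros(Phi.shape[1])
            for _ in range(n_steps):
                grad = 2 * Phi.T @ (Phi @ w - y)            # gradient of squared-error term
                w = soft_threshold(w - step * grad, step * lam)
            return w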