A Bayesian Approach to \(L^1\) and \(L^2\) Regularization in Machine Learning

Colin Carroll

Spiceworks

Linear Regression

Setup


$$y(\mathbf{x}, \mathbf{w}) = w_0 + w_1 x_1 + \cdots + w_D x_D$$

Advanced Setup

$$y(\mathbf{x}, \mathbf{w}) = w_0 + \sum_{j=1}^n w_j \phi_j(\mathbf{x})$$
$$ y = \phi(\mathbf{x}) \cdot \mathbf{w}$$
$$ \phi(x) = x^2 $$
$$ \phi(x) = \frac{1}{1 + e^{-5x}} $$
$$ \phi(x) = e^{-\frac{x^2}{0.5^2}} $$
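As a minimal sketch, the three example basis functions above in numpy (the function names are mine, and the Gaussian bump assumes the negative exponent):

        import numpy as np

        # Sketch of the three example basis functions; names are illustrative.
        def quadratic_basis(x):
            return x ** 2                           # phi(x) = x^2

        def logistic_basis(x):
            return 1.0 / (1.0 + np.exp(-5.0 * x))   # phi(x) = 1 / (1 + e^{-5x})

        def gaussian_basis(x):
            return np.exp(-x ** 2 / 0.5 ** 2)       # phi(x) = e^{-x^2 / 0.5^2}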

Loss Function

Given data \(\{(y_1, \mathbf{x}_1),\ldots,(y_N, \mathbf{x}_N)\}\), find weights \(\mathbf{w} = (w_1, \ldots, w_n)\) to minimize $$ loss(\mathbf{w}) = \sum_{j=1}^N (y_j - \mathbf{w} \cdot \phi(\mathbf{x}_j))^2 $$
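The same loss as a numpy sketch, assuming Phi is the \(N \times n\) design matrix whose \(j\)-th row is \(\phi(\mathbf{x}_j)\) and y is the vector of targets:

        import numpy as np

        # Sum-of-squares loss; Phi and y are assumed to be the design matrix
        # and target vector described above.
        def sum_of_squares_loss(w, Phi, y):
            residuals = y - Phi @ w
            return np.sum(residuals ** 2)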

Linear algebra to the rescue

$$ X = (\phi(\mathbf{x}_1), \ldots, \phi(\mathbf{x}_N))^T, \quad \mathbf{y} = (y_1, \ldots, y_N)^T $$ Then $$\mathbf{w} = (X^TX)^{-1}X^T\mathbf{y}$$ minimizes $$\|X\mathbf{w} - \mathbf{y}\|_2^2$$
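A small numpy sketch of the closed form on made-up data (not data from the talk), using a linear solve rather than an explicit matrix inverse:

        import numpy as np

        # Closed-form least squares on a toy design matrix (made-up data).
        rng = np.random.default_rng(0)
        Phi = rng.normal(size=(50, 3))                 # N = 50 rows of basis values
        w_true = np.array([1.0, -2.0, 0.5])
        y = Phi @ w_true + 0.1 * rng.normal(size=50)

        # w = (Phi^T Phi)^{-1} Phi^T y, computed via a linear solve
        w_hat = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)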

Calculus to the rescue

$$ \nabla_{\mathbf{w}}loss = 2\sum_{j=1}^N (\mathbf{w} \cdot \phi(\mathbf{x}_j) - y_j)\, \phi(\mathbf{x}_j) $$ So iterate the following step, with learning rate \(\alpha\), until \(\mathbf{w}\) converges: $$\mathbf{w} = \mathbf{w} - \alpha \nabla_{\mathbf{w}}loss$$
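The same fit by gradient descent, as a sketch (the learning rate and step count are arbitrary and would need tuning):

        import numpy as np

        # Batch gradient descent on the sum-of-squares loss (sketch).
        def fit_gradient_descent(Phi, y, alpha=1e-3, n_steps=1000):
            w = np.zeros(Phi.shape[1])
            for _ in range(n_steps):
                grad = 2 * Phi.T @ (Phi @ w - y)   # gradient of the loss above
                w = w - alpha * grad
            return w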

A Naive Approach

[Interactive demo: choose a polynomial degree and fit to the training data; the mean squared training error is displayed.]

A Less Naive Approach

[Interactive demo: choose a polynomial degree; the mean squared error is displayed on both the training and held-out test data.]

A Typical Approach



        # Grid search over penalty type and regularization strength (pseudocode)
        errors = []
        for penalty in (None, 'l1', 'l2'):
            for constant in (0.001, 0.03, 0.1, 0.3, 1):
                model = linear_regression.fit(training_data, penalty, constant)
                # score on held-out data with the sum of squared residuals
                errors.append(
                    sum(
                        (model.predict(testing_data.features) - testing_data.labels) ** 2
                    )
                )

Motivating Loss Function


Assume each observation is the model's prediction plus Gaussian noise:
$$y(\mathbf{x}, \mathbf{w}) = \mathbf{w} \cdot \phi(\mathbf{x}) + \epsilon, \qquad \epsilon \sim \mathscr{N}(0, \sigma^2)$$
The likelihood is given by

$$p(y | \mathbf{x}, \mathbf{w}) \propto \exp{\left(-\frac{(y - \mathbf{w} \cdot \phi(\mathbf{x}))^2}{2 \sigma^2}\right)}$$
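As a sketch, the likelihood of a single observation can be evaluated with scipy's normal density (the numbers below are made up for illustration):

        import numpy as np
        from scipy.stats import norm

        # Likelihood of one observation under the Gaussian noise model.
        w = np.array([0.5, -1.0])        # illustrative weights
        phi_x = np.array([1.0, 2.0])     # illustrative basis-function values
        sigma = 0.3
        y_obs = -1.4

        likelihood = norm.pdf(y_obs, loc=w @ phi_x, scale=sigma)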

Normal Distribution

Why the normal distribution?

The central limit theorem: noise built up from many small, independent effects is approximately normal. The normal distribution is also analytically tractable, and it leads directly to the least squares loss.
[Interactive demo: training points drawn from a normal distribution.]

Motivating Loss Function

$$\mathscr{D} = \{(y_1, \mathbf{x}_1), \ldots, (y_N, \mathbf{x}_N)\}$$
$$ \begin{align} p(\mathscr{D} | \mathbf{w}) &= \prod_{j=1}^N p(y_j | \mathbf{x}_j, \mathbf{w}) \\ & \propto \exp{\left(-\sum_{j=1}^N \frac{(y_j - \mathbf{w} \cdot \phi(\mathbf{x}_j))^2}{2 \sigma^2}\right)} \end{align} $$

Motivating Loss Function

Maximizing $$\exp{\left(-\sum_{j=1}^N \frac{(y_j - \mathbf{w} \cdot \phi(\mathbf{x}_j))^2}{2 \sigma^2}\right)}$$ is equivalent to minimizing $$ \sum_{j=1}^N (y_j - \mathbf{w} \cdot \phi(\mathbf{x}_j))^2, $$ since the exponential is monotone and the positive constant \(1/(2\sigma^2)\) does not change the minimizer.

More priors

What if we had some prior expectations about the weights of the model?

Laplace Distribution

More priors

Using Bayes' theorem, $$ p(\mathbf{w} | \mathscr{D}, \tau) \propto p(\mathscr{D} | \mathbf{w}) p(\mathbf{w}| \tau) $$
$$ \begin{align} -\log{\left(p(\mathbf{w} | \mathscr{D}, \tau)\right)} &= -\log{\left(p(\mathscr{D} | \mathbf{w})\right)} - \log{\left(p(\mathbf{w} | \tau)\right)} + \text{const} \\ &= \frac{\sum_j (y_j - \mathbf{w} \cdot \phi(\mathbf{x}_j))^2}{2 \sigma^2} - \log{\left(p(\mathbf{w} | \tau)\right)} + \text{const} \end{align} $$
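In code, the negative log posterior is just the squared-error term plus whatever penalty the prior contributes; a sketch, where log_prior is a placeholder for the priors on the next slides:

        import numpy as np

        # Negative log posterior, up to an additive constant.
        def neg_log_posterior(w, Phi, y, sigma, log_prior):
            sse = np.sum((y - Phi @ w) ** 2)
            return sse / (2 * sigma ** 2) - log_prior(w)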

Ridge Regression

$$ p(\mathbf{w} | \tau) \propto \exp{\left(-\frac{\|\mathbf{w}\|_2^2}{2\tau^2}\right)} $$ $$ loss(\mathbf{w}) = \|\mathbf{y} - X\mathbf{w}\|_2^2 + \left(\frac{\sigma}{\tau}\right)^2\|\mathbf{w}\|_2^2 $$
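Setting the gradient of this loss to zero gives the usual ridge closed form; a numpy sketch, with \(\sigma\) and \(\tau\) as illustrative defaults:

        import numpy as np

        # Ridge regression via its closed form, with penalty lambda = (sigma/tau)^2.
        def fit_ridge(Phi, y, sigma=1.0, tau=1.0):
            lam = (sigma / tau) ** 2
            n = Phi.shape[1]
            return np.linalg.solve(Phi.T @ Phi + lam * np.eye(n), Phi.T @ y)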

Lasso Regression

$$ p(\mathbf{w} | \tau) \propto \exp{\left(-\frac{\|\mathbf{w}\|_1}{\tau}\right)} $$ $$ loss(\mathbf{w}) = \|\mathbf{y} - X\mathbf{w}\|_2^2 + \frac{2\sigma^2}{\tau}\|\mathbf{w}\|_1 $$
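The lasso loss has no closed form, but proximal gradient descent (ISTA) minimizes it directly; a sketch, where the step size and iteration count are arbitrary choices:

        import numpy as np

        # Lasso via proximal gradient descent (ISTA) on the loss above,
        # with penalty lambda = 2 * sigma^2 / tau.
        def soft_threshold(v, t):
            return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

        def fit_lasso(Phi, y, sigma=1.0, tau=1.0, n_steps=5000):
            lam = 2 * sigma ** 2 / tau
            step = 1.0 / (2 * np.linalg.norm(Phi, 2) ** 2)  # 1 / Lipschitz constant
            w = np.zeros(Phi.shape[1])
            for _ in range(n_steps):
                grad = 2 * Phi.T @ (Phi @ w - y)            # gradient of squared-error term
                w = soft_threshold(w - step * grad, step * lam)
            return w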