Example

Linear Regression

Typical framing

  1. $y = X\beta + \epsilon$ -- Target = data * params + error

  2. $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$ -- Errors are independent Gaussian noise centered on 0

  3. $y_i \sim \mathcal{N}(\beta^T x_i, \sigma^2)$ -- Equivalently, each target is Gaussian around the prediction $\beta^T x_i$ (see the sketch after this list)
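
A minimal NumPy sketch of this framing -- the dimensions, noise level, seed, and `beta_true` are arbitrary values made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate y = X @ beta + eps with eps_i ~ N(0, sigma^2)
n, p, sigma = 200, 3, 0.5
X = rng.normal(size=(n, p))
beta_true = np.array([2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(scale=sigma, size=n)

# Ordinary least squares: beta_hat = (X^T X)^{-1} X^T y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)  # close to beta_true
```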

Regularization

  • Added because the maximum-likelihood estimate $\hat{\beta}$ tends to have high variance, so a small change in the training data results in a large change in $\hat{\beta}$ (bad)

  • $\text{LASSO: } \hat{\beta}_{L1} = \argmin_\beta[\lVert y-X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_1]$

  • $\text{RIDGE: } \hat{\beta}_{L2} = \argmin_\beta[\lVert y-X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_2^2]$ -- see the closed-form sketch after this list
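
Ridge has the closed form $\hat{\beta}_{L2} = (X^TX + \lambda I)^{-1}X^Ty$; LASSO has no closed form and needs an iterative solver (e.g. coordinate descent). A minimal sketch of the ridge solution, assuming `X` and `y` are already defined (for example from the simulation above) and an arbitrary $\lambda$:

```python
import numpy as np

def ridge(X: np.ndarray, y: np.ndarray, lam: float) -> np.ndarray:
    """Closed-form minimizer of ||y - X b||_2^2 + lam * ||b||_2^2."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Shrinkage grows with lambda:
# for lam in (0.0, 1.0, 10.0):
#     print(lam, ridge(X, y, lam))
```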

But where did regularization come from?

Bayesian Approach

Create a MAP estimate

$$ \hat{\beta}_{MAP} = \argmax_\beta[P(\beta|y)] = \argmax_\beta[\frac{P(y|\beta)P(\beta)}{P(y)}] = \argmax_\beta[P(y|\beta)P(\beta)] $$
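
The last step holds because $P(y)$ does not depend on $\beta$. Since $\log$ is monotonic, the same estimate is found by minimizing the negative log:

$$ \hat{\beta}_{MAP} = \argmin_\beta[-\log P(y|\beta) - \log P(\beta)] $$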

Rearrange to matrix form

  • $y_i \sim \mathcal{N}(\beta^T x_i, \sigma^2)$

  • $P(y|\beta) = \prod_{i=1}^N \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(y_i - \beta^T x_i)^2}{2\sigma^2}}$
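
Taking the negative log and dropping terms that do not depend on $\beta$ recovers the squared-error term in matrix form:

$$ -\log P(y|\beta) = \frac{1}{2\sigma^2}\sum_{i=1}^N (y_i - \beta^T x_i)^2 + \text{const} = \frac{1}{2\sigma^2}\lVert y-X\beta \rVert_2^2 + \text{const} $$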

Choose a Gaussian Prior -- Get L2 Norm
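
With the standard choice of an independent zero-mean Gaussian prior on each coefficient, $\beta_j \sim \mathcal{N}(0, \tau^2)$, the negative log prior contributes $\frac{1}{2\tau^2}\lVert \beta \rVert_2^2$ plus a constant, so

$$ -\log[P(y|\beta)P(\beta)] = \frac{1}{2\sigma^2}\lVert y-X\beta \rVert_2^2 + \frac{1}{2\tau^2}\lVert \beta \rVert_2^2 + \text{const} $$

Multiplying the objective by $2\sigma^2$ leaves the minimizer unchanged, so the MAP estimate is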

$\argmin_\beta[\lVert y-X\beta \rVert_2^2 + \frac{\sigma^2}{\tau^2} \lVert \beta \rVert_2^2]$

Substitute $\lambda = \frac{\sigma^2}{\tau^2}$:

$\argmin_\beta[\lVert y-X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_2^2] = \hat{\beta}_{L2}$
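
As a quick numerical check of this equivalence (a sketch only -- the data and $\lambda$ are arbitrary), minimizing the penalized objective directly recovers the closed-form ridge coefficients:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, 0.0, -2.0]) + rng.normal(scale=0.3, size=100)
lam = 5.0

# Closed-form ridge: (X^T X + lam I)^{-1} X^T y
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

# Direct minimization of ||y - X b||_2^2 + lam * ||b||_2^2
obj = lambda b: np.sum((y - X @ b) ** 2) + lam * np.sum(b ** 2)
beta_map = minimize(obj, x0=np.zeros(3)).x

print(np.allclose(beta_ridge, beta_map, atol=1e-4))  # True
```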

Choose a Laplacian Prior -- Get L1 Norm