Regularization#

Part 1#

Recap on Bias-Variance Trade-Off#

Bias - Variance Trade-Off#

\[MSE(\hat{y})=Bias^{2}+Variance+Noise\]

→ to minimize the cost, we need to find a good balance between the bias and variance terms of a model
→ we can influence bias and variance by changing the complexity of our model

Note: Noise is the irreducible error of a model. We cannot influence it.
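
A small simulation can make the decomposition concrete. The sketch below (assuming a hypothetical quadratic data-generating process and Gaussian noise, both chosen purely for illustration) repeatedly refits polynomial models on freshly drawn noisy samples and estimates the bias² and variance terms:

```python
import numpy as np

rng = np.random.default_rng(0)

def true_f(x):
    # hypothetical data-generating process (illustrative assumption)
    return 1.5 * x ** 2

x = np.linspace(-1, 1, 30)

def bias_variance(degree, n_runs=500):
    """Estimate bias^2 and variance of a polynomial fit of the given degree."""
    preds = np.empty((n_runs, x.size))
    for i in range(n_runs):
        # draw a fresh noisy training sample and refit
        y = true_f(x) + rng.normal(0, 0.1, x.size)
        coefs = np.polyfit(x, y, degree)
        preds[i] = np.polyval(coefs, x)
    bias2 = np.mean((preds.mean(axis=0) - true_f(x)) ** 2)
    variance = preds.var(axis=0).mean()
    return bias2, variance

for d in (1, 2, 9):
    b2, v = bias_variance(d)
    print(f"degree {d}: bias^2 = {b2:.4f}, variance = {v:.4f}")
```

The too-simple degree-1 model shows high bias, while the degree-9 model shows low bias but higher variance than the degree-2 model.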

Example: Underfitting vs. Overfitting#

  • We have data, but we don’t know the underlying Data Generating Process.

  • So we want to model it.

  • Do you see a pattern/trend?

../_images/cca5dea07668ab238958fdd753c105858b28c4a7cc2af8e0bc9ba6960306b10c.png

Example: Underfitting vs. Overfitting#

  • Which model seems best?

  • Which model seems to underfit the data?

  • Which model might overfit the data?

  • How to evaluate if a model is underfitting/overfitting?

../_images/58c74ee70b12c0100c49fad486cf3d914ac84491af6ba4887b4addb91517aec2.png

Example: Underfitting vs. Overfitting#

How to evaluate if a model is underfitting/overfitting?

  • we need a cost function

  • we need test data

  • we should do error analysis

../_images/8adc2a46d9a9f983dc974b8f58b790d673da796f064d05dfb3de636cba4ba271.png


Example: Underfitting vs. Overfitting#

How to figure out if your model is overfitting?

  • error on training data is low

  • error on test data is high

→ model memorizes the noise in the data

../_images/8adc2a46d9a9f983dc974b8f58b790d673da796f064d05dfb3de636cba4ba271.png
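
This check can be sketched in a few lines. The synthetic sine data, the polynomial degrees, and the 50/50 split below are all illustrative assumptions, but the pattern — low training error and high test error for the over-complex model — is exactly the symptom described above:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(40, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.3, 40)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)

results = {}
for degree in (1, 3, 12):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    results[degree] = (
        mean_squared_error(y_train, model.predict(X_train)),
        mean_squared_error(y_test, model.predict(X_test)),
    )
    print(f"degree {degree}: train MSE {results[degree][0]:.3f}, "
          f"test MSE {results[degree][1]:.3f}")
```

The degree-12 model memorizes the training points (train MSE near zero) but generalizes poorly (much larger test MSE).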


Part 2#

A Visual Approach#

Prevent Overfitting#

If we see overfitting of our model, we could gather more data.

Prevent Overfitting#

If we see overfitting of our model, we could reduce its complexity.

HOW?

Note: Every model type has a way to reduce model complexity.
We have just learned linear regression, which is why we will concentrate on regularizing these models for now.

../_images/8adc2a46d9a9f983dc974b8f58b790d673da796f064d05dfb3de636cba4ba271.png

Prevent Overfitting#

If we see overfitting of our model, we could reduce its complexity.

HOW?

  • reduce the number of features

../_images/f7f4498d3f718836d50fa4e9360f80896f61ce958fd1bb2988caf44081ea3d82.png

Prevent Overfitting#

If we see overfitting of our model, we could reduce its complexity.

HOW?

  • reduce the number of features

  • make the model less sensitive to the training data by reducing the influence of features
    → smaller coefficients

../_images/e7ab04444b62b3d8887feae6beafb3d4fc6698436e7eeec853199f5c9c7bf3ab.png

Prevent Overfitting with Regularization#

BOTH can be achieved with regularizing a model:

  • reduce the number of features

  • make the model less sensitive to the training data by reducing the influence of features (smaller coefficients)

Part 3#

Regularization#

Regularization#

Regularization conceptually uses a constraint to prevent model coefficients from becoming too large, at a small cost in training accuracy, with the aim of producing models that generalize better to new data.

Note: Even linear models can benefit from regularization, because they tend to chase outliers in the training data.

Soft constraint#

We can add this constraint directly to our loss function as a penalty term, weighted by alpha (also written lambda):

\[\begin{split}\begin{align} Ridge (L2): J(b)&=\frac{1}{n}\sum{\big(y-(b_{0}+b_{1}x_{1}+b_{2}x_{2})\big)}^{2} + \alpha\,(b_{1}^2+b_{2}^2) \\ Lasso (L1): J(b)&=\frac{1}{n}\sum{\big(y-(b_{0}+b_{1}x_{1}+b_{2}x_{2})\big)}^{2} + \alpha\,(|b_{1}|+|b_{2}|) \end{align}\end{split}\]

Alpha is a hyperparameter: we need to set it before training the model.

Soft constraint#


What happens if we set alpha to 0?
What happens if we set alpha to a very high value?
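
One way to answer these questions is by experiment. The sketch below (on synthetic data, an illustrative assumption) compares Ridge coefficients at alpha = 0, a moderate alpha, and a very large alpha against plain ordinary least squares: with alpha = 0 the penalty vanishes and Ridge reduces to OLS, while a very large alpha shrinks the coefficients toward zero and the model underfits.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
# hypothetical true relationship: y = 3*x1 - 2*x2 + noise
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 0.5, 100)

ols = LinearRegression().fit(X, y)
print("OLS coefficients:  ", np.round(ols.coef_, 3))

for alpha in (0, 1, 1000):
    ridge = Ridge(alpha=alpha).fit(X, y)
    print(f"Ridge alpha={alpha}: ", np.round(ridge.coef_, 3))
```

At alpha = 0 the Ridge coefficients match OLS; at alpha = 1000 they are shrunk far toward zero.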

Sklearn code for regularization#

see the sklearn documentation for Ridge

from sklearn.linear_model import Ridge

ridge_mod = Ridge(alpha=1.0)  # adjust the alpha level
ridge_mod.fit(X, y)
ridge_mod.predict(X)

Some different alpha values#

We have to test several values for alpha and check which one gives us the best results on unseen data.

print('the MSE for Ridge regularization and alpha = 0.5 is:', get_mse(Ridge, X, y, polynomial=True, alpha=0.5))
print('the MSE for Lasso regularization and alpha = 0.5 is:', get_mse(Lasso, X, y, polynomial=True, alpha=0.5))
the MSE for Ridge regularization and alpha = 0.5 is: 0.2957168909631701
the MSE for Lasso regularization and alpha = 0.5 is: 0.4647391677603675
../_images/fd9d8f57079812b32430ff7d8c481a6049ba73098f0687ace35037e8b006fe7f.png
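
Instead of checking alpha values by hand, sklearn's `RidgeCV` and `LassoCV` can search a grid of alphas with built-in cross-validation. A minimal sketch on synthetic data (the alpha grid and the data itself are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
# hypothetical true coefficients; two features are irrelevant
y = X @ np.array([2.0, -1.0, 0.0, 0.0, 0.5]) + rng.normal(0, 1.0, 200)

# each estimator picks the best alpha from the grid via cross-validation
ridge_cv = RidgeCV(alphas=[0.01, 0.1, 1.0, 10.0]).fit(X, y)
lasso_cv = LassoCV(alphas=[0.01, 0.1, 1.0, 10.0], cv=5).fit(X, y)

print("best Ridge alpha:", ridge_cv.alpha_)
print("best Lasso alpha:", lasso_cv.alpha_)
```

The chosen alpha is stored in the fitted estimator's `alpha_` attribute.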

Ridge Regression#

  • Also called L2 Regularization (the penalty is the squared l2 norm of the coefficients)

  • the regularization term forces the parameter estimates to be small, shrinking them toward zero (but typically not exactly to zero)

\[J(b)=\frac{1}{n} \sum{(y-\hat{y}(b))^2}+\alpha\sum{b_{i}^2}\]

Lasso Regression#

Least Absolute Shrinkage and Selection Operator

  • Also called L1 Regularization (the penalty is the l1 norm of the coefficients)

  • Tends to eliminate weights entirely by setting them to exactly zero, i.e. it automatically performs feature selection

\[J(b)=\frac{1}{n} \sum{(y-\hat{y}(b))^2}+\alpha\sum{|b_{i}|}\]
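
The feature-selection effect can be observed directly. In the sketch below (synthetic data with three deliberately irrelevant features, an illustrative setup), Lasso sets the irrelevant coefficients to exactly zero, while Ridge only shrinks them:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
# only the first two features matter; the last three are noise
y = 4 * X[:, 0] - 3 * X[:, 1] + rng.normal(0, 0.5, 200)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.5).fit(X, y)

print("Ridge coefficients:", np.round(ridge.coef_, 3))
print("Lasso coefficients:", np.round(lasso.coef_, 3))
```

Ridge keeps small nonzero weights on all five features, whereas Lasso zeroes out the three irrelevant ones.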

Ridge vs Lasso#

../_images/684775b6940158c6051ee5bdc8abc0b21ca4ab37fb00c400e023ba144557661d.png

Elastic Net - Mixing Lasso and Ridge#

  • Regularization term is a weighted average of the Ridge and Lasso regularization terms

  • When r = 0 it is equivalent to Ridge Regression; when r = 1 it is equivalent to Lasso Regression

  • Preferable to plain Lasso when features are highly correlated or when the data is high-dimensional (more features than observations)

\[J(b)=\frac{1}{n}\sum{(y-\hat{y}(b))^{2}}+\alpha\,r\sum{|b_{i}|}+\alpha\,(1-r)\sum{b_{i}^2}\]
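
In sklearn, the mixing ratio r is the `l1_ratio` parameter of `ElasticNet`. A minimal sketch on synthetic data (an illustrative assumption):

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
# hypothetical true relationship; the third feature is irrelevant
y = 2 * X[:, 0] + X[:, 1] + rng.normal(0, 0.5, 100)

# l1_ratio=0.5 gives an even mix of the L1 and L2 penalties
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print("coefficients:", np.round(enet.coef_, 3))
```

Note that sklearn's ElasticNet objective scales the terms slightly differently from the formula above (it uses 1/(2n) on the error term and halves the L2 part), so alpha values are not directly comparable between the two.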

References#