Understanding Regularization in Machine Learning: Ridge, Lasso, and Elastic Net

A machine learning model learns from the data it is trained on and should be able to generalize beyond it. When a new data sample is introduced, the model should still yield satisfactory results. In practice, a model sometimes performs very well on the training set but fails to perform well on the validation set; such a model is said to be overfitting. Conversely, if the model performs poorly even on the training data, it is said to be underfitting.

The solution to an underfitting model is simple: make the model more complex, for example by adding more layers (in a neural network) or increasing the degree of the polynomial (in polynomial regression), so that it can learn more from the training data. Overfitting, on the other hand, can be addressed with data augmentation when the data is spatial, such as images. For simpler models such as linear regression, overfitting can be tackled by penalizing the model during training. This process of tackling overfitting is called regularization.

In this blog, we will understand how to penalize the model during training so that it avoids overfitting. We will learn about three types of regularized linear regression, then understand how and, more importantly, why regularization works. Finally, we will discuss which regularization technique should be used in which case.

Ridge Regression (L2 Regularization)

Ridge regression adds a penalty on the size of the coefficients in the regression model. It minimizes the sum of squared residuals plus a penalty term proportional to the sum of the squared coefficients. One major benefit of ridge regression is that it keeps the weights as small as possible: during gradient descent, the derivative of the regularization term acts as an additional penalty that shrinks the weights further than unregularized training would.

Cost = Σ (y_i − ŷ_i)² + λ Σ β_j²

Here, β_j is the coefficient for feature j, and λ (the regularization parameter) controls the amount of shrinkage applied to the coefficients. A higher λ means more regularization, but too high a value of λ can lead to underfitting. Below is a small Python code example to try out ridge regression:

from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_regression

# Generate a synthetic regression dataset
X, y = make_regression(n_samples=100, n_features=5, noise=0.1, random_state=42)

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Apply Ridge Regression
ridge = Ridge(alpha=1.0)  # Alpha is the regularization strength (λ)
ridge.fit(X_train, y_train)

# Predictions and model score
y_pred = ridge.predict(X_test)
print("Ridge Regression Coefficients:", ridge.coef_)
print("Model Score:", ridge.score(X_test, y_test))

'''
[Output]:
Ridge Regression Coefficients: [59.87954432 97.15091098 63.24364738 56.31999433 35.34591136]
Model Score: 0.9997951390892165
'''
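
To see the shrinkage effect more directly, here is a minimal sketch that reuses the X_train and y_train split from above; the specific alpha values are illustrative choices, and a larger alpha (i.e., a larger λ) should pull the coefficients closer to zero:

import numpy as np
from sklearn.linear_model import Ridge

# Larger alpha means a stronger penalty, so the overall size of the coefficients shrinks
for alpha in [0.01, 1.0, 100.0]:
    model = Ridge(alpha=alpha).fit(X_train, y_train)
    print(f"alpha={alpha:>6}: L2 norm of coefficients = {np.linalg.norm(model.coef_):.2f}")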

Lasso Regression (L1 Regularization)

Lasso regression adds a penalty equal to the sum of the absolute values of the coefficients. Unlike Ridge regression, Lasso can shrink some coefficients to exactly zero, thus performing feature selection.

Cost = Σ (y_i − ŷ_i)² + λ Σ |β_j|

The example below reuses the train/test split from the Ridge section:

from sklearn.linear_model import Lasso

# Apply Lasso Regression
lasso = Lasso(alpha=0.1)  # Alpha is the regularization strength (λ)
lasso.fit(X_train, y_train)

# Predictions and model score
y_pred = lasso.predict(X_test)
print("Lasso Regression Coefficients:", lasso.coef_)
print("Model Score:", lasso.score(X_test, y_test))

'''
[Output]:
Lasso Regression Coefficients: [60.50305581 98.52475354 64.3929265  56.96061238 35.52928502]
Model Score: 0.999996831795937
'''
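
To illustrate the feature-selection behaviour, here is a small sketch on a separate synthetic dataset (the dataset and the alpha value are illustrative choices, not part of the example above) where only 3 of the 20 features actually carry signal:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Only 3 of the 20 features are informative; Lasso should zero out most of the rest
X_sparse, y_sparse = make_regression(n_samples=200, n_features=20, n_informative=3,
                                     noise=5.0, random_state=0)
lasso_fs = Lasso(alpha=1.0).fit(X_sparse, y_sparse)
print("Non-zero coefficients:", int(np.sum(lasso_fs.coef_ != 0)), "out of", X_sparse.shape[1])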

Elastic Net

Elastic Net combines the penalties of both Ridge (L2) and Lasso (L1) regression. It’s useful when you have many features, and some of them are highly correlated. Elastic Net seeks a middle ground, using both the L1 and L2 penalties to create a balanced model.

Cost = Σ (y_i − ŷ_i)² + λ1 Σ |β_j| + λ2 Σ β_j²

  • λ1 controls the Lasso (L1) penalty.
  • λ2 controls the Ridge (L2) penalty.

In scikit-learn's ElasticNet, a single alpha sets the overall penalty strength and l1_ratio sets the mix between the L1 and L2 terms:

from sklearn.linear_model import ElasticNet

# Apply Elastic Net Regression
elastic_net = ElasticNet(alpha=0.1, l1_ratio=0.5)  # l1_ratio controls the mix of Lasso and Ridge
elastic_net.fit(X_train, y_train)

# Predictions and model score
y_pred = elastic_net.predict(X_test)
print("Elastic Net Regression Coefficients:", elastic_net.coef_)
print("Model Score:", elastic_net.score(X_test, y_test))

'''
[Output]:
Elastic Net Regression Coefficients: [57.82265466 92.85730675 59.52004537 54.16738219 34.50996407]
Model Score: 0.9969562697153884
'''
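
To make the point about highly correlated features concrete, here is a rough sketch in which two near-duplicate features are constructed purely for illustration. Lasso tends to keep one of the correlated columns and drop the other, while Elastic Net tends to share the weight between them:

import numpy as np
from sklearn.linear_model import Lasso, ElasticNet

rng = np.random.RandomState(0)
x1 = rng.randn(200)
x2 = x1 + 0.01 * rng.randn(200)             # near-duplicate of x1 (highly correlated)
X_corr = np.column_stack([x1, x2])
y_corr = 3 * x1 + 0.1 * rng.randn(200)      # target depends only on the shared signal

print("Lasso coefficients:      ", Lasso(alpha=0.1).fit(X_corr, y_corr).coef_)
print("Elastic Net coefficients:", ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X_corr, y_corr).coef_)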

Why Penalizing the Model Through Regularization Works

In most machine learning problems, we aim to build a model that can generalize well to unseen data. However, if the model is too flexible or complex, it might overfit the training data. This means that the model learns not only the true patterns in the data but also the noise or random fluctuations specific to the training set. While this might result in high accuracy on the training data, the model often performs poorly on new, unseen data because it hasn't learned the true underlying relationships.

For example, in a linear regression problem, a highly flexible model might assign large coefficients to features, fitting the training data almost perfectly. However, this can cause the model to become sensitive to small variations in the input data, leading to high variance and poor generalization.

Regularization improves the generalization of a model by preventing it from focusing too much on the training data’s specific patterns. A model with small coefficients is less sensitive to the variations in the data and is less likely to overfit. The regularization parameter λ (also known as alpha in some implementations) controls the strength of this penalty. By tuning λ, we can control how much we penalize large coefficients.

In Ridge regression (L2), the penalty term is the sum of the squared coefficients, which tends to shrink all coefficients but doesn’t remove any feature completely. This works well when all features contribute to the outcome but to varying degrees. In contrast, Lasso regression (L1) uses the sum of the absolute values of the coefficients as the penalty, which can force some coefficients to be exactly zero, effectively performing feature selection.
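
The contrast above can be seen in a single rough experiment. The sketch below uses a separate, illustrative dataset with more features than training samples: plain linear regression fits the training data almost perfectly yet scores worse on the test set, while Ridge and Lasso keep the coefficients small, and only Lasso produces exactly-zero coefficients:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import train_test_split

# 30 training samples but 40 features: an unregularized model can fit the noise
X_demo, y_demo = make_regression(n_samples=50, n_features=40, n_informative=5,
                                 noise=20.0, random_state=7)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.4, random_state=7)

for name, model in [("LinearRegression", LinearRegression()),
                    ("Ridge(alpha=10)", Ridge(alpha=10.0)),
                    ("Lasso(alpha=1)", Lasso(alpha=1.0))]:
    model.fit(Xtr, ytr)
    zeros = int(np.sum(model.coef_ == 0))
    print(f"{name}: train R^2 = {model.score(Xtr, ytr):.2f}, "
          f"test R^2 = {model.score(Xte, yte):.2f}, zero coefficients = {zeros}")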

When to Use Each Regularization Technique?

The fundamental principle of all regularization techniques is the same: penalize the model so that it does not overfit the data. However, depending on the number of samples and features in the training data, choosing the right technique can be a deciding factor in model performance (a short cross-validation sketch follows the list below):

  • Ridge Regression: Use when you want to shrink coefficients but retain all features in the model. Works well when features are highly correlated.
  • Lasso Regression: Use when you want automatic feature selection. It’s good for sparse models where only a few features contribute to the outcome.
  • Elastic Net: Use when you have many features and expect multicollinearity, but also want some form of feature selection.
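
One practical way to decide is to compare cross-validated scores on your own training data. The sketch below reuses the synthetic X and y generated earlier in this post, along with the alpha and l1_ratio values from the earlier examples; in practice you would tune these as well:

from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.model_selection import cross_val_score

candidates = {
    "Ridge": Ridge(alpha=1.0),
    "Lasso": Lasso(alpha=0.1),
    "Elastic Net": ElasticNet(alpha=0.1, l1_ratio=0.5),
}

# Compare mean cross-validated R^2 before committing to one model
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    print(f"{name}: mean CV R^2 = {scores.mean():.4f}")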

Summary

Regularized regression helps prevent overfitting, which occurs when a model becomes too complex and captures noise in the training data, leading to poor performance on unseen data. Regularization addresses this by adding a penalty to the model’s loss function, based on the size of its coefficients.

In Ridge regression (L2), this penalty is proportional to the squared coefficients, shrinking them but retaining all features. Lasso regression (L1), on the other hand, uses the absolute values of the coefficients, which can reduce some to zero, effectively performing feature selection. Elastic Net combines the benefits of both L1 and L2 regularization, providing a balanced approach, especially when dealing with correlated and sparse features.

By simplifying the model, regularization improves generalization and helps the model perform better on new data. The regularization strength is controlled by a parameter λ, which can be tuned to balance the trade-off between bias and variance, promoting a more robust and reliable solution.

If you enjoyed this blog, do leave a follow on my social media. Every single addition to my list of followers gives me a significant amount of joy and the motivation to continue publishing such interesting and helpful content.
