MSE vs RMSE: Which is the Right Error Metric for Machine Learning?

When you enter the world of machine learning, one of the first concepts you encounter is error metrics, or cost functions. These metrics quantify model performance by measuring the difference between predicted values (commonly known as the hypothesis) and actual values. Two commonly used metrics are Mean Squared Error (MSE) and Root Mean Squared Error (RMSE). At first glance, the two might seem very similar; after all, RMSE is just the square root of MSE. However, they serve different purposes and have distinct characteristics.


In this blog post, we’ll explore why Mean Squared Error can be a better cost function than RMSE for certain tasks. We’ll delve into the nuances of both metrics, provide examples to illustrate their differences, and conclude with some practical advice on when to use one over the other.

Understanding the Fundamentals

Before we start with the comparison, it’s important to have a proper understanding of what each metric means.

Mean Squared Error

MSE is calculated by taking the average of the squared differences between the hypothesis (predicted values) and the ground truth (actual values). Mathematically, it is represented as:

\(
\begin{equation}
\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i – y_i )^2
\end{equation}
\)

The squaring of errors serves two purposes here: first, large errors, such as those caused by outliers, are amplified, making the metric more sensitive to outliers. Second, the result is always non-negative, so positive and negative errors cannot cancel each other out.
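
To make this concrete, here is a minimal NumPy sketch of the computation (the function name and inputs are just for illustration):

```python
import numpy as np

def mse(y_pred, y_true):
    """Mean Squared Error: the average of the squared residuals."""
    y_pred = np.asarray(y_pred, dtype=float)
    y_true = np.asarray(y_true, dtype=float)
    return np.mean((y_pred - y_true) ** 2)
```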

Root Mean Squared Error

Taking the square root of the expression above yields RMSE. Mathematically, it can be described as:

\(
\begin{equation}
\text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i – y_i)^2}
\end{equation}
\)
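
Reusing the hypothetical mse helper sketched above, RMSE is just one extra square root:

```python
import numpy as np

def rmse(y_pred, y_true):
    """Root Mean Squared Error: the square root of the MSE."""
    return np.sqrt(mse(y_pred, y_true))
```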

Despite this minor mathematical difference, the two metrics behave quite differently during training, as we will see in the following section.

Why is MSE relatively better than RMSE?

Now we will see why mean squared error can be considered a relatively better metric when it comes to training machine learning models:

1. Linearity

During the training phase of a machine learning model, the weights are optimized using the derivative of the cost function with respect to the concerned parameter. In the case where the cost function is MSE, its derivative with respect to the hypothesis can be formulated as:

\(
\begin{equation}
\frac{d \text{MSE}}{d \hat{y}_i} = \frac{2}{n} (\hat{y}_i – y_i)
\end{equation}
\)

Doing the same for RMSE, we get the following equation:

\(
\begin{equation}
\frac{d \text{RMSE}}{d \hat{y}_i} = \frac{1}{2} \left(\frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i – y_i)^2\right)^{-\frac{1}{2}} \cdot \frac{2}{n} (\hat{y}_i – y_i)
\end{equation}
\)

\(
\begin{equation}
= \frac{1}{n \times \text{RMSE}} (\hat{y}_i – y_i)
\end{equation}
\)

Comparing the two equations above, we can see that the derivative of MSE is linear in the residual \((\hat{y}_i – y_i)\), which leads to smooth, predictable updates during optimization. The derivative of RMSE, on the other hand, contains the RMSE term itself in the denominator. This makes the gradient non-linear in the residual and ties every update to the current overall error, so the model typically needs more time to converge during optimization.
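
A small numerical check (a sketch with made-up numbers, implementing the two gradient formulas above) makes the difference visible: the MSE gradient grows in proportion to the residual, while the RMSE gradient is rescaled by the current RMSE value.

```python
import numpy as np

def mse_grad(y_pred, y_true):
    # d(MSE)/d(y_hat_i) = (2/n) * (y_hat_i - y_i): linear in the residual
    n = len(y_pred)
    return (2.0 / n) * (y_pred - y_true)

def rmse_grad(y_pred, y_true):
    # d(RMSE)/d(y_hat_i) = (y_hat_i - y_i) / (n * RMSE): rescaled by the overall error
    n = len(y_pred)
    rmse_value = np.sqrt(np.mean((y_pred - y_true) ** 2))
    return (y_pred - y_true) / (n * rmse_value)

y_true = np.array([1.0, 2.0, 3.0])
for offset in (1.0, 10.0):
    y_pred = y_true + offset  # every prediction off by `offset`
    print(offset, mse_grad(y_pred, y_true), rmse_grad(y_pred, y_true))

# The MSE gradient is 10x larger when the residual is 10x larger,
# while the RMSE gradient stays the same here because RMSE itself grows with the residual.
```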

2. Computational Efficiency

As seen in the section above, mean squared error yields a linear gradient during optimization, which leads to faster convergence of the model. To put it differently, fewer calculations are required for the model to reach the minimum. This lowers the consumption of computational power, especially when the dataset consists of a very large number of samples (we are talking millions!). The time saved can then be used for fine-tuning the model by controlling the hyperparameters.

3. Consistency in penalizing errors

MSE penalizes every error according to its squared size, so the model is consistently pushed to reduce all errors, and large errors dominate the cost. RMSE, due to the square root, can understate the penalty for large errors when considering the gradient, since the gradient is scaled down by the current overall error.

To illustrate this, consider the following example:

  • Scenario 1: Predicted values are consistently off by 2 units.
  • Scenario 2: Most predictions are accurate, but a few are off by 10 units.

In both scenarios, MSE squares the differences, effectively making the cost of the large errors much higher. RMSE, while still emphasizing large errors, does so to a slightly lesser extent because the square root reduces the magnitude of the penalty that is backpropagated.

In scenarios where every error needs to be treated with equal gravity, MSE’s consistency is beneficial, as the small numerical comparison below illustrates.
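
Here is a quick back-of-the-envelope comparison (assuming, purely for illustration, ten samples per scenario, with two of them off by 10 units in Scenario 2):

```python
import numpy as np

# Scenario 1: every one of 10 predictions is off by 2 units
errors_scenario_1 = np.full(10, 2.0)
# Scenario 2: 8 predictions are exact, 2 are off by 10 units (an assumed split)
errors_scenario_2 = np.array([0.0] * 8 + [10.0, 10.0])

for name, errors in [("Scenario 1", errors_scenario_1), ("Scenario 2", errors_scenario_2)]:
    mse_value = np.mean(errors ** 2)
    print(f"{name}: MSE = {mse_value:.2f}, RMSE = {np.sqrt(mse_value):.2f}")

# Scenario 1: MSE = 4.00,  RMSE = 2.00
# Scenario 2: MSE = 20.00, RMSE = 4.47
```

The squared cost lets the few large errors in Scenario 2 dominate the MSE, while the square root compresses that difference in the RMSE.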

Understanding the Comparison using an Example

Let us consider a simple linear regression problem of predicting house prices based on the feature “square footage”.

Square Footage    Actual Price (in $)    Predicted Price (in $)
1000              150,000                148,000
1500              200,000                195,000
2000              250,000                260,000
2500              300,000                305,000
3000              350,000                360,000

First, we will calculate the MSE on this given data:

\(
\begin{equation}
\text{MSE} = \frac{(150,000 – 148,000)^2 + (200,000 – 195,000)^2 + (250,000 – 260,000)^2 + (300,000 – 305,000)^2 + (350,000 – 360,000)^2}{5}
\end{equation}
\)

\(
\begin{equation}
= \frac{254,000,000}{5}
\end{equation}
\)

\(
\begin{equation}
= 50,800,000
\end{equation}
\)

Taking the square root will give us the RMSE value:

\(
\begin{equation}
\text{RMSE} = \sqrt{50,800,000} \approx 7,127.41
\end{equation}
\)

In this case, the RMSE of roughly $7,127 is directly interpretable because it is expressed in the same units as the house prices, while the MSE is the raw, unscaled quantity (in squared dollars) that reflects the model’s overall performance and that the optimizer actually minimizes.

Moreover, during model optimization, as the error values suggest, mean squared error penalizes these errors with a higher weight than RMSE does. Its simple, linear gradient also makes it easier to understand how adjustments to the model’s parameters affect the overall cost.
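
The numbers above can be reproduced with a few lines of NumPy (a minimal sketch of the worked example):

```python
import numpy as np

actual    = np.array([150_000, 200_000, 250_000, 300_000, 350_000], dtype=float)
predicted = np.array([148_000, 195_000, 260_000, 305_000, 360_000], dtype=float)

mse_value = np.mean((predicted - actual) ** 2)
rmse_value = np.sqrt(mse_value)

print(f"MSE:  {mse_value:,.0f}")   # 50,800,000
print(f"RMSE: {rmse_value:,.2f}")  # 7,127.41
```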

Summary

In conclusion, while RMSE is certainly useful, mean-squared error often holds the upper hand as a cost function during the training phase of a machine learning model. Its mathematical simplicity, computational efficiency, and consistency in error penalization make it a solid choice, especially in large-scale machine learning applications.

That said, the choice between these metrics ultimately depends on the specific context of your problem. If interpretability is crucial, RMSE is more appropriate because it is expressed in the same units as the target variable. However, for pure optimization and training efficiency, mean squared error is often the go-to metric.

I encourage you to experiment with these metrics in your machine learning models and compare them by monitoring the convergence graph during the training phase. Feel free to share your results with me on Instagram: @machinelearningsite, and while you are there, do click on that “Follow” button. Happy modeling!

If you are interested in more blogs on machine learning, consider having a look at “Vanishing Gradient Problem Explained” which explains a commonly experienced issue in deep neural networks. Besides, you can also have a look at “Your Ultimate Machine Learning Roadmap and Top Courses Guide” if you are just starting in the field and want to have an overview of the path to follow.

If you enjoyed this blog, please do leave a follow on my social media as the increase in followers, no matter how small, shows that people do enjoy the blog. This motivates me to make more interesting content. Thank you 🙂
