In my previous blog on “Build your *FIRST* Machine Learning using Python“, I mentioned that the general flow of a machine learning project involves importing the data, splitting it into training and test sets, training the model to fit the data, and finally evaluating it on the test set. To improve model performance, additional steps such as hyperparameter tuning and data augmentation are integrated into this flow.
So today in this blog, I’ll walk you through an MNIST project where we use the KNeighborsClassifier to get those handwritten digits sorted. The MNIST dataset is one of those must-do projects when you’re diving into machine learning, a rite of passage of sorts. Consisting of image data, it is a great starting point not only for understanding machine learning concepts but also for understanding image data itself, right down to pixels, the building blocks of an image.
But we’re not stopping there. We’ll dive into hyperparameter tuning with GridSearchCV and experiment with a neat trick: shifting pixel data to see if we can crank up the model’s performance. Whether you’re a seasoned pro or just brushing up, this walkthrough has some cool insights you won’t want to miss! So let’s get started.
Importing the Dependencies
To work on our MNIST classification problem, we start by importing the necessary dependencies as usual.
from sklearn.datasets import fetch_openml
from sklearn.model_selection import StratifiedKFold, GridSearchCV, StratifiedShuffleSplit
from sklearn.neighbors import KNeighborsClassifier
from sklearn.base import clone
from scipy.ndimage import shift
import matplotlib.pyplot as plt
import numpy as np
import random
Here’s a brief introduction to the imported functions. Don’t worry if it seems vague for now. You’ll understand their significance once we reach their applications.
- `fetch_openml` is a convenient function from scikit-learn that loads the MNIST dataset directly from OpenML. This saves us from the hassle of manual downloading and processing.
- Next up, we bring in some essential tools for evaluating our model and tuning its hyperparameters. `StratifiedKFold` and `StratifiedShuffleSplit` help maintain balanced class distributions across training and testing, while `GridSearchCV` does the heavy lifting of finding the optimal hyperparameters.
- Our model of choice for this project is the K-Nearest Neighbors classifier, imported directly from scikit-learn. It’s a straightforward algorithm that works wonders with well-preprocessed data.
- We also import `clone` from `sklearn.base`, which allows us to make fresh copies of our model for consistent training and evaluation.
- To boost our model’s performance, we’ll be using the `shift` function from `scipy` to augment our dataset by shifting the image pixels. This helps simulate different variations of the digits.
- For visualizing our data and the results, we’ll rely on `matplotlib`, a powerful plotting library that makes it easy to create and customize plots.
- We also import `numpy` for handling numerical operations and data manipulation. It will help us convert Series-type data into arrays, which makes data processing more convenient.
- Lastly, the `random` module will help us introduce randomness into our process: each image should have its pixels shifted in a random direction, so that not all images are shifted the same way.
In case you face an error such as “No module named $ModuleName”, run this command in your virtual environment: `pip3 install -U scikit-learn scipy matplotlib numpy`. It is always recommended to work in a virtual environment, as it avoids potential conflicts between different versions of dependencies.
Now that we have our dependencies in place, we fetch the MNIST data:
mnist = fetch_openml("mnist_784", version=1)
X, y = mnist["data"], mnist["target"]
X.shape, y.shape
# [Output]: ((70000, 784), (70000,))
The MNIST data set contains 70,000 image samples and their corresponding labels. Each image consists of 784 pixels stored in flattened form; to view an image, we need to reshape it into a 28×28 matrix.
plt.imshow(X.iloc[0].to_numpy().reshape(28, 28))
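To double-check that the image and its label line up, here is a slightly fuller sketch (`cmap="binary"` simply renders the digit in grayscale):

plt.imshow(X.iloc[0].to_numpy().reshape(28, 28), cmap="binary")
plt.title("Label: " + y.iloc[0])  # y holds string labels, such as '5'
plt.axis("off")
plt.show()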
Next, we split the data.
Stratified Data Split
A traditional method for splitting data follows the 80-20 rule: 80% of randomly selected samples go to the training set and the remaining 20% to the test set. In a classification problem like this one, however, there is a chance that a disproportionate share of samples from a particular class ends up in the test set, leaving the model undertrained on that class and prone to misclassifying it.
To avoid this, it is important that the test set preserves the ratio of labels found in the full data set. This is called a stratified data split. I’ve written an entirely different blog on this topic: How to sample data *CORRECTLY* into test set? It is an interesting read and I’d recommend you check it out.
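As an aside, if all you need is a single stratified split, scikit-learn’s `train_test_split` with `stratify=y` does it in one line. A minimal sketch (the variable names are just illustrative):

from sklearn.model_selection import train_test_split

# One-line stratified 80-20 split, equivalent in spirit to the
# StratifiedShuffleSplit(n_splits=1) approach we use below.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)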
For our MNIST classification problem, let us print out the proportion of labels distributed across the data set:
for elem in range(10):
    print("Proportion of number {} in original data: {:.3f}".format(elem, sum(y == str(elem))/len(y)))
#[Output]:
#Proportion of number 0 in original data: 0.099
#Proportion of number 1 in original data: 0.113
#Proportion of number 2 in original data: 0.100
#Proportion of number 3 in original data: 0.102
#Proportion of number 4 in original data: 0.097
#Proportion of number 5 in original data: 0.090
#Proportion of number 6 in original data: 0.098
#Proportion of number 7 in original data: 0.104
#Proportion of number 8 in original data: 0.098
#Proportion of number 9 in original data: 0.099
We need to maintain this proportion in the training set as well. For this, we use the StratifiedShuffleSplit class from scikit-learn’s model_selection module.
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)

for train_index, test_index in split.split(X, y):
    strat_X_train = X.iloc[train_index]
    strat_X_test = X.iloc[test_index]
    strat_y_train = y.iloc[train_index]
    strat_y_test = y.iloc[test_index]
Let us check the distribution of the samples.
for elem in range(10):
    print("Proportion of number {} in training set: {:.3f}".format(elem, sum(strat_y_train == str(elem))/len(strat_y_train)))
#[Output]:
#Proportion of number 0 in training set: 0.099
#Proportion of number 1 in training set: 0.113
#Proportion of number 2 in training set: 0.100
#Proportion of number 3 in training set: 0.102
#Proportion of number 4 in training set: 0.097
#Proportion of number 5 in training set: 0.090
#Proportion of number 6 in training set: 0.098
#Proportion of number 7 in training set: 0.104
#Proportion of number 8 in training set: 0.098
#Proportion of number 9 in training set: 0.099
As you can see, the stratified split preserves the label distribution of the original data set in the training set. The same holds for the test set.
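If you’d like to verify that yourself, the same loop runs on the test set:

for elem in range(10):
    print("Proportion of number {} in test set: {:.3f}".format(elem, sum(strat_y_test == str(elem))/len(strat_y_test)))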
Training the KNeighborsClassifier model
Next, we will simply train the model on the training data without any preprocessing to monitor its performance. This gives us a baseline for the model’s performance and helps us gauge the improvement once the data is eventually preprocessed.
skfolds = StratifiedKFold(n_splits=3)
classifier = KNeighborsClassifier()

for train_index, test_index in skfolds.split(strat_X_train, strat_y_train):
    clone_clf = clone(classifier)
    X_train_fold = strat_X_train.iloc[train_index]
    y_train_fold = strat_y_train.iloc[train_index]
    X_test_fold = strat_X_train.iloc[test_index]
    y_test_fold = strat_y_train.iloc[test_index]

    clone_clf.fit(X_train_fold, y_train_fold)
    y_pred = clone_clf.predict(X_test_fold)
    n_correct = sum(y_pred == y_test_fold)
    print(n_correct / len(y_pred))
# [Output]:
#0.9661434617238978
#0.9682327101301762
#0.9690346083788707
This code does not look like the general “model.fit()” pattern we use to train a model. Here, we’re implementing cross-validation with the `StratifiedKFold` technique to evaluate our K-Nearest Neighbors classifier. Here’s a breakdown of what’s happening:
- StratifiedKFold Setup: We start by creating an instance of `StratifiedKFold` with `n_splits=3`. This means we’ll be splitting our data into three different folds, ensuring that each fold maintains the same proportion of classes as the original dataset. This helps in getting a reliable estimate of the model’s performance across different subsets of the data.
- Model Cloning and Training: For each fold in the split, we clone our `KNeighborsClassifier` using `clone(classifier)`. Cloning is crucial here because it allows us to train and evaluate a fresh instance of the classifier for each fold, without any residual state from previous folds affecting the results.
- Data Splitting: Within each fold, the training and testing data are separated. `X_train_fold` and `y_train_fold` are used for training the model, while `X_test_fold` and `y_test_fold` are used for testing. This ensures that our model is trained and evaluated on different subsets of the data.
- Model Training and Prediction: We fit the cloned classifier to the training data of the current fold with `clone_clf.fit(X_train_fold, y_train_fold)`. Then, we predict the labels for the test set using `clone_clf.predict(X_test_fold)`.
- Performance Evaluation: We calculate the accuracy of the model for the current fold by comparing the predicted labels (`y_pred`) to the true labels (`y_test_fold`). The number of correct predictions is counted, and accuracy is computed as the ratio of correct predictions to the total number of predictions. This is printed out for each fold, giving us an insight into how well our model performs across different data subsets.
This approach allows us to evaluate our model’s performance, providing a more comprehensive view of how it might perform on unseen data. As seen above, the model’s accuracy averages about 96.8% across the three folds. Let us try to push this up by finding the optimum hyperparameters using grid search.
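By the way, scikit-learn bundles this entire manual loop into a single helper. A minimal equivalent sketch using `cross_val_score`, reusing our `skfolds` splitter (its fold scores should match the loop above, since our splitter is deterministic):

from sklearn.model_selection import cross_val_score

# One-call equivalent of the manual StratifiedKFold loop above.
scores = cross_val_score(classifier, strat_X_train, strat_y_train, cv=skfolds, scoring="accuracy")
print(scores)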
Finding the optimum hyperparameters using Grid Search
The `GridSearchCV()` class from sklearn lets us perform the grid search method for hyperparameter search. It takes a parameter grid as an argument: a dict (or list of dicts) mapping each hyperparameter of the concerned model to the candidate values we are interested in. For KNeighborsClassifier, two commonly tuned hyperparameters are n_neighbors and weights.
param_grid = [
    {'n_neighbors': [3, 4, 5],
     'weights': ["uniform", "distance"]},
]
We create a list called `param_grid` that contains the hyperparameters n_neighbors and weights with their candidate values. `GridSearchCV` will generate 6 combinations (3 × 2) of these hyperparameters, fitting the model on the training set with a different combination in each iteration and recording its score.
grid_search = GridSearchCV(classifier, param_grid, cv=5, verbose=3)
grid_search.fit(strat_X_train, strat_y_train)
# [Output]:
#[CV 1/5] END ....n_neighbors=3, weights=uniform;, score=0.969 total time= 26.6s
#[CV 2/5] END ....n_neighbors=3, weights=uniform;, score=0.970 total time= 26.0s
#[CV 3/5] END ....n_neighbors=3, weights=uniform;, score=0.970 total time= 26.6s
#[CV 4/5] END ....n_neighbors=3, weights=uniform;, score=0.973 total time= 26.8s
#[CV 5/5] END ....n_neighbors=3, weights=uniform;, score=0.969 total time= 26.1s
#[CV 1/5] END ...n_neighbors=3, weights=distance;, score=0.970 total time= 25.8s
#[CV 2/5] END ...n_neighbors=3, weights=distance;, score=0.971 total time= 24.9s
#[CV 3/5] END ...n_neighbors=3, weights=distance;, score=0.971 total time= 26.2s
#[CV 4/5] END ...n_neighbors=3, weights=distance;, score=0.974 total time= 26.0s
#[CV 5/5] END ...n_neighbors=3, weights=distance;, score=0.970 total time= 26.0s
#[CV 1/5] END ....n_neighbors=4, weights=uniform;, score=0.968 total time= 27.6s
#[CV 2/5] END ....n_neighbors=4, weights=uniform;, score=0.968 total time= 27.6s
#[CV 3/5] END ....n_neighbors=4, weights=uniform;, score=0.970 total time= 27.7s
#[CV 4/5] END ....n_neighbors=4, weights=uniform;, score=0.971 total time= 27.8s
#[CV 5/5] END ....n_neighbors=4, weights=uniform;, score=0.968 total time= 27.7s
#[CV 1/5] END ...n_neighbors=4, weights=distance;, score=0.972 total time= 27.1s
#[CV 2/5] END ...n_neighbors=4, weights=distance;, score=0.972 total time= 27.4s
#[CV 3/5] END ...n_neighbors=4, weights=distance;, score=0.972 total time= 27.4s
#[CV 4/5] END ...n_neighbors=4, weights=distance;, score=0.976 total time= 28.1s
#[CV 5/5] END ...n_neighbors=4, weights=distance;, score=0.971 total time= 27.9s
#[CV 1/5] END ....n_neighbors=5, weights=uniform;, score=0.968 total time= 27.4s
#[CV 2/5] END ....n_neighbors=5, weights=uniform;, score=0.968 total time= 27.6s
#[CV 3/5] END ....n_neighbors=5, weights=uniform;, score=0.969 total time= 28.4s
#[CV 4/5] END ....n_neighbors=5, weights=uniform;, score=0.972 total time= 28.4s
#[CV 5/5] END ....n_neighbors=5, weights=uniform;, score=0.969 total time= 27.8s
#[CV 1/5] END ...n_neighbors=5, weights=distance;, score=0.969 total time= 27.3s
#[CV 2/5] END ...n_neighbors=5, weights=distance;, score=0.970 total time= 27.3s
#[CV 3/5] END ...n_neighbors=5, weights=distance;, score=0.971 total time= 27.6s
#[CV 4/5] END ...n_neighbors=5, weights=distance;, score=0.973 total time= 28.2s
#[CV 5/5] END ...n_neighbors=5, weights=distance;, score=0.971 total time= 27.8s
As we have specified 5 cross-validation folds, the function evaluates each of the 6 combinations on 5 folds, resulting in a total of 30 fits. You can now sit back and relax for a bit, as this process can take 30 minutes or more depending on your CPU.
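If that wait bothers you, note that `GridSearchCV` accepts an `n_jobs` parameter; setting it to -1 spreads the candidate fits across all available CPU cores. A sketch of this option (not what we ran above, and it changes only the runtime, not the scores):

grid_search = GridSearchCV(classifier, param_grid, cv=5, verbose=3, n_jobs=-1)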
After the iterations are done, we check the best parameters and the best score of the grid search.
print("Best Params: {}\nBest Score: {}".format(grid_search.best_params_, grid_search.best_score_))
#[Output]:
#Best Params: {'n_neighbors': 4, 'weights': 'distance'}
#Best Score: 0.9723928571428573
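Worth knowing: since `refit=True` by default, `grid_search` also retrains a final model on the whole training set using the winning parameters, so the tuned classifier is directly available (`best_model` below is just an illustrative name):

# The grid search refits the best configuration on all of strat_X_train.
best_model = grid_search.best_estimator_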
So our model performs best with n_neighbors set to 4 and weights set to “distance”, which achieved a cross-validated accuracy of about 97.24%. Let us carry out the cross-validation check as we did previously.
classifier_2 = KNeighborsClassifier(n_neighbors=4, weights='distance')

# Note: this loop cross-validates over the full data set X, y.
for train_index, test_index in skfolds.split(X, y):
    clone_clf = clone(classifier_2)
    X_train_fold = X.iloc[train_index]
    y_train_fold = y.iloc[train_index]
    X_test_fold = X.iloc[test_index]
    y_test_fold = y.iloc[test_index]

    clone_clf.fit(X_train_fold, y_train_fold)
    y_pred = clone_clf.predict(X_test_fold)
    n_correct = sum(y_pred == y_test_fold)
    print(n_correct / len(y_pred))
#[Output]:
#0.9717150938544613
#0.9692281318304548
#0.9700424291775597
The result averages out to 97.03%. This is a small leap, yet certainly better than the roughly 96.8% scored without hyperparameter tuning.
We will proceed and take it a step further by augmenting the data.
Augmenting Image Data
To make the model more robust, we augment our MNIST data by shifting each image’s pixels in a random direction, using the `shift()` function from the scipy library.
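Before wiring it into our data, here is a minimal sketch of what `shift` does on a toy array: an integer shift moves the rows and columns, and the vacated cells are filled with `cval`.

# Shift a 3x3 identity matrix one row down; vacated cells become 0.
demo = np.eye(3)
print(shift(demo, [1, 0], cval=0))
# [[0. 0. 0.]
#  [1. 0. 0.]
#  [0. 1. 0.]]

With that intuition, we define a helper that shifts a full MNIST image by a random offset of up to four pixels along each axis: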
def shift_image_pxls(image):
    # Reshape the flattened 784-pixel row to 28x28, shift it by a random
    # offset of up to 4 pixels along each axis (vacated cells filled with 0),
    # then flatten it back.
    augmented_image = shift(image.to_numpy().reshape(28, 28),
                            [random.randint(-4, 4), random.randint(-4, 4)], cval=0)
    return augmented_image.reshape([-1])

augmented_X_train = np.zeros((strat_X_train.shape[0], strat_X_train.shape[1]))
for image_idx in range(strat_X_train.shape[0]):
    augmented_X_train[image_idx] = shift_image_pxls(strat_X_train.iloc[image_idx])
Now that we have augmented the stratified training data, we will visualize one of the images. Note that `strat_X_train`, previously a pandas data structure, has after augmentation become a NumPy array (`augmented_X_train`). For consistency, `strat_y_train` should also be converted into an array.
plt.imshow(augmented_X_train[0].reshape(28, 28))
strat_y_train = strat_y_train.to_numpy()
print(strat_y_train[0])
It is now time for the test. Let us check the performance of a model trained on this augmented data.
optimized_classifier = KNeighborsClassifier(n_neighbors=4, weights='distance')
optimized_classifier.fit(augmented_X_train, strat_y_train)
# X_test_fold and y_test_fold here are the last fold left over from the
# earlier cross-validation loop on the full data set.
y_pred = optimized_classifier.predict(X_test_fold)
n_correct = sum(y_pred == y_test_fold)
print(n_correct / len(y_pred))
#[Output]: 1
100% accuracy, hurray! The model fits perfectly for MNIST classification, or does it? One caveat: the test fold used here was carved from the full data set in the previous loop, so it overlaps the very samples our augmented training set was built from, which likely inflates the score. And even setting that aside, accuracy alone cannot be termed the “end” of the problem: there are other things to consider in a classification task, such as precision, recall, and the confusion matrix, which go beyond the scope of this blog. However, if you are interested, I have explained those topics in a different blog independent of this MNIST example, go check it out: Confusion Matrix 101: Understanding Precision and Recall for Machine Learning Beginners.
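If you want a fairer number, here is a sketch (I make no claim about its output) that trains on the augmented data and scores against the untouched stratified test set we set aside at the start:

# Hedged sanity check: evaluate on the held-out stratified test set.
final_clf = KNeighborsClassifier(n_neighbors=4, weights='distance')
final_clf.fit(augmented_X_train, strat_y_train)
y_pred = final_clf.predict(strat_X_test.to_numpy())
print(sum(y_pred == strat_y_test.to_numpy()) / len(y_pred))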
Summary
In this lengthy blog, we initially trained a KNeighborsClassifier model on the MNIST data set and monitored its performance using cross-validation. We then used the grid search approach to find the optimum hyperparameters and enhance the model’s performance. To make the model more robust, we finally augmented the MNIST data by shifting the pixels in a random direction and trained a new model with the optimized hyperparameters.
This blog aimed to introduce the grid search approach as a way to arrive at a model with the most appropriate hyperparameters. And as far as classification problems like this one are concerned, there are more metrics to focus on than accuracy alone, namely precision, recall, and the confusion matrix.
If you enjoyed this blog, please leave a follow on my social media. It will not cost you anything but that single addition to my subscribers/followers list will motivate me to create interesting content.