When we talk about classification problems, the model we usually reach for first is logistic regression: it is simple to implement and performs well on large datasets. But there is another classification model that differs from logistic regression in some interesting ways: the Support Vector Machine.
Support Vector Machines are machine learning models that are also used for classification purposes. In this blog, we will understand what SVMs are, how they work, how they differ from the good ol’ logistic regression, and we will also do a small exercise. All the code in this blog is available on my GitHub.
Machine learning models improve with more data. Your feed improves with more of my posts. Follow @machinelearningsite for programming memes, code snippets, and ML tricks—no overfitting, just pure value.
What is Support Vector Machine?
As I mentioned earlier, Support Vector Machines, or SVMs, are supervised machine learning models used for classification tasks. An SVM works by finding an optimal “hyperplane” that best separates the data points into distinct classes. In other words, it finds an optimum decision boundary that maintains a significant distance from the nearest data points of either class.
Consider the dataset in the following image (left). Multiple decision boundaries can separate the two classes in the data set. Let us pick one such decision line, as illustrated in the image on the right:


The decision boundary illustrated above is able to classify the existing samples correctly. The problem, however, occurs when a new sample is introduced that this line cannot handle, as shown in the image below:

The chosen boundary line misclassifies the new sample of class 0 as class 1, so the decision boundary would need to be updated. Redrawing the line every time a new sample arrives, however, is not a practical solution.
To avoid this issue of picking an arbitrary decision boundary, SVM selects one that not only classifies the samples correctly but also maintains the largest possible margin from the nearest data points of each class. These points are called support vectors.
Hard Margin and Soft Margin in Support Vector Machine
As mentioned in the earlier section, the optimum decision boundary maintains a particular margin between itself and the support vectors. Depending on how strictly the data is separated, SVM models can be categorized into Hard Margin and Soft Margin classifiers.
Hard Margin SVM
A hard margin SVM requires that the data be perfectly linearly separable, meaning there should be a clear, unbroken boundary between classes without any overlap or misclassified points. In a hard margin setup, SVM aims to maximize the margin as much as possible, ensuring that no data points fall within the margin or on the wrong side of the hyperplane. This type of SVM works best when the data is clean and well-separated, without noise or overlapping classes.
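If you prefer to see the math, the hard margin SVM can be written as the standard textbook optimization problem below. This formulation is not part of the code later in this blog, and it assumes the class labels are encoded as $y_i \in \{-1, +1\}$; the decision boundary is the hyperplane $w^\top x + b = 0$:
$$
\min_{w,\,b}\ \frac{1}{2}\lVert w \rVert^2
\quad \text{subject to} \quad
y_i\,(w^\top x_i + b) \ge 1 \ \text{for all } i
$$
Maximizing the margin is equivalent to minimizing $\lVert w \rVert$, because the distance between the two supporting hyperplanes $w^\top x + b = \pm 1$ works out to $2/\lVert w \rVert$.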

In the example illustrated above, a hard margin classifier performs well until samples are introduced that make the data no longer clearly linearly separable. This is where the soft margin classifier comes in.
Soft Margin SVM
A Soft Margin SVM aims to keep the margin as large as possible while simultaneously allowing some inaccuracies by introducing slack variables. This approach creates a more flexible boundary where some data points can lie within the margin or even on the wrong side of the hyperplane, allowing the model to tolerate a certain degree of error. This is demonstrated in the image below.

A soft margin parameter, also known as the penalty term “C”, controls the trade-off between maximizing the margin and minimizing the classification error. For high values of C, the model is stricter, allowing fewer misclassifications and a narrower margin. Lower values of C, on the other hand, result in a more lenient model that tolerates more misclassifications and allows a wider margin, which can improve generalization for noisy or overlapping data.
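In formula form, the soft margin version (again the standard textbook formulation, with the same notation and label encoding as above) adds a slack variable $\xi_i$ for each sample and lets C decide how expensive those slacks are:
$$
\min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w \rVert^2 + C \sum_i \xi_i
\quad \text{subject to} \quad
y_i\,(w^\top x_i + b) \ge 1 - \xi_i,\ \ \xi_i \ge 0
$$
A large C makes every slack expensive, so the optimizer tolerates few violations and the margin shrinks; a small C makes slack cheap, giving a wider margin at the cost of more misclassifications.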
Hands-On Tutorial on SVM
A well-known classification example of Support Vector Machine is the famous iris dataset which contains the geometrical data of three species of iris flower, namely iris-virginica, iris-setosa and iris-versicolor. The geometrical data available are sepal length, sepal width, petal length and petal width. Based on these features, the task is to identify the species of a given iris flower.
We start by importing the data and printing the classes available in the dataset. Although we already have some information about the data, let us assume for a moment that we are unaware of it. In such a situation, it is always good to check the number of classes to determine whether the model should be a binary or a multi-class classification model.
import pandas as pd
import seaborn as sns
# matplotlib is needed for the plots further below
import matplotlib.pyplot as plt
# Import the dataset using Pandas library
data = pd.read_csv('IRIS.csv')
# Knowing the available features in our data
print(data.head())
# Knowing the number of available classes
print(data["species"].value_counts())
Next, we plot the data to understand if the data is distinctly classified or if the classes overlap in some regions.
fig = data[data.species=='Iris-setosa'].plot(kind='scatter',x='sepal_length',y='sepal_width',color='orange', label='Setosa')
data[data.species=='Iris-versicolor'].plot(kind='scatter',x='sepal_length',y='sepal_width',color='blue', label='versicolor',ax=fig)
data[data.species=='Iris-virginica'].plot(kind='scatter',x='sepal_length',y='sepal_width',color='green', label='virginica', ax=fig)
fig.set_xlabel("Sepal Length")
fig.set_ylabel("Sepal Width")
fig.set_title("Sepal Length VS Width")
fig.grid()
fig=plt.gcf()
fig.set_size_inches(10,6)
plt.show()

In this plot of sepal length vs. sepal width, the class ‘Iris-setosa’ is almost clearly distinguishable, except for that one sample near the bottom right corner. The other two classes, however, are mixed with each other. But remember, we have four features and this plot uses only two of them. Let us plot again, this time with petal length and sepal length as features.
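This plot can be produced with the same pattern as the snippet above; a minimal sketch, assuming the same data DataFrame, looks like this:
# Scatter plot of petal length vs. sepal length for the three species
fig = data[data.species=='Iris-setosa'].plot(kind='scatter', x='petal_length', y='sepal_length', color='orange', label='Setosa')
data[data.species=='Iris-versicolor'].plot(kind='scatter', x='petal_length', y='sepal_length', color='blue', label='versicolor', ax=fig)
data[data.species=='Iris-virginica'].plot(kind='scatter', x='petal_length', y='sepal_length', color='green', label='virginica', ax=fig)
fig.set_xlabel("Petal Length")
fig.set_ylabel("Sepal Length")
fig.set_title("Petal Length VS Sepal Length")
fig.grid()
fig = plt.gcf()
fig.set_size_inches(10, 6)
plt.show()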

This provides a better picture. Setosa is very distinguishable, and versicolor and virginica are also almost distinct.
Next, we split the data into training and test sets. One needs to be careful at this step. Why? Let us see.
from sklearn.model_selection import train_test_split
# Splitting the independent variables from dependent variables
X = data.iloc[:,:-1]
y = data.iloc[:,4]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)
# Check the class distribution in the original data and in the two splits
print("\nOriginal Data:")
print(data["species"].value_counts())
X_train['target'] = y_train
X_test['target'] = y_test
print("\nX_train:")
print(X_train["target"].value_counts())
print("\nX_test:")
print(X_test["target"].value_counts())
'''
Original Data:
species
Iris-setosa 50
Iris-versicolor 50
Iris-virginica 50
Name: count, dtype: int64
X_train:
target
Iris-setosa 39
Iris-virginica 37
Iris-versicolor 29
Name: count, dtype: int64
X_test:
target
Iris-versicolor 21
Iris-virginica 13
Iris-setosa 11
Name: count, dtype: int64
'''
We use the train_test_split() function of sklearn to split our data. Notice the distribution of classes in the original data set. On printing, it becomes evident that this distribution is not maintained in the training set or in the test set. Training a model on data whose distribution deviates from the original data leads to inaccuracies.
Stratified splitting is the data splitting approach that prevents this problem. It takes the target column and maintains the same distribution of classes in the training as well as the test set. In this example, the appropriate split maintains the same distribution of the target class, i.e. “species”, throughout the data sets. To implement this, the train_test_split() function accepts the argument stratify as the column to be used for the split.
In the code above, replace the train_test_split() call with the following:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.30, stratify=y)
Let us check the distribution again.
print("\nOriginal Data:")
print(data["species"].value_counts())
X_train['target'] = y_train
X_test['target'] = y_test
print("\nX_train:")
print(X_train["target"].value_counts())
print("\nX_test:")
print(X_test["target"].value_counts())
X_train = X_train.drop(columns=["target"])
X_test = X_test.drop(columns=["target"])
'''
Original Data:
species
Iris-setosa 50
Iris-versicolor 50
Iris-virginica 50
Name: count, dtype: int64
X_train:
target
Iris-versicolor 35
Iris-setosa 35
Iris-virginica 35
Name: count, dtype: int64
X_test:
target
Iris-versicolor 15
Iris-setosa 15
Iris-virginica 15
Name: count, dtype: int64
'''
We now have the same class proportions as in the original data. The model can now be trained and evaluated on these new training and test sets. Notice that the column “target” was added only to monitor the class distribution; it is important to drop it, as done above, before training the model.
Training the SVM Model with different Penalty Terms C
As discussed earlier, the penalty term C controls the trade-off between achieving a low error on the training data and having a simple decision boundary. A lower value results in a more lenient model and hence relatively more misclassifications, and vice versa. We start with a value of C of 0.1.
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix, classification_report
model = SVC(C=0.1)
model.fit(X_train, y_train)
# Predict on the test set so we can evaluate the model
y_pred_test = model.predict(X_test)
# Confusion Matrix
conf_matrix = confusion_matrix(y_test, y_pred_test)
print("Confusion Matrix:")
print(conf_matrix)
# Plot Confusion Matrix
plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix, annot=True, fmt="d", cmap="Blues", xticklabels=y.unique(), yticklabels=y.unique())
plt.title("Confusion Matrix")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()
# Classification Report
print("\nClassification Report:")
print(classification_report(y_test, y_pred_test))
In classification problems such as this, a confusion matrix provides more insight into the model's performance than just printing the accuracy. Below is the confusion matrix of our current model.

As expected, the model still misclassifies one sample as Iris-versicolor when it actually belongs to the class Iris-virginica. On increasing the value of C, we expect a model with better classification ability. We create another model with C=1.
from sklearn import metrics
model2 = SVC(C=1)
model2.fit(X_train, y_train)
y_pred_train_2 = model2.predict(X_train)
y_pred_test_2 = model2.predict(X_test)
print('The training accuracy of the SVM is:', metrics.accuracy_score(y_train, y_pred_train_2))
print('The test accuracy of the SVM is:', metrics.accuracy_score(y_test, y_pred_test_2))
from sklearn.metrics import confusion_matrix, classification_report
# Confusion Matrix
conf_matrix = confusion_matrix(y_test, y_pred_test_2)
print("Confusion Matrix:")
print(conf_matrix)
# Plot Confusion Matrix
plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix, annot=True, fmt="d", cmap="Blues", xticklabels=y.unique(), yticklabels=y.unique())
plt.title("Confusion Matrix")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()
print("\nClassification Report:")
print(classification_report(y_test, y_pred_test_2))

Woohoo! The model is now able to classify every test sample correctly. Yet this model is not ready to be deployed, and the reason is the size of the data set. The original data contains only 150 samples, of which just 105 are used for training after the split. A model trained on so few samples is not reliable; real-world models typically need far more data before they can properly learn the patterns in the features.
Nevertheless, such small data sets are a good stepping stone for getting your hands dirty with machine learning. It is always fun to create insightful plots and get used to machine learning libraries like the scikit-learn used in this blog.
Summary
Support Vector Machines (SVMs) are powerful supervised learning algorithms for classification. Unlike logistic regression, SVMs focus on finding the optimal hyperplane that maximizes the margin between classes, ensuring robustness to new data. There are two types of SVMs to consider: hard margin and soft margin SVM.
In this blog, we also developed an SVM model trained on the Iris dataset. The example covered:
- Stratified Data Splitting: Ensuring proportional class distribution for better model training.
- Impact of C: How different values of the penalty term influence decision boundaries and performance.
- Model Evaluation: Using confusion matrices and classification reports to assess accuracy.
Unlike this exercise, real-world applications require significantly larger datasets for reliable performance. This exercise is an engaging starting point for mastering SVMs and machine learning tools like scikit-learn.
3 Simple Questions for You
I hope you enjoyed this blog on Support Vector Machines. Now that you’ve reached this far, I have 3 simple questions for you:
1. Do you enjoy memes?
2. Did you enjoy this blog?
3. Do you like pizzas?
If any of the answers is a yes, go ahead and immediately follow me on my Instagram (@machinelearningsite) where I post programming and machine learning memes along with useful tips and tricks. Why don’t you just go and check it yourself? 😉