Diabetes Prediction using Machine Learning

Diabetes prediction is a popular introductory exercise in machine learning. The dataset, which can be found on Kaggle, has over 750 samples and 8 features. It is a binary classification problem: the output is 0 if the patient does not have diabetes and 1 otherwise.

In this exercise, we will use Pandas to import and view the dataset and to separate the features from the labels. The data will then be split into training and test sets. Finally, a logistic regression model imported from the ‘sklearn’ library will be trained on the training set, and its accuracy will be measured on the test set.

Creating the Machine Learning Model for Diabetes Prediction

First, we import the libraries needed to build and train the model.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
import seaborn as sns
from sklearn import metrics

Subsequently, we will import the dataset using Pandas. Pandas provides a function called read_csv() that makes importing the data an easy job. head() displays the first 5 rows of the imported data.

DATA_PATH = '/content/drive/MyDrive/Colab_Notebooks/Machine-Learning-Site/Pandas/diabetes.csv'

data = pd.read_csv(DATA_PATH)
data.head()

Figure 1: Visualizing imported data

The column ‘Outcome’, as seen in Figure 1, holds the label for each sample. The remaining columns are the features: Pregnancy, Glucose, Blood Pressure, Skin Thickness, Insulin, BMI, Diabetes Pedigree Function, and Age.

It is important to know the number of features and the number of samples available to train our model.

data.shape

Figure 2: Shape of data

A machine learning model is trained by minimizing a loss that measures how far its predictions (the hypothesis) are from the labels. Hence, the labels must be separated from the features.

labels = data['Outcome']
data.drop(['Outcome'], axis=1, inplace=True)
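To make the idea of a loss concrete, here is a minimal sketch of binary cross-entropy (log loss), the loss usually associated with logistic regression. The function name binary_cross_entropy and the toy labels and probabilities are illustrative assumptions and are not part of the original exercise.

```python
import numpy as np

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean log loss between 0/1 labels and predicted probabilities."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip probabilities so that log(0) never occurs.
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1 - eps)
    per_sample = y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob)
    return float(-np.mean(per_sample))

# Hypothetical labels and two sets of predicted probabilities.
toy_labels = [1, 0, 1, 0]
good = binary_cross_entropy(toy_labels, [0.9, 0.1, 0.8, 0.2])  # close to the labels
bad = binary_cross_entropy(toy_labels, [0.4, 0.6, 0.3, 0.7])   # far from the labels
print(good, bad)
```

Predictions that sit closer to the labels give a smaller loss, which is the quantity the training procedure tries to reduce.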

Next, the data is divided into a training set and a test set. The model is trained on the training set and then predicts the outcome on the unseen data of the test set. Comparing the model's performance on the seen and the unseen data indicates how well it generalizes, and the accuracy on the test set serves as an estimate of its performance on new data.

train_x, test_x, train_y, test_y = train_test_split(data, labels, test_size=0.3)
train_x.shape, train_y.shape

The argument test_size takes in the ratio of the data that makes up the test set. Here, data is divided such that 30% makes up the test set and the remaining 70% makes up the training set.
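The split above is random, so it changes from run to run. As a hedged sketch, the snippet below shows two optional arguments of train_test_split, random_state and stratify, on a small synthetic table. The synthetic data, the column names f1 to f3, and the seed values are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real dataset: 100 rows, 3 features, 35% positives.
rng = np.random.default_rng(0)
toy_features = pd.DataFrame(rng.normal(size=(100, 3)), columns=["f1", "f2", "f3"])
toy_labels = pd.Series([0] * 65 + [1] * 35, name="Outcome")

# random_state fixes the shuffle; stratify keeps the class ratio similar in both sets.
tr_x, te_x, tr_y, te_y = train_test_split(
    toy_features, toy_labels, test_size=0.3, random_state=42, stratify=toy_labels
)

print(tr_x.shape, te_x.shape)    # 70 training rows and 30 test rows
print(tr_y.mean(), te_y.mean())  # share of positive labels in each set
```

Fixing random_state makes the reported accuracy reproducible, and stratifying can help when one class is much rarer than the other.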

Defining the machine learning model

Next, we will define our prediction model. Logistic regression is an appropriate model for this exercise because it is a binary classification problem. It passes a linear combination of the features through a sigmoid function, so the output lies between 0 and 1 and can be read as a probability. By default, the threshold is set to 0.5. This means that the model returns 1 if the hypothesis is above the threshold value and 0 otherwise. Additionally, the solver needs enough iterations to converge. Here max_iter is set to 5000, well above scikit-learn's default of 100, which may not be enough when the features are not scaled.

model = LogisticRegression(max_iter=5000)
model.fit(train_x, train_y)
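The sigmoid and the 0.5 threshold described above can be checked by hand. The following sketch fits a model on synthetic data generated with make_classification (an assumption, used so that the snippet runs without the CSV file) and compares a manual sigmoid-plus-threshold computation with the model's own predict_proba() and predict(). The names clf, sigmoid, and manual_preds are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary problem with 8 features, standing in for the diabetes data.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
clf = LogisticRegression(max_iter=5000).fit(X, y)

def sigmoid(z):
    """Map any real-valued score to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

z = clf.decision_function(X)              # linear score: weights . features + bias
probs = sigmoid(z)                        # probability of class 1
manual_preds = (probs > 0.5).astype(int)  # apply the 0.5 threshold

probs_match = np.allclose(probs, clf.predict_proba(X)[:, 1])
preds_match = np.array_equal(manual_preds, clf.predict(X))
print(probs_match, preds_match)
```

If a different trade-off between false positives and false negatives is needed, the same probabilities can be compared against a threshold other than 0.5.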

Now, it is time to test the model on unseen data.

score = model.score(test_x, test_y)
print(score)

Figure 3: Model accuracy

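As seen in the figure above, the model scored 77% on the test set, which seems acceptable given that no data preprocessing was done before training the model. Because train_test_split was called without a fixed random_state, the exact score will vary from run to run.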
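For a classifier, score() returns the accuracy, that is, the fraction of test samples predicted correctly. The sketch below recomputes it by hand and also builds a confusion matrix with sklearn.metrics, which shows where the errors fall. It uses synthetic data from make_classification as an assumption, so its numbers are not the ones reported in Figure 3.

```python
import numpy as np
from sklearn import metrics
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the diabetes data.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

clf = LogisticRegression(max_iter=5000).fit(X_tr, y_tr)
preds = clf.predict(X_te)

# Accuracy is the share of predictions that equal the true label.
manual_accuracy = float(np.mean(preds == y_te))
builtin_accuracy = clf.score(X_te, y_te)

# Rows are true classes, columns are predicted classes.
cm = metrics.confusion_matrix(y_te, preds)

print(manual_accuracy, builtin_accuracy)
print(cm)
```

The diagonal of the confusion matrix counts the correct predictions, so dividing its sum by the total number of test samples gives the accuracy again.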

Summary

To summarize, we had a glimpse of using a logistic regression model to predict if a patient is suffering from diabetes. The logistic regression model is an adequate model for classification problems like this one. However, it would be interesting to compare the results with other classification models like decision trees and k-nearest neighbors.
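As a starting point for that comparison, the sketch below trains the three model types on the same split and reports the test accuracy of each. It is only an outline under stated assumptions: the data is synthetic (make_classification), the hyperparameters are library defaults apart from max_iter and the fixed seeds, and the resulting scores say nothing about how the models would rank on the diabetes dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the diabetes data.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=5000),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "k_nearest_neighbors": KNeighborsClassifier(),
}

# Fit every model on the same training set and score it on the same test set.
scores = {}
for name, estimator in candidates.items():
    estimator.fit(X_tr, y_tr)
    scores[name] = estimator.score(X_te, y_te)

for name, accuracy in scores.items():
    print(f"{name}: {accuracy:.3f}")
```

Using one shared split keeps the comparison fair, although a single split is noisy and cross-validation would give a steadier estimate.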
