Diabetes Prediction using Machine Learning

With the rise of data-driven technologies, machine learning has emerged as a powerful tool to predict the risk of diabetes based on patient data such as age, BMI, blood pressure, glucose levels, and other health indicators. By leveraging historical medical datasets, machine learning algorithms can identify complex patterns and correlations that are often missed by traditional statistical methods.

In this tutorial, we will explore how to build a diabetes prediction model using Python and popular machine learning libraries. From loading the dataset to selecting the right algorithm and evaluating model performance, you will learn step by step how to create a predictive system that can help in early diagnosis. Diabetes prediction is a popular machine learning exercise. The dataset, which has over 750 samples and 8 features, can be found on Kaggle. This is a binary classification problem where the output is either 0 or 1 (0 means the patient does not have diabetes and 1 means the patient does).

Whether you are a beginner in machine learning or a data science enthusiast looking to improve your skills in Pandas and Python programming, this guide provides a practical, hands-on approach to working with these libraries.

Creating the Machine Learning Model for Diabetes Prediction

First, we will start by importing the necessary libraries that will be useful to build and train the model.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
import seaborn as sns
from sklearn import metrics

Next, we will import the dataset using Pandas. Pandas provides a function called read_csv() that makes importing the data easy, and head() displays the first 5 rows of the imported data.

DATA_PATH = '/content/drive/MyDrive/Colab_Notebooks/Machine-Learning-Site/Pandas/diabetes.csv'

data = pd.read_csv(DATA_PATH)
data.head()

Figure 1: Visualizing imported data

The column ‘Outcome’, as seen in Figure 1, holds the label for each row. The remaining columns are the features: Pregnancies, Glucose, Blood Pressure, Skin Thickness, Insulin, BMI, Diabetes Pedigree Function, and Age.

It is important to know the number of features and the number of samples available to train our model.

data.shape

Figure 2: Shape of data

Machine learning models calculate the loss by measuring the difference between the hypothesis (the model's prediction) and the label. Hence, it is important to separate the labels from the features.

labels = data['Outcome']
data.drop(['Outcome'], axis=1, inplace=True)
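Note that drop() with inplace=True modifies data itself, so the ‘Outcome’ column is no longer available afterwards. As a minimal sketch of an alternative, the snippet below keeps the original frame intact and returns the features as a new frame. The names toy, toy_labels, and toy_features and the three rows of values are illustrative assumptions, not part of the tutorial's dataset loading.

```python
import pandas as pd

# Tiny illustrative frame with two feature columns and the label column.
toy = pd.DataFrame(
    {
        "Glucose": [148, 85, 183],
        "BMI": [33.6, 26.6, 23.3],
        "Outcome": [1, 0, 1],
    }
)

# Select the label column as a Series.
toy_labels = toy["Outcome"]

# drop() without inplace=True returns a new frame and leaves `toy` unchanged.
toy_features = toy.drop(columns=["Outcome"])

print(toy_features.columns.tolist())
print(toy.columns.tolist())
```

Keeping the original frame around can be convenient when a notebook cell is re-run, since dropping a column that is already gone raises a KeyError.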

Further, the data is to be divided into a training set and a test set. The training set is the one on which the model is trained. The trained model then predicts the outcome on the unseen data of the test set. Comparing the model's performance on the seen and the unseen data gives an estimate of how well the model generalizes.

Those who are familiar with NumPy could argue that it could also be used to parse the data. However, Pandas is more convenient for labeled, tabular data such as this CSV file, and it remains efficient when the dataset contains tens of thousands of samples. Processing the data with Pandas can also save considerable time compared with plain Python loops.

train_x, test_x, train_y, test_y = train_test_split(data, labels, test_size=0.3)
train_x.shape, train_y.shape

The argument test_size takes the fraction of the data that makes up the test set. Here, the data is divided such that 30% makes up the test set and the remaining 70% makes up the training set. Note that this is a simple random split. There are other ways to split the data, such as a stratified split, which keeps the class proportions the same in both sets.
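For readers curious about the stratified split, here is a minimal sketch. It uses a synthetic stand-in for the diabetes data (the frame demo_x, the labels demo_y, and the 35% positive rate are assumptions made for illustration). The stratify and random_state arguments are part of scikit-learn's train_test_split.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 200 rows, 3 features, 35% positives.
rng = np.random.default_rng(0)
demo_x = pd.DataFrame(rng.normal(size=(200, 3)), columns=["f1", "f2", "f3"])
demo_y = pd.Series([1] * 70 + [0] * 130, name="Outcome")

# stratify=demo_y keeps the share of each class the same in both splits.
# random_state fixes the shuffle so the split is reproducible.
tr_x, te_x, tr_y, te_y = train_test_split(
    demo_x, demo_y, test_size=0.3, stratify=demo_y, random_state=42
)

# The fraction of positive labels should be nearly identical in both sets.
print(tr_y.mean(), te_y.mean())
```

Stratifying matters most when one class is much rarer than the other, because a plain random split can leave the test set with too few positive cases.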

Defining the machine learning model

Next, we will define our prediction model. Logistic regression is an appropriate model for this exercise as it is a binary classification problem. Logistic regression applies a sigmoid function, so its output lies between 0 and 1 and can be read as the probability of the positive class. By default, the threshold is set to 0.5. This means that the model will return 1 if the hypothesis is above the threshold value, and 0 otherwise. Additionally, it is important to let the solver run for enough iterations to converge, which is why max_iter is raised from its default of 100.

model = LogisticRegression(max_iter=5000)
model.fit(train_x, train_y)
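To make the sigmoid and the 0.5 threshold concrete, here is a small sketch in NumPy. The helper sigmoid() and the example scores are illustrative assumptions. This is not scikit-learn's internal code, only the same idea written out by hand.

```python
import numpy as np


def sigmoid(z):
    """Map any real-valued score to the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


# Hypothetical raw scores (the weighted sum of features plus the intercept).
scores = np.array([-2.0, 0.0, 0.5, 3.0])

# The sigmoid turns each score into a probability of class 1.
probs = sigmoid(scores)

# Predict 1 only when the probability is above the 0.5 threshold.
preds = (probs > 0.5).astype(int)

print(probs)
print(preds)
```

A score of exactly zero maps to a probability of exactly 0.5, which is not above the threshold and is therefore assigned to class 0 in this sketch.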

Now, it is time to test the model on unseen data.

score = model.score(test_x, test_y)
print(score)

Figure 3: Model accuracy

As seen in Figure 3, the model scored 77% accuracy on the test set, which seems acceptable given that there was no data preprocessing before training the model. Because the split is random, the exact score varies from run to run.
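Accuracy alone does not show which kind of mistakes the model makes. The sklearn.metrics module imported earlier can break the result down further. The sketch below runs on synthetic data from make_classification so that it runs without the CSV file. The data, the class weights, and the names X, y, clf, and pred are assumptions, and the numbers it prints will not match the real dataset.

```python
from sklearn import metrics
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary data shaped like the tutorial's dataset (768 rows, 8 features).
X, y = make_classification(
    n_samples=768, n_features=8, weights=[0.65, 0.35], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

clf = LogisticRegression(max_iter=5000).fit(X_tr, y_tr)
pred = clf.predict(X_te)

# Accuracy on seen versus unseen data hints at how well the model generalizes.
train_acc = clf.score(X_tr, y_tr)
test_acc = clf.score(X_te, y_te)
print(f"train accuracy: {train_acc:.3f}, test accuracy: {test_acc:.3f}")

# Rows are true classes, columns are predicted classes.
cm = metrics.confusion_matrix(y_te, pred)
print(cm)

# Per-class precision, recall, and F1 score.
print(metrics.classification_report(y_te, pred))
```

In a medical setting, recall for the positive class is often worth watching closely, since a missed diabetic patient is usually a costlier error than a false alarm.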

Summary

To summarize, we had a glimpse of using a logistic regression model to predict whether a patient has diabetes. The logistic regression model is an adequate model for classification problems like this one. However, it would be interesting to compare the results with other classification models like decision trees and k-nearest neighbors.
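As a starting point for that comparison, the following sketch scores the three model families with 5-fold cross-validation. It again uses synthetic data from make_classification in place of the CSV file, and the hyperparameters (n_neighbors=5, default tree settings) are assumptions chosen for illustration rather than tuned values.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in with the same shape as the tutorial's dataset.
X, y = make_classification(n_samples=768, n_features=8, random_state=0)

candidates = {
    "logistic regression": LogisticRegression(max_iter=5000),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "k-nearest neighbors": KNeighborsClassifier(n_neighbors=5),
}

# cross_val_score returns one accuracy per fold; the mean summarizes them.
results = {
    name: cross_val_score(estimator, X, y, cv=5).mean()
    for name, estimator in candidates.items()
}

for name, acc in results.items():
    print(f"{name}: {acc:.3f}")
```

To try this on the real data, replace X and y with the features and labels prepared earlier in the tutorial. Cross-validation gives a steadier estimate than a single train/test split because every sample is used for testing exactly once.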
