Machine Learning Pipeline: The Ultimate Guide to Automate Your Work using scikit-learn

Building a machine learning model involves a sequence of tasks from data cleaning to feature engineering. In scenarios where these steps need to be repeated, for instance, when the data is updated regularly, the process can become complex and time-consuming. This is where a machine learning pipeline becomes useful.

A machine learning pipeline allows you to chain multiple data processing and transformation steps together into a single, unified object. This streamlines your workflow by treating all these steps as one unit. Once the pipeline is defined, the data is fed to it and the pipeline takes care of preprocessing steps that would otherwise consume a significant amount of time: dealing with null values, standardizing, one-hot encoding, and deriving new attributes from the existing ones. This keeps the code readable and simple.
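
To make the idea concrete, here is a minimal sketch of what such a chained object looks like in scikit-learn (the step names and the scaler are only illustrative; the rest of this blog builds a more complete version of exactly this):

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# two preprocessing steps chained into one object; calling fit_transform()
# on the pipeline runs them in order on the data
toy_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])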

In this blog, we will build such a pipeline using the scikit-learn library in Python. Not only will you gain a proper understanding of this concept, but you will also be able to implement it yourself in your machine learning projects. Feel free to code along. Let’s get started!

Importing and Understanding the Data

Understanding the involved data is one of the most crucial steps when building a model. This gives us an idea of the necessary steps that should be taken to maximize the model performance. For this blog, we will import data on house price prediction. The data contains the following features (click here to download this data):

  • longitude
  • latitude
  • housing median age
  • total rooms
  • total bedrooms
  • population
  • households
  • median income
  • ocean proximity

Our target attribute/feature in this case is median house value.

After importing, we have a quick view of its description to understand the total number of features/attributes and their statistical data.

import numpy as np
import pandas as pd

# HOUSING_PATH is the directory where the downloaded housing.csv file is stored
housing_data = pd.read_csv(HOUSING_PATH + "/housing.csv")
housing_data.describe()

[Output of housing_data.describe(): summary statistics (count, mean, std, min, quartiles, max) for each numerical column]

These statistics, such as the mean, the maximum value, and the standard deviation, are important for understanding the features. We will keep that topic for another blog. We aim to build a pipeline in this one, so let’s proceed.

Issue 1

We notice that the dataset contains 20,640 samples of housing information. If you look carefully, the total_bedrooms column suffers from missing data. Let us tag this as issue 1.
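
A quick way to confirm this (a small illustrative check, not part of the final pipeline) is to count the null entries in each column:

# number of missing entries in each column; total_bedrooms is the one with gaps
housing_data.isnull().sum()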

Issue 2

Next, among these features, there is one that contains categorical values of string type: ocean_proximity.

housing_data["ocean_proximity"].value_counts()

[Output of value_counts(): the number of samples in each ocean_proximity category]

A machine learning model only understands numbers. Hence, this data should be converted into numerical values using one-hot encoding. This will be our issue 2.
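
Just to see what this conversion produces, here is a small standalone sketch using scikit-learn's OneHotEncoder; in the final version this step will live inside the pipeline itself:

from sklearn.preprocessing import OneHotEncoder

# encode the single categorical column into one binary column per category
encoder = OneHotEncoder()
ocean_proximity_1hot = encoder.fit_transform(housing_data[["ocean_proximity"]])
encoder.categories_  # the categories found in the data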

Issue 3

Sometimes, it is possible to extract more information from a dataset than is directly available. In this dataset, we have the attributes total_rooms and households, but there is no information about the number of rooms per household. This can be derived from the two available attributes. Our issue 3 is now identified.
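
In plain pandas, this derived feature is a one-liner (shown purely for illustration; later a custom transformer will do this inside the pipeline):

# rooms per household derived from two existing columns
rooms_per_household = housing_data["total_rooms"] / housing_data["households"]
rooms_per_household.head()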

To summarize our issues: issue 1 is about missing values, issue 2 is about converting the string values into numeric ones using one-hot encoding, and issue 3 involves deriving more attributes from the existing ones. Now that the preprocessing steps are determined, the next step is to split the dataset into training and test sets.

Splitting the Data

The general idea behind splitting the data is to train the model on one set of data (the training set) and evaluate its performance on an unseen set (the test set). As a usual practice, data is randomly split into training and test sets. However, this involves the risk of an unequal distribution of data in either of the sets. To understand this issue, assume you conduct a survey where 70% of the data is from people above 40 years of age and the remaining 30% from people below 40. During a random split, there is a possibility that the latter 30% ends up entirely in the training set and the test set is full of data from people above 40.

Thus, an appropriate split would maintain the distribution in both sets, that is, both sets consist of data where 70% is from people above 40 years of age and the remaining 30% from people below 40. This is called stratification and the resulting data is called stratified data. Take a look at my blog How to sample data *CORRECTLY* into test set? for more information on the subject and its implementation in Python.

from sklearn.model_selection import StratifiedShuffleSplit

# income_category is a helper column used only for stratification; here it is derived from
# median_income (one possible binning; see the linked blog on stratified sampling for details)
housing_data["income_category"] = pd.cut(housing_data["median_income"],
                                         bins=[0.0, 1.5, 3.0, 4.5, 6.0, np.inf],
                                         labels=[1, 2, 3, 4, 5])

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing_data, housing_data["income_category"]):
    stratified_training_set = housing_data.loc[train_index]
    stratified_test_set = housing_data.loc[test_index]

Ultimately, this results in two stratified datasets, namely a stratified training set and a stratified test set. We keep the test set aside for now. Remember that the stratified training set still contains the target label median_house_value, so this will be dropped to separate the labels from the features, along with the income_category helper column that was only needed for stratification. Moreover, the feature ocean_proximity is to be one-hot encoded in the pipeline, so this column too will be separated from the numerical features in the dataset.

# drop the target label and the income_category helper column, then separate the categorical feature
training_data = stratified_training_set.drop(["median_house_value", "income_category"], axis=1)
training_data_numerical = training_data.drop(["ocean_proximity"], axis=1)
training_data_categorical = training_data[["ocean_proximity"]]

All these splits are a bit tricky to understand, so below is a tree diagram that summarizes them:

[Tree diagram: the housing data is split into a stratified training set and a stratified test set; the training set is further separated into the numerical features, the categorical feature ocean_proximity, and the target label median_house_value]

Solution to the Issues

Now that the necessary splits have taken place, we will address the preprocessing issues that we noted earlier:

  1. Issue 1: To deal with missing values, there are multiple approaches to choose from. For instance, the values can be filled with the median of the column or the mean of the column, or such samples can be eliminated completely. However, elimination only reduces the amount of data available for training the model. Hence, the strategy we choose is to fill them with the median value. scikit-learn provides a class SimpleImputer that takes care of this and saves us the trouble of writing extra code. All we need to do is import it and include it in the pipeline, which we will see in a moment.
  2. Issue 2: The step of converting the string categories into numerical values by one-hot encoding can get tedious if done manually. Similar to the previous case, sklearn takes care of one-hot encoding through OneHotEncoder. For this too, all we need to do is import it and use it directly in the pipeline.
  3. Issue 3: Deriving and adding more features to the data has to be done manually. For this, we create a class called CombinedAttributesGenerator.
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.base import BaseEstimator, TransformerMixin

class CombinedAttributesGenerator(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass

    def fit(self, X, y=None):
        return self
    
    def transform(self, X, y=None):
        # column indices in the numerical training data:
        # 3 = total_rooms, 4 = total_bedrooms, 5 = population, 6 = households
        rooms_per_household = X[:, 3]/X[:, 6]
        rooms_per_bedroom = X[:, 3]/X[:, 4]
        population_per_household = X[:, 5]/X[:, 6]

        return np.c_[X, rooms_per_household, population_per_household, rooms_per_bedroom]

The transformation can be written as a standalone function or as a class built on the base classes TransformerMixin and BaseEstimator, as seen in the code above. The benefit of TransformerMixin is that using it as a base class automatically provides the method fit_transform() without us writing it explicitly. As for BaseEstimator, it provides two extra methods, namely get_params() and set_params(), that come in handy while tuning hyperparameters. Using these base classes is not a necessity, but it is good practice due to the mentioned benefits.
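
As a quick sanity check (purely illustrative), the custom transformer can be used on its own; note that fit_transform() is available even though we never wrote it, courtesy of TransformerMixin:

# run the transformer on the numerical features; three derived columns are appended
attributes_generator = CombinedAttributesGenerator()
extra_features = attributes_generator.fit_transform(training_data_numerical.values)
extra_features.shape  # (number of samples, original columns + 3)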

Now that these issues are solved, they will become the building blocks of the machine learning pipeline.

Building the Machine Learning Pipeline

Recall that our stratified training data was split into a numerical and a categorical set. Hence, we first build the respective pipelines and ultimately unify them.

from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

numerical_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("atrributes_adder", CombinedAttributesGenerator()),
])

As seen in the code, issue 1 is solved by using SimpleImputer() directly in the pipeline, as mentioned above. The strategy chosen here to fill up the missing values is the median of the existing data. This is followed by the CombinedAttributesGenerator() class that adds some more features to the data. The pipeline for numerical data is now complete. Time for the categorical one!
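
As an optional check, the numerical pipeline can already be run on its own on the numerical features to verify that both steps are wired up correctly:

# median-imputes the missing values, then appends the three derived columns
numerical_features_prepared = numerical_pipeline.fit_transform(training_data_numerical)
numerical_features_prepared.shape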

The categorical part does not need a Pipeline of its own, as all it requires is one-hot encoding using OneHotEncoder() from sklearn. This transformer can simply be imported and combined with the numerical pipeline.

To compose the full pipeline and carry out separate transformations on numerical and categorical features, we use ColumnTransformer from sklearn:

from sklearn.compose import ColumnTransformer

numerical_attributes = list(training_data_numerical)
categorical_attributes = ["ocean_proximity"]

full_pipeline = ColumnTransformer([
    ("numerical", numerical_pipeline, numerical_attributes),
    ("categorical", OneHotEncoder(),categorical_attributes),
])

So what is happening here: numerical_attributes is a list of the columns from the dataset that contain numerical features, and categorical_attributes is a list containing the names of the categorical features (“ocean_proximity”) in our data. The full_pipeline object is created from ColumnTransformer, which takes a list of tuples defining the transformations for different feature groups. Each tuple has three elements:

  • Transformer Name (str): A user-defined name for the transformation step (e.g., “numerical”, “categorical”).
  • Transformer Object: The actual scikit-learn transformer object that performs the transformation (e.g., numerical_pipeline for numerical features as we defined above, OneHotEncoder() for categorical features).
  • Feature List (list): A list containing the names or indices of the features to which the corresponding transformer should be applied (e.g., numerical_attributes for numerical pipeline, categorical_attributes for one-hot encoding).
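
One detail worth noting: by default, ColumnTransformer drops any column that is not listed in one of these tuples. If you want such columns to be kept untouched instead, you can pass remainder="passthrough" when constructing it.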

That’s it! Our machine learning pipeline is now complete. We can now directly feed in our training data, which still contains missing values and categorical data of string type.

training_data_prepared = full_pipeline.fit_transform(training_data)

When an update is available with newly collected samples, the data just needs to be fed into the pipeline (after stratified splitting, of course) and it will take care of the rest.
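
As a rough sketch of what this looks like for the test set we kept aside (assuming the same columns are dropped as for the training set), note that we call transform() rather than fit_transform(), so the medians and categories learned from the training data are reused:

# prepare the held-out test set with the already-fitted pipeline
test_data = stratified_test_set.drop(["median_house_value", "income_category"], axis=1)
test_data_prepared = full_pipeline.transform(test_data)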

Summary

In this blog, we had a look at how to build a machine learning pipeline using the scikit-learn library in Python. Creating a pipeline allows you to automate the preprocessing steps. Whenever the data is updated with new information, it just needs to be passed as an argument to the pipeline, saving you time and extra work.

Try this with some new preprocessing steps or new data. Feel free to post the results as an image on Instagram and tag me @machinelearningsite. If you want to learn more such amazing concepts and strategies of machine learning, grab a copy of Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow. From simple linear regression to convolutional networks, this book covers the important concepts that help you learn and have fun in the machine learning world.

If you are still reading, it means this blog provided you with some useful information and you enjoyed it. If so, please do support me by following me on social media:

Also subscribe to my monthly newsletter so you never miss a blog.
