Gradient Boosting Explained: How to Supercharge Your Machine Learning Model Using XGBoost

Ever felt like your model’s nearly perfect but needs that little extra “boost”? It’s not bad enough to throw in the trash, but also not good enough to get the green light for deployment. I mean, you can just deploy it anyway if it is the last day of your internship. But for those who’d still like to keep their jobs, I’ve got exactly what you need for your “almost good” machine learning models: Gradient Boosting.

Gradient Boosting is the salt to your machine learning recipe that will transform it from “meh” to wow! If you haven’t given it a shot yet, you are missing out on a powerful method that can seriously level up your machine learning game.

In this post, we will break down Gradient Boosting: what it is, how it works, and why it’s such a popular choice for creating killer models. Finally, we will build and train a hybrid model using XGBoost and PyTorch.

Machine learning models improve with more data. Your feed improves with more of my posts. Follow @machinelearningsite for programming memes, code snippets, and ML tricks—no overfitting, just pure value.

“Hi, I’m Gradient Boosting. Your model must be glad to meet me!”

Gradient boosting is an ensemble learning technique that combines many weak learners, typically shallow decision trees, into one strong predictive model. The trees are trained one after another, and each new tree is fit to the errors of the ensemble built so far. More precisely, each tree is trained to predict the negative gradient of a loss function with respect to the current predictions (for squared error, that is simply the residual), so adding it, scaled by a learning rate, amounts to one step of gradient descent on the loss. This process continues until a set number of trees is built or the validation error stops improving. The result is a powerful, accurate model that can handle complex data patterns.
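For squared-error loss, the link between “correcting errors” and gradient descent can be written out in a few lines. The notation here ($F_m$ for the ensemble after $m$ trees, $h_m$ for the $m$-th tree, $\eta$ for the learning rate) is just shorthand for this post, not something the libraries expose:

```latex
\begin{aligned}
L(y, F) &= \tfrac{1}{2}\,\bigl(y - F(x)\bigr)^2 \\
-\frac{\partial L}{\partial F(x)} &= y - F(x) \quad \text{(the residual)} \\
F_m(x) &= F_{m-1}(x) + \eta\, h_m(x), \qquad h_m \text{ fit to } y - F_{m-1}(x)
\end{aligned}
```

With a different loss the second line changes, but the update in the third line keeps the same shape: fit a tree to the negative gradient, then take a small step.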

Think of gradient boosting like a team of people trying to solve a puzzle. The first person (the first decision tree) tries their best, but they make some mistakes. The second person comes in and focuses on fixing the pieces that were placed wrong, improving the solution. The third person does the same, concentrating on the previous errors. Each person builds upon the previous effort, gradually improving the solution. By the end, the combined team creates a much better and more accurate puzzle than any one person could have done alone.

How, and More Importantly, Why Does It Work?

As mentioned already, gradient boosting focuses on correcting mistakes. Instead of training many models independently and averaging their predictions (as bagging methods like random forests do), Gradient Boosting keeps refining a single running prediction by targeting whatever error is still left. It’s like having an artist who keeps making tiny tweaks to a masterpiece, slowly making it perfect. The model’s “weak learners” (usually decision trees) are like apprentices learning from earlier mistakes, and with each new apprentice, the model gets better. Here is a summary of the four steps that gradient boosting follows:

Start Simple: The process kicks off with a simple model that makes a constant prediction (like the average of your target values).
Find the Mistakes: The next model in the sequence is trained on the residuals, the differences between the true values and the current predictions.
Keep Improving: This step repeats, with each new model correcting the errors left by the ensemble so far, until the mistakes are minimal or a set number of trees is reached.
Final Prediction: In the end, the model adds up the initial prediction and the learning-rate-scaled outputs of all the trees, and boom, you’ve got your supercharged prediction.
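To make the four steps concrete, here is a minimal from-scratch sketch in plain Python. It assumes squared-error loss, a single input feature, and one-split “decision stumps” as the weak learners. The function names (fit_stump, gradient_boost, and so on) and the toy data are made up for illustration; real libraries use deeper trees and many optimizations.

```python
# Minimal gradient boosting for regression with squared-error loss.
# The weak learner is a one-split "decision stump" (an illustrative choice).

def fit_stump(xs, targets):
    """Find the threshold on a single feature that minimizes squared error."""
    best = None
    candidates = sorted(set(xs))
    for lo, hi in zip(candidates, candidates[1:]):
        threshold = (lo + hi) / 2
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((t - left_mean) ** 2 for t in left)
                 + sum((t - right_mean) ** 2 for t in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return threshold, left_mean, right_mean


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


def gradient_boost(xs, ys, n_trees=20, learning_rate=0.3):
    # Step 1: start simple with the mean of the targets.
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps, history = [], [mse(ys, preds)]
    for _ in range(n_trees):
        # Step 2: residuals are the negative gradient of squared-error loss.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        # Step 3: nudge the running prediction toward the targets.
        preds = [p + learning_rate * predict_stump(stump, x)
                 for p, x in zip(preds, xs)]
        history.append(mse(ys, preds))
    return base, stumps, history


def predict(base, stumps, x, learning_rate=0.3):
    # Step 4: final prediction = base + scaled sum of all stump outputs.
    # learning_rate must match the value used in gradient_boost.
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [1.2, 1.9, 3.1, 3.9, 8.2, 8.8, 10.1, 11.0]
base, stumps, history = gradient_boost(xs, ys)
print(f"MSE before boosting: {history[0]:.3f}, after: {history[-1]:.3f}")
```

The history list records the training error after every round, so you can watch it shrink as stumps are added. A smaller learning_rate makes each step more cautious and usually needs more trees.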

Too many Boosts! Which one to choose?

Gradient Boosting comes in different variants, each one having its own specific use-case:

Classic Gradient Boosting Machine (GBM): The OG of boosting algorithms, where trees are added sequentially to correct the residual errors of the previous trees. Each tree is fit to the pseudo-residuals, that is, the negative gradient of a specified loss function (e.g., mean squared error) evaluated at the current predictions. Though slower and less optimized than newer variants, it provides a clear understanding of the boosting process.
XGBoost: This is the one you’ll see most data scientists raving about. It’s super fast and handles large datasets like a pro. XGBoost builds on classic GBM with several key improvements. It uses a regularized objective function, combining the loss function with L1 (lasso) and L2 (ridge) penalties on the leaf weights to prevent overfitting. It also handles sparse data and missing values natively by learning a default direction for them at each split. Trees are grown up to max_depth and then pruned back, removing splits whose gain does not exceed a threshold (gamma). Finally, it parallelizes split finding within each tree for faster training.
LightGBM: Think of this as the speedster of Gradient Boosting. It’s lightning fast, especially with huge datasets, thanks to its histogram-based approach: continuous features are bucketed into discrete bins, which reduces memory usage and speeds up split finding. (XGBoost offers a histogram-based tree method as well.) Instead of growing trees level by level, LightGBM uses a “leaf-wise” growth strategy, always splitting the leaf with the largest loss reduction. For the same number of leaves this often gives lower loss, but the trees can become deep and overfit small datasets unless num_leaves or max_depth is limited. It’s especially effective for large datasets with high-dimensional features.
CatBoost: If you’ve got lots of categorical data, CatBoost is your best friend. It handles categorical features natively, so you don’t need to manually encode them: it converts categories to numbers using target statistics computed during training. It uses ordered boosting, in which the statistics and residuals for each example are computed only from examples that come before it in a random permutation; this reduces target leakage and therefore overfitting. CatBoost also builds symmetric (oblivious) trees, which use the same split condition for every node at a given depth, speeding up predictions and acting as a form of regularization.
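To see what “regularized objective” means in practice for XGBoost, here is a small plain-Python sketch of the split gain formula as presented in XGBoost’s “Introduction to Boosted Trees” documentation. It covers only the L2 penalty (reg_lambda) and the split threshold (gamma); the L1 term is left out for brevity. The helper names and toy numbers are made up for illustration, and the gradients assume squared-error loss (gradient = prediction minus label, hessian = 1).

```python
def leaf_score(grad_sum, hess_sum, reg_lambda):
    """Structure score G^2 / (H + lambda) for one leaf."""
    return grad_sum ** 2 / (hess_sum + reg_lambda)


def leaf_weight(grad_sum, hess_sum, reg_lambda):
    """Optimal leaf value w* = -G / (H + lambda)."""
    return -grad_sum / (hess_sum + reg_lambda)


def split_gain(grads, hessians, left_idx, right_idx, reg_lambda=1.0, gamma=0.0):
    """Gain of splitting one node into a left and a right leaf."""
    g_left = sum(grads[i] for i in left_idx)
    h_left = sum(hessians[i] for i in left_idx)
    g_right = sum(grads[i] for i in right_idx)
    h_right = sum(hessians[i] for i in right_idx)
    gain = 0.5 * (
        leaf_score(g_left, h_left, reg_lambda)
        + leaf_score(g_right, h_right, reg_lambda)
        - leaf_score(g_left + g_right, h_left + h_right, reg_lambda)
    )
    return gain - gamma  # gamma is the price of adding one more leaf


# Toy node: two low targets and two high targets, all currently predicted as 3.0
y_true = [1.0, 1.2, 5.0, 5.3]
y_pred = [3.0, 3.0, 3.0, 3.0]
grads = [p - y for p, y in zip(y_pred, y_true)]
hessians = [1.0] * len(y_true)
left, right = [0, 1], [2, 3]

gain_default = split_gain(grads, hessians, left, right)
gain_strong_l2 = split_gain(grads, hessians, left, right, reg_lambda=10.0)
gain_high_gamma = split_gain(grads, hessians, left, right, gamma=10.0)
left_weight = leaf_weight(sum(grads[i] for i in left), 2.0, reg_lambda=1.0)

print(gain_default, gain_strong_l2, gain_high_gamma, left_weight)
```

A larger reg_lambda shrinks both the gain and the leaf weights, and a large enough gamma turns the gain negative, which is when the split gets pruned.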

With this much information on the topic, your fingertips should be itching to hit the keyboard and try gradient boosting out. So let’s jump into the code.

Gradient Boosting in Action: Let’s Code!

We’ll create a hybrid model that processes tabular data with gradient boosting and passes the predictions as additional input features to a PyTorch neural network.

A hybrid model combines multiple machine learning models to leverage their individual strengths for improved performance. For example, in this code, XGBoost will be used to capture complex relationships in structured data and provide probability predictions, while a PyTorch neural network will take these predictions along with the original features to refine the final output. This combination will allow the model to benefit from both the powerful boosting algorithm of XGBoost and the flexibility of deep learning in PyTorch, often resulting in better predictions and handling of various aspects of the problem more effectively than using a single model alone.

This code builds and trains a classification model using PyTorch to predict the class of wine from the wine dataset in scikit-learn. The dataset contains 178 samples, each described by 13 numeric features, and the task is to assign each wine to one of 3 classes.

[Feel free to clone the code on your local computer –> GitHub Link]

import numpy as np
import xgboost as xgb
import torch
from torch import nn
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# Step 1: Load and preprocess the dataset
data = load_wine()
X = data.data
y = data.target

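X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize features; fit the scaler on the training split only to avoid test-set leakage
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)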

# Step 2: Train an XGBoost model on the dataset
xgb_model = xgb.XGBClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)
xgb_model.fit(X_train, y_train)

# Generate class probabilities from XGBoost
train_xgb_preds = xgb_model.predict_proba(X_train)  # Probabilities for each class
test_xgb_preds = xgb_model.predict_proba(X_test)

# Step 3: Combine XGBoost predictions (probabilities) with original features
X_train_combined = np.hstack((X_train, train_xgb_preds))
X_test_combined = np.hstack((X_test, test_xgb_preds))

# Step 4: Convert combined data to PyTorch tensors
X_train_tensor = torch.tensor(X_train_combined, dtype=torch.float32)
X_test_tensor = torch.tensor(X_test_combined, dtype=torch.float32)
y_train_tensor = torch.tensor(y_train, dtype=torch.long)
y_test_tensor = torch.tensor(y_test, dtype=torch.long)

# Step 5: Define the PyTorch neural network for classification
class HybridModel(nn.Module):
    def __init__(self, input_dim, num_classes):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(input_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 32),
            nn.ReLU(),
            nn.Linear(32, num_classes)  # One raw score (logit) per class
        )

    def forward(self, x):
        return self.fc(x)

# Step 6: Initialize the model, loss function, and optimizer
torch.manual_seed(42)  # Make weight initialization reproducible
input_dim = X_train_combined.shape[1]  # 13 original features + 3 XGBoost probabilities
num_classes = len(np.unique(y))
model = HybridModel(input_dim, num_classes)

criterion = nn.CrossEntropyLoss()  # For classification
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

# Step 7: Train the PyTorch model
num_epochs = 200
for epoch in range(num_epochs):
    model.train()
    optimizer.zero_grad()

    predictions = model(X_train_tensor)
    loss = criterion(predictions, y_train_tensor)
    loss.backward()
    optimizer.step()

    if (epoch + 1) % 50 == 0:
        print(f"Epoch {epoch+1}/{num_epochs}, Loss: {loss.item():.4f}")

# Step 8: Evaluate the model
model.eval()
with torch.no_grad():
    train_predictions = model(X_train_tensor)
    test_predictions = model(X_test_tensor)

    train_pred_labels = torch.argmax(train_predictions, dim=1)
    test_pred_labels = torch.argmax(test_predictions, dim=1)

    # Convert label tensors to NumPy arrays before handing them to scikit-learn
    train_accuracy = accuracy_score(y_train, train_pred_labels.numpy())
    test_accuracy = accuracy_score(y_test, test_pred_labels.numpy())

print(f"Final Training Accuracy: {train_accuracy:.4f}")
print(f"Final Testing Accuracy: {test_accuracy:.4f}")

[NOTE: If you face any error related to importing pytorch, try downgrading it using pip3 install torch==2.0.1].

The code may look hefty, so let me break down what we are doing:

We started by loading the wine dataset with load_wine(), which is a classification task with 3 classes. Next, we trained an XGBClassifier and generated class probabilities using predict_proba().

Now here comes the most interesting part: the output from XGBoost, such as class probabilities, serves as additional features for the neural network that will be built using PyTorch. These XGBoost outputs are concatenated with the original input features and then fed into the neural network. The neural network learns how to combine these enhanced features, adjusting its parameters to minimize loss and improve predictions.
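One variation worth knowing about: in the code above, the XGBoost probabilities for the training rows come from a model that has already seen those rows, so they can look more confident than the probabilities for unseen data. A common practice in stacking is to use out-of-fold predictions instead, where each row’s probabilities come from a model trained without that row. The sketch below shows the idea in plain Python. Note that it swaps XGBoost for a tiny, hypothetical nearest-centroid classifier (centroid_fit and centroid_predict_proba are made up for illustration) so that the example stays self-contained.

```python
import math


def centroid_fit(rows, labels):
    """Hypothetical stand-in base model: one mean vector per class."""
    centroids = {}
    for label in sorted(set(labels)):
        members = [r for r, l in zip(rows, labels) if l == label]
        centroids[label] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def centroid_predict_proba(centroids, row):
    """Softmax over negative squared distances to each class centroid."""
    scores = [-sum((a - b) ** 2 for a, b in zip(row, center))
              for _, center in sorted(centroids.items())]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def out_of_fold_probas(rows, labels, n_folds=3):
    """Each row gets probabilities from a model that never saw that row."""
    oof = [None] * len(rows)
    for fold in range(n_folds):
        held_out = [i for i in range(len(rows)) if i % n_folds == fold]
        train = [i for i in range(len(rows)) if i % n_folds != fold]
        model = centroid_fit([rows[i] for i in train], [labels[i] for i in train])
        for i in held_out:
            oof[i] = centroid_predict_proba(model, rows[i])
    return oof


# Toy data: two features, two well-separated classes
rows = [[0.1, 0.2], [0.0, 0.3], [0.2, 0.1], [2.0, 2.1], [2.2, 1.9], [1.9, 2.0]]
labels = [0, 0, 0, 1, 1, 1]

oof = out_of_fold_probas(rows, labels)
# Same idea as np.hstack in the article: original features + base-model probabilities
combined = [row + probs for row, probs in zip(rows, oof)]
print(len(rows[0]), "features ->", len(combined[0]), "features")
```

With scikit-learn, cross_val_predict(..., method="predict_proba") can produce the same kind of out-of-fold probabilities for a real model such as XGBClassifier.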

By incorporating the insights from XGBoost, the neural network can focus on learning more complex patterns, allowing both models to complement each other and potentially boost overall performance. As the loss function, we use CrossEntropyLoss, which suits classification, rather than Mean Squared Error (MSE), which suits regression. Finally, we evaluate the accuracy using accuracy_score() from sklearn.metrics.
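If you are wondering why the network ends in a plain Linear layer with no softmax: nn.CrossEntropyLoss expects raw scores (logits) plus integer class labels, applies log-softmax internally, and by default averages the loss over the batch. The plain-Python sketch below mirrors that computation; the function names and example logits are made up for illustration.

```python
import math


def cross_entropy(logits, target):
    """Loss for one sample: -log(softmax(logits)[target])."""
    top = max(logits)  # subtract the max for numerical stability
    log_sum_exp = top + math.log(sum(math.exp(z - top) for z in logits))
    return log_sum_exp - logits[target]


def mean_cross_entropy(batch_logits, targets):
    """Average the per-sample losses, like the default 'mean' reduction."""
    losses = [cross_entropy(l, t) for l, t in zip(batch_logits, targets)]
    return sum(losses) / len(losses)


logits = [4.0, 0.5, -1.0]  # the network strongly favors class 0
confident_right = cross_entropy(logits, target=0)
confident_wrong = cross_entropy(logits, target=2)
uniform = cross_entropy([0.0, 0.0, 0.0], target=1)  # no preference among 3 classes

print(confident_right, uniform, confident_wrong)
```

The loss is small when the favored class is the true one, equals log(3) when the network has no preference among three classes, and grows quickly when the network is confidently wrong.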

Overwhelming, isn’t it? Just start slow by focusing on XGBoost first. Once it is trained, imagine its output as extra training data for the neural network. That’s it. The rest is just coding mechanics like calling functions and building the network object.

On a different note, if you are interested in learning more about PyTorch, check out PyPixelate on Instagram, who recently started a “100 Days of PyTorch” series that is perfect if you are starting your journey on the topic. He also shares informative posts on Python programming, so don’t forget to check out his account.

Summary

Gradient Boosting is like the Swiss Army knife of machine learning: it’s powerful, flexible, and when used right, it can take your model from good to great. By focusing on fixing errors and improving over time, it creates a supercharged version of your model that performs well on real-world problems. In the coding example above, we trained XGBoost on the wine dataset to generate class probabilities. These were then combined with the original features to serve as training data for the PyTorch neural network.

So, what are you waiting for? Give gradient boosting a try, and see your models soar!

But Wait!

It is also fascinating to learn about ideas and code from other smart brains out there. Got any cool stories about programming, machine learning, or any other project? Get in touch with me on Instagram and let’s chat about making it public so others can learn from it! Don’t forget to follow for more machine learning hacks, code snippets, tips and, more importantly, MEMES to make you a pro.