You might have just learned some machine learning theory and be looking for a small, hands-on project to get a grip on it. In this blog, we are going to build a machine learning model in Python step by step and understand each step along the way. By the end of this blog, you will hopefully understand a basic machine learning model and, more importantly, have the motivation to tackle more advanced projects.
[If you are interested in further sources and courses to pursue machine learning, have a look at my path to machine learning. You’ll find a summary of how I pursued it and what resources I used.]
Note that this blog does not provide a detailed explanation of how a machine-learning model works. I have just provided a step-by-step guide to building a simple machine-learning model and briefly explained wherever necessary.
Why use Machine Learning?
Before we begin building a machine learning model, it is a crucial step to understand if we need one for a problem statement. Machine learning in practice can be computationally and financially expensive. Naively choosing it as a solution to every problem can lead to a loss of time and money. Therefore it is necessary to look at ML models as more than just a black box that magically yields optimum values.
In my opinion, machine learning is best suited to stochastic problems, that is, problems where we deal with uncertainty. If I ask you to build a model to predict whether a number is odd or even, that would be a waste of time, as you can easily write a small program that gives the answer with certainty. However, if I ask you to build a model that predicts the outcome of a die roll, you might think about it, run experiments, and train a model on the readings you collect. Hence, it is important to understand why you do what you do.
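To make the odd-or-even point concrete, here is a minimal sketch of the "small program" mentioned above. The function name is_even is my own choice for illustration; the point is only that a deterministic rule needs no training data at all.

```python
def is_even(number: int) -> bool:
    """Return True when the number is divisible by 2, no model required."""
    return number % 2 == 0


print(is_even(4))  # True
print(is_even(7))  # False
```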
Brief Introduction to the Exercise
The project we are going to work on is the “hello world” of the machine learning world. We are going to predict the price of a house based on its characteristics. Because this is supervised machine learning, we are going to require data on which we will train our model. So here’s the flow of the process we will follow:
- Importing and cleaning the data
- We import the data, which already holds information about the houses. It gives us the characteristics of the houses and their corresponding prices.
- Splitting the data into features and labels
- The data contains the features of the houses and their prices. The prices are our ground truth values. Hence, we split them from each other to later compare the prediction with the ground truth.
- Building the model
- When we build a model, what we are doing is defining different mathematical functions. Fundamentally, a function takes in inputs and maps them to some output, using some parameters. When we say that we train a model, what we are doing is optimizing the parameters so that the difference between the mapped value (or predicted value) and the ground truth is as minimal as possible.
- Evaluating the model
- Now that our model is trained, we need to check whether it performs well on a random, unseen sample. If our model is trained properly, its predictions should be close to the ground truth.
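The idea of "training as optimizing parameters" from the list above can be sketched with a toy example. Everything here is assumed for illustration: the data is made up, the model is a single-parameter function y = w * x, and the optimizer is plain gradient descent. A decision tree, which we use later, is fitted differently, so treat this only as a picture of what "minimizing the difference between prediction and ground truth" means.

```python
# Toy data (made up for illustration): the true relationship is y = 2 * x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]


def mean_squared_error(w: float) -> float:
    """Average squared difference between predictions w * x and ground truth y."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


w = 0.0               # initial guess for the single parameter
learning_rate = 0.01  # step size, chosen by hand for this toy problem

for _ in range(200):
    # Gradient of the mean squared error with respect to w.
    gradient = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    # Move w a small step in the direction that reduces the error.
    w -= learning_rate * gradient

print("learned parameter:", round(w, 3))
print("remaining error:", mean_squared_error(w))
```

After the loop, the learned parameter sits very close to 2, the value that was used to generate the toy data.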
Exploratory Data Analysis is out of the scope of this blog, so we will not go deep into statistics. The main aim of this exercise is to get hands-on experience with a small machine learning project.
Importing the data
For this small exercise, we will use an open-source data set from Kaggle. You can download the data from here. After you’ve downloaded it, we will import it in our code as follows:
import pandas as pd
housing_data = pd.read_csv(path_to_your_csv_file)
housing_data.columns
You should now see the column names of the data set:
- ‘Suburb’, ‘Address’, ‘Rooms’, ‘Type’, ‘Price’, ‘Method’, ‘SellerG’, ‘Date’, ‘Distance’, ‘Postcode’, ‘Bedroom2’, ‘Bathroom’, ‘Car’, ‘Landsize’, ‘BuildingArea’, ‘YearBuilt’, ‘CouncilArea’, ‘Lattitude’, ‘Longtitude’, ‘Regionname’, ‘Propertycount’.
Among these, the column “Price” is our label. The other columns will be the features. The features of the houses determine their prices.
Dropping the Null values
Next, we talk about data cleaning. Raw data often contains null values, which is very common in real cases. Among the various approaches to dealing with missing values, dropping them is a reasonable one when you do not mind losing some samples from your data. This is what we are going to do: we will drop every row that contains a null value. We do this because our machine learning model only understands numeric values, and null values can cause errors when training the model.
housing_data = housing_data.dropna(axis=0)
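To see what dropping null values does, here is a small sketch on a made-up table. The DataFrame toy and its numbers are assumptions for illustration only; the column names are borrowed from the housing data.

```python
import numpy as np
import pandas as pd

# Made-up miniature data set with two missing values.
toy = pd.DataFrame({
    "Rooms": [2, 3, 4],
    "BuildingArea": [80.0, np.nan, 150.0],
    "Price": [500000.0, 650000.0, np.nan],
})

# Count the missing values in each column before cleaning.
print(toy.isnull().sum())

# axis=0 drops every ROW that has at least one missing value.
cleaned = toy.dropna(axis=0)
print(len(toy), "rows before,", len(cleaned), "rows after")
```

Note how a single missing entry removes the whole row, which is why this approach can cost you a large share of a real data set.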
It is important to note that the columns of housing_data can be accessed like attributes of a normal Python object. Hence, for instance, if you want to read data from the “Price” column, you can just do the following:
print(housing_data.Price)
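As a small aside, attribute access and bracket access return the same column. The sketch below uses a made-up two-row table (toy is a hypothetical name) to show this. Bracket access is the safer habit when a column name contains spaces or clashes with an existing DataFrame method.

```python
import pandas as pd

# Made-up table, only to compare the two ways of reading a column.
toy = pd.DataFrame({"Rooms": [2, 3], "Price": [500000, 650000]})

by_attribute = toy.Price      # attribute-style access
by_bracket = toy["Price"]     # bracket-style access

# Both expressions give the same pandas Series.
print(by_attribute.equals(by_bracket))
```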
Splitting the Features from Labels
Now we will pick the columns that we think are relevant and can qualify as features. There are systematic ways to do this, such as a correlation matrix, and we will cover that topic some other day. Today, we will let our instincts take over this task.
housing_features_columns = ['Rooms', 'Bathroom', 'Landsize', 'Lattitude', 'Longtitude']
housing_features = housing_data[housing_features_columns]
Now we have an object only with the features. Similarly, we will create one that contains only the labels:
housing_prices = housing_data.Price
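If you want to check what the split produced, the following sketch may help. It uses a made-up three-row table, and the names toy, X, and y are my own for illustration: the features form a table with one column per characteristic, while the labels form a single column of prices.

```python
import pandas as pd

# Made-up miniature data set with two features and the label.
toy = pd.DataFrame({
    "Rooms": [2, 3, 4],
    "Bathroom": [1, 2, 2],
    "Price": [500000, 650000, 800000],
})

feature_columns = ["Rooms", "Bathroom"]
X = toy[feature_columns]  # DataFrame: one row per house, one column per feature
y = toy["Price"]          # Series: one price per house

print(X.shape)  # (number of houses, number of features)
print(y.shape)  # (number of houses,)
```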
Building the Model
As I mentioned earlier, building a model means defining different mathematical functions. Fortunately, the Python library scikit-learn (imported as sklearn) provides ready-made models that we can import and use. The model we are building is called a Decision Tree Regressor.
from sklearn.tree import DecisionTreeRegressor
housing_price_model = DecisionTreeRegressor(random_state=1)
housing_price_model.fit(housing_features, housing_prices)
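To get a feeling for what the fitted tree does, here is a sketch on a tiny made-up data set (four houses, one feature; the numbers are assumptions for illustration). With default settings, a decision tree keeps splitting until every training sample ends up in its own leaf, so it reproduces the training prices exactly.

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Made-up data: four houses described by a single feature.
X = pd.DataFrame({"Rooms": [1, 2, 3, 4]})
y = pd.Series([100, 200, 300, 400])

model = DecisionTreeRegressor(random_state=1)
model.fit(X, y)

# Predictions on the same rows the tree was trained on.
print(model.predict(X))

# Number of leaves in the fitted tree.
print(model.get_n_leaves())
```

This memorizing behavior is worth keeping in mind when you judge predictions made on rows the model has already seen.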
Evaluating the model
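After we have trained our model, we want to test it on some samples. Because our data is limited, we will take these samples from our training set, using the first five houses. Keep in mind that predictions on houses the model has already seen will look better than predictions on truly unseen houses.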
print("Using the following 5 houses for model evaluation:")
print(housing_features.head())
print("The model predictions are")
print(housing_price_model.predict(housing_features.head()))
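The evaluation above reuses training rows. As a hedged sketch of what testing on unseen data could look like, the example below holds back a quarter of the rows with scikit-learn's train_test_split and compares the mean absolute error on both parts. The data is synthetic and the price formula is invented purely for illustration; it is not the Kaggle data set.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic houses: price depends on rooms and land size, plus random noise.
rng = np.random.default_rng(0)
rooms = rng.integers(1, 6, size=200)
landsize = rng.uniform(100, 1000, size=200)
X = pd.DataFrame({"Rooms": rooms, "Landsize": landsize})
y = 100_000 * rooms + 500 * landsize + rng.normal(0, 20_000, size=200)

# Hold back 25% of the rows; the model never sees them during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1
)

model = DecisionTreeRegressor(random_state=1)
model.fit(X_train, y_train)

# Compare the error on rows the model has seen with rows it has not.
train_mae = mean_absolute_error(y_train, model.predict(X_train))
test_mae = mean_absolute_error(y_test, model.predict(X_test))
print(f"MAE on training rows: {train_mae:,.0f}")
print(f"MAE on held-out rows: {test_mae:,.0f}")
```

The error on the training rows is practically zero because the tree has memorized them, while the error on the held-out rows gives a more honest picture of how the model behaves on new houses.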
Summary
In this exercise, we looked at the necessary steps involved in building a machine learning model. We built a Decision Tree Regressor; however, other problems could demand another type of model (random forest, linear regression, logistic regression, etc.). The basic workflow, though, remains the same.
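To illustrate that the workflow stays the same across model types, here is a sketch that trains three scikit-learn regressors on the same made-up data using the identical fit and predict calls. The data and the choice of n_estimators=10 are assumptions for illustration only.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Made-up data: price grows linearly with the number of rooms.
X = pd.DataFrame({"Rooms": [1, 2, 3, 4, 5]})
y = pd.Series([100, 200, 300, 400, 500])

models = {
    "decision tree": DecisionTreeRegressor(random_state=1),
    "random forest": RandomForestRegressor(n_estimators=10, random_state=1),
    "linear regression": LinearRegression(),
}

# One house to predict, with the same column name as the training data.
query = pd.DataFrame({"Rooms": [3]})

predictions = {}
for name, model in models.items():
    model.fit(X, y)                              # same training call for every model
    predictions[name] = model.predict(query)[0]  # same prediction call as well
    print(name, predictions[name])
```

Only the line that creates the model changes; importing, splitting, fitting, and predicting stay as they were.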
As you proceed in the field of machine learning, you’ll learn more about how to analyze an imported dataset and how to clean it. Such techniques help us build a model that is robust and stable in the presence of outliers. But for now, you just built your first machine learning model in Python.