How to sample data *CORRECTLY* into a test set?

Splitting the available data into training and test sets helps us evaluate the performance of the model. The model is trained on the training set and its generalization capability is evaluated on the test set. The test set is usually a small percentage of the original data set, say 20%. The remaining 80% of the data is used for training the model.

Achieving low error on the training set as well as on the test set may sound like a splendid result and may lead us to think that the model generalizes well and is ready for deployment. In practice, however, the model might still perform poorly on new data. Such an issue could arise if the original data is not split appropriately, so that the test set does not represent the data the model will actually encounter.

Understanding Stratified Sampling

Say that we are conducting a survey on a college campus on how many people prefer watching Formula 1 over MotoGP. Of the students on campus, 55% are girls and the other 45% are boys. We are supposed to survey 1000 students. Picking those students purely at random might result in a skewed sample, for example one with noticeably more boys than the campus as a whole. We must make sure that the sample is representative of the original population. This means that the 1000 surveyed students must contain 55% girls and 45% boys, that is, 550 girls and 450 boys. Such sampling is called stratified sampling.
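The survey example can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the campus size of 10,000 students, the dictionary layout, and the helper name stratified_sample are all made up for the example and are not part of any library.

```python
import random

def stratified_sample(population, key, sample_size, seed=0):
    """Draw a sample in which each group keeps its share of the population."""
    rng = random.Random(seed)
    # Group the population into strata by the chosen key (here: gender).
    groups = {}
    for person in population:
        groups.setdefault(key(person), []).append(person)
    sample = []
    for members in groups.values():
        # Each stratum contributes in proportion to its size.
        # Note: with awkward proportions, rounding can make the total
        # differ slightly from sample_size.
        quota = round(sample_size * len(members) / len(population))
        sample.extend(rng.sample(members, quota))
    return sample

# Hypothetical campus: 10,000 students, 55% girls and 45% boys.
campus = [{"id": i, "gender": "girl" if i < 5500 else "boy"} for i in range(10_000)]

survey = stratified_sample(campus, key=lambda s: s["gender"], sample_size=1000)
girls = sum(1 for s in survey if s["gender"] == "girl")
boys = len(survey) - girls
print(girls, boys)  # prints: 550 450
```

Within each group the students are still chosen at random. Only the size of each group in the sample is fixed in advance.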

Stratified Sampling using Sklearn

Say we have a data set that contains information about houses. The aim is to predict the value of a house from its features. You are told that the feature income_category is important for making the prediction. Hence, you want each income category to appear in the training set and in the test set in the same proportion as in the full data set. Sklearn provides a class called StratifiedShuffleSplit that makes this task easier.

import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

data = pd.read_csv("housing.csv")

# One shuffled split that holds out 20% of the rows as the test set
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)

# split() yields positional row indices, so select the rows with iloc
for train_index, test_index in split.split(data, data['income_category']):
    train_set = data.iloc[train_index]
    test_set = data.iloc[test_index]

The split.split() method takes the entire data set as its first argument and the ‘important’ feature, whose proportions should be preserved, as the second. The following is the distribution of income categories in the original set, the training set and the test set.

[Figure: distribution of income categories in the original, training and test sets after stratified sampling]
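The same distribution can also be checked numerically. The sketch below does not use housing.csv. Instead it builds a synthetic stand-in, and the five category labels, their probabilities, and the house_value column are assumptions made only so the example runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic stand-in for the housing data (assumed: five income categories).
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "income_category": rng.choice([1, 2, 3, 4, 5], size=1000,
                                  p=[0.05, 0.30, 0.35, 0.20, 0.10]),
    "house_value": rng.normal(200_000, 50_000, size=1000),
})

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_index, test_index = next(split.split(data, data["income_category"]))
train_set, test_set = data.iloc[train_index], data.iloc[test_index]

def proportions(frame):
    # Share of rows that fall in each income category.
    return frame["income_category"].value_counts(normalize=True).sort_index()

comparison = pd.DataFrame({
    "overall": proportions(data),
    "train": proportions(train_set),
    "test": proportions(test_set),
})
print(comparison.round(3))
```

Each row of the printed table is one income category, and the three columns should agree closely.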

As seen in the image above, the proportion of each category is essentially the same across the original, training and test sets. This is the effect of stratified sampling. With purely random sampling, these proportions may or may not be preserved, because the rows are drawn without regard to their category. Stratified sampling helps us keep the proportions close to those of the actual situation.
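To see the difference between the two approaches, one option is to repeat the split with several seeds and measure how far the test set drifts from the overall proportions. The sketch below uses sklearn's train_test_split, whose stratify parameter switches between a purely random and a stratified split. The synthetic categories and the helper name max_deviation are assumptions made for this example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic income categories (assumed labels and probabilities).
rng = np.random.default_rng(0)
categories = pd.Series(
    rng.choice([1, 2, 3, 4, 5], size=1000, p=[0.05, 0.30, 0.35, 0.20, 0.10]),
    name="income_category",
)
overall = categories.value_counts(normalize=True)

def max_deviation(stratify, seed):
    """Largest gap between a category's share in the test set and in the full data."""
    _, test = train_test_split(categories, test_size=0.2,
                               random_state=seed, stratify=stratify)
    test_share = test.value_counts(normalize=True).reindex(overall.index, fill_value=0.0)
    return float((test_share - overall).abs().max())

# stratify=None gives a purely random split; passing the labels stratifies it.
random_dev = [max_deviation(None, seed) for seed in range(10)]
stratified_dev = [max_deviation(categories, seed) for seed in range(10)]

print("worst random deviation     :", round(max(random_dev), 3))
print("worst stratified deviation :", round(max(stratified_dev), 3))
```

The stratified deviation stays small for every seed, while the random one depends on the luck of the draw.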
