How to Parse Data Efficiently in Python

If you have worked through some machine learning exercises, you have probably come across a scenario where you need to parse information available in the form of a DataFrame, such as the Diabetes Prediction exercise on Kaggle. Parsing the data can consume a significant amount of time if the number of samples is high. The reason lies in how the DataFrame lays out its data in memory, which determines how much work is done every time we access it. Hence, it becomes crucial to parse data efficiently to reduce this overhead.

Tabular data can be stored in memory in one of two layouts: row-major format or column-major format. In row-major format, the elements of each row are stored next to each other in memory, so after reading one row of the DataFrame, reading the next one is fast. Similarly, column-major format stores the elements of each column contiguously, which makes reading consecutive columns fast.
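As a quick illustration, here is a plain Python sketch (the numbers are arbitrary) of how the same 3x4 table would be flattened into one contiguous block of memory under each layout:

# A 3x4 table of samples (rows) by features (columns); the values are arbitrary.
table = [[1,  2,  3,  4],
         [5,  6,  7,  8],
         [9, 10, 11, 12]]

# Row-major: the elements of each row sit next to each other.
row_major = [value for row in table for value in row]
print(row_major)  # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]

# Column-major: the elements of each column sit next to each other instead.
col_major = [table[r][c] for c in range(4) for r in range(3)]
print(col_major)  # [1, 5, 9, 2, 6, 10, 3, 7, 11, 4, 8, 12]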

In machine learning, we often have data available in a CSV file that has samples as rows and features as columns. Understandably, we would use the row-major format to access samples and the column-major format to access features. How to do this efficiently in Python is what we will cover in this blog.

Pandas

Pandas is an open-source Python library that makes reading and writing a DataFrame an easy task. In fact, it can be considered the default library in machine learning when it comes to loading data to develop a model. Unfortunately, many people use Pandas in the wrong way and end up consuming more computing resources and, eventually, more time to parse the data.

Let us consider the diabetes example and see how parsing the DataFrame using Pandas consumes more time.

import pandas as pd
import time

# Load the Kaggle diabetes dataset (768 samples).
data = pd.read_csv("diabetes.csv")

In the first trial, we will iterate over the rows of the DataFrame and monitor the time taken.

start_time = time.time()
# Visit every element row by row using positional indexing.
for row in range(len(data)):
  for element in data.iloc[row]:
    pass
end_time = time.time() - start_time
print("Time taken: %f seconds" % end_time)

Output:

Time taken: 0.069382 seconds

Next, we will do the same for the columns of the DataFrame.

start_time = time.time()
# Visit every element column by column using the column labels.
for col in data.columns:
  for element in data[col]:
    pass
end_time = time.time() - start_time
print("Time taken: %f seconds" % end_time)

Output:

Time taken: 0.001905 seconds

Notice the difference between the time taken to parse rows and columns using Pandas. Iterating over the rows is more than 30 times slower, because a DataFrame stores its values column by column: each column is contiguous in memory, so data[col] is cheap, whereas data.iloc[row] has to gather one value from every column to assemble the row. Parsing the data row by row with Pandas therefore comes at the expense of time, and this is with only 768 samples; the issue is amplified when working with data that contains hundreds of thousands of samples, as the sketch below illustrates.
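If you want to see that effect on your own machine, one quick way is to stack the same DataFrame a hundred times and repeat the row loop (the factor of 100 below is arbitrary and only serves to inflate the row count):

# Inflate the dataset purely for timing purposes: 768 * 100 = 76,800 rows.
big_data = pd.concat([data] * 100, ignore_index=True)

start_time = time.time()
for row in range(len(big_data)):
  for element in big_data.iloc[row]:
    pass
print("Time taken: %f seconds" % (time.time() - start_time))

Now let us look at an alternative.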

NumPy

The popular numerical Python library NumPy offers a solution to this problem. The idea is to convert the Pandas DataFrame into a NumPy array before parsing the data.
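The conversion is a single call to to_numpy(); we also record the array's shape so that the loops below can refer to rows and cols:

import numpy as np

# Convert the DataFrame into a plain NumPy array and note its dimensions.
new_data = data.to_numpy()
rows, cols = new_data.shape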

start_time = time.time()
# Visit every element row by row using plain array indexing.
for row in range(rows):
  for element in new_data[row, :]:
    pass
end_time = time.time() - start_time
print("Time taken: %f seconds" % end_time)

Output:

Time taken: 0.001629 seconds

The time taken to parse the rows is 0.001629 seconds, which is significantly lower than the 0.069382 seconds taken by Pandas. Let us do the same for the columns.

start_time = time.time()
# Visit every element column by column using plain array indexing.
for col in range(cols):
  for element in new_data[:, col]:
    pass
end_time = time.time() - start_time
print("Time taken: %f seconds" % end_time)

Output:

Time taken: 0.000634 seconds

Parsing the columns takes 0.000634 seconds in NumPy compared to 0.001905 seconds in Pandas. NumPy is clearly the better option when we want to iterate over the values of a DataFrame.
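If you would like to reproduce all four measurements in one script, here is a small sketch (the helper names time_it, pandas_rows, and so on are made up for this example); it uses time.perf_counter, which is better suited to short benchmarks than time.time:

import time
import pandas as pd

data = pd.read_csv("diabetes.csv")
array = data.to_numpy()
rows, cols = array.shape

def time_it(label, loop):
  # Time a zero-argument callable with a high-resolution clock.
  start = time.perf_counter()
  loop()
  print("%s: %f seconds" % (label, time.perf_counter() - start))

def pandas_rows():
  for row in range(len(data)):
    for element in data.iloc[row]:
      pass

def pandas_columns():
  for col in data.columns:
    for element in data[col]:
      pass

def numpy_rows():
  for row in range(rows):
    for element in array[row, :]:
      pass

def numpy_columns():
  for col in range(cols):
    for element in array[:, col]:
      pass

time_it("Pandas rows", pandas_rows)
time_it("Pandas columns", pandas_columns)
time_it("NumPy rows", numpy_rows)
time_it("NumPy columns", numpy_columns)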

Summary

In this blog, we saw how the way we parse data influences how much work is done in memory and, as a result, how much time the process takes. Parsing data with NumPy turns out to be more efficient than doing so with Pandas. The table below summarizes the time taken by each library:

          Time taken to parse rows (s)    Time taken to parse columns (s)
Pandas    0.069382                        0.001905
NumPy     0.001629                        0.000634

Hence, proper parsing techniques reduce the load on memory and make our machine learning workflows more efficient. Pandas may be the go-to library for dealing with data, but it also comes with limitations, as we saw here. If you are interested in other functionalities of Pandas, have a look at this guide to Pandas, where you will find basic Pandas functions that are useful in data preprocessing. You can show your support by subscribing to the MLS Newsletter, where you will be notified of new blog articles and more news from the Machine Learning Site.
