How are Python Generators useful in Machine Learning?

In Python, a generator function lets us iterate over a sequence of elements, such as the items of a list, and hands back the next element every time we ask for it. One might think the same could be achieved by simply looping over the list. When it comes to usability, however, there is a small difference between the two approaches, which we will see today. More precisely, we will look at a use of generators that comes up often when developing a machine learning model, where Python generators prove more appropriate than plain loops.

So get into your workspace, open up your favorite IDE (notepad supremacy!), and let us explore Python generators together. Before jumping straight into chunky words like functions, keywords, memory, etc., let us consider a small scenario.

Story Time

You are planning to cook different recipes for the next 10 days. For every recipe, you need one egg as a part of the ingredients. You would buy an entire box with 10 eggs in one visit to the supermarket, rather than going to the supermarket every day for the next 10 days (At least that’s what a sane person would do, but I doubt it for programmers!). Anyway, so you buy those eggs, store them in the refrigerator, and take one out every time you need it. In this way, you don’t have to visit the supermarket every day; it is much more convenient and efficient to buy all and store them at once.

In this scenario, the refrigerator is our generator (hurts to say this being a mechanical engineer). It ‘generates’ the value we need at the time we call it. To make this clear, let us now go into programming mode.

So what actually are Generators?

Python generators, as mentioned at the beginning of this blog, are functions that allow us to iterate over a sequence of elements, such as a list, and return the next element every time we ask for one. They do not store all of the values in memory at once; each value is produced only when it is requested.

Unlike traditional Python functions, generators use the yield keyword instead of the return keyword. As a simple example, let us consider the following:

myList = [0,1,2,3,4,5,6,7,8,9]

def get_nums():
    for i in range(10):
        yield myList[i]

numbers_gen = get_nums()
print(next(numbers_gen))
print("Printing a line in between...")
print(next(numbers_gen))

We create a list with numbers from 0 to 9. As you see in the generator function get_nums(), we iterate through the list and yield its elements one by one. Note that we haven’t stored the output in any variable yet.

Now we call get_nums() to create a generator object and name it numbers_gen. We request the first element from the generator using the built-in function next(). This gives us the output 0. The statement "Printing a line in between..." is printed after that. Calling next() again returns the value of the subsequent element of the list, i.e., 1. The output of the code above will be as follows:

Output:
0
Printing a line in between...
1
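
To make the pause-and-resume behaviour visible, here is a minimal sketch. The names count_up_to and events are made up for this illustration and are not part of the example above. The generator records what it is doing in a list, so we can check how far its body has run after each call to next().

```python
events = []  # hypothetical log of what the generator body has done so far

def count_up_to(limit):
    """Yield 1..limit and record each step in the events list."""
    events.append("started")
    for n in range(1, limit + 1):
        events.append(f"yielding {n}")
        yield n  # execution pauses here until next() is called again
    events.append("finished")

counter = count_up_to(2)               # creating the generator runs none of the body
events_after_creation = list(events)

first = next(counter)                  # runs the body up to the first yield
events_after_first = list(events)

second = next(counter)                 # resumes after the first yield, stops at the second
events_after_second = list(events)

print(events_after_creation)
print(first, events_after_first)
print(second, events_after_second)
```

Nothing is recorded when the generator is created, and each next() call adds just enough entries to reach the following yield. That is the sense in which a generator produces values only when asked.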

As can be seen, a generator yields the next value of a list or any other object every time we call next() on it. But what happens if we keep asking it for values? In that case, once the values run out, the generator raises the following exception:

Output:
0
1
2
3
4
5
6
7
8
9
Traceback (most recent call last):
  File "/home/workspace/src/tests/generator.py", line 10, in <module>
    print(next(numbers_gen))
          ~~~~^^^^^^^^^^^^^
StopIteration

After the last element is yielded, the next call to next() raises a StopIteration exception. This means that the generator has retrieved all the elements from the list and there is nothing left to generate.
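
In practice we rarely have to catch StopIteration by hand. The sketch below uses a shortened version of the list from the example and shows two common ways to deal with exhaustion: a for loop, which calls next() for us and stops cleanly, and the optional default argument of next(). The variable names collected and values are only for illustration.

```python
myList = [0, 1, 2]  # shortened list, for illustration only

def get_nums():
    for value in myList:
        yield value

# A for loop calls next() behind the scenes and ends when StopIteration is raised.
collected = []
for number in get_nums():
    collected.append(number)

# next() accepts a default that is returned once the generator is exhausted,
# so no exception is raised on the fourth call here.
numbers_gen = get_nums()
values = [next(numbers_gen, None) for _ in range(4)]

print(collected)
print(values)
```

The for loop is usually the simplest choice. The default argument is handy when you request values one at a time and want a sentinel instead of an exception.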

So now the question is, where do we use generators? Python generators are appropriate when we do not want to process all the data together. For example, while developing a machine learning model that involves image data, we do not want to load and preprocess all the images at once. We will see this through an example in the next section.
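
Before turning to images, the memory argument can be made concrete with a rough sketch. The function squares and the size LIMIT are arbitrary choices for this illustration. It compares the size of a fully built list with the size of a generator object that produces the same values.

```python
import sys

def squares(limit):
    """Yield n*n for n = 0 .. limit-1, one value at a time."""
    for n in range(limit):
        yield n * n

LIMIT = 1_000_000  # arbitrary size, chosen only for illustration

squares_list = list(squares(LIMIT))  # forces every value into memory at once
squares_gen = squares(LIMIT)         # nothing has been computed yet

# getsizeof reports the size of the object itself: for the list that is the
# container of references (not the integers it points to), for the generator
# it is the small generator object.
list_size = sys.getsizeof(squares_list)
gen_size = sys.getsizeof(squares_gen)
print(f"list: {list_size} bytes, generator: {gen_size} bytes")

# Both give the same total; the generator never holds all values together.
total_from_list = sum(squares_list)
total_from_gen = sum(squares_gen)
print(total_from_list == total_from_gen)
```

The exact byte counts depend on the Python version and platform, but the generator object stays small no matter how large LIMIT is, while the list grows with it.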

Python Generators in Machine Learning

When a machine learning model is trained on image data, the training typically runs over a massive data set. It also involves preprocessing the images. During preprocessing, the images are loaded individually and different preprocessing techniques are applied to them, such as rotation, zooming, and scaling. Generators allow us to accomplish this task efficiently. Let us go through the following example together. Feel free to code along.

Create a folder named data in your current workspace. Add a few PNG images to it and name them numerically, as in 1.png, 2.png, 3.png, etc. Next, we create a Python script and define a generator function called image_generator. This function takes the path of the folder that contains our image data.

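import os

import cv2

IMAGE_PATH = "data/"

def image_generator(path):
    """Yield images named 1.png, 2.png, ... from path until one is missing."""
    i = 1
    while True:
        img_path = os.path.join(path, str(i) + ".png")
        if os.path.exists(img_path):
            # Read the image from disk only when it is requested
            image = cv2.imread(img_path)
            yield image
        else:
            # No file with this number: stop generating
            break
        i += 1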

Assume that the number of images in the folder is unknown. We therefore use a while loop and, on every iteration, check whether the next image file exists. If it does, we read and yield it; if not, we break out of the loop, which ends the generator.

Next, we create another function that consumes the images. This function calls the generator function above and retrieves the images one by one using next(), as seen in line 5 of the code below. Each image is displayed using the imshow() function of OpenCV.

def image_generation_func():
    image_gen = image_generator(IMAGE_PATH)
    while True:
        try:
            image = next(image_gen)
            cv2.imshow("image", image)
            cv2.waitKey(1000)
        except StopIteration:
            print("Generated all images.")
            break

image_generation_func()

As a further step, each image can be passed to another function that applies randomly chosen transformations to it, for example converting it to grayscale, scaling it, or zooming in. In this way, generators let us load the data only when we need it, which gives us both efficiency and flexibility in programming.
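
Here is a rough sketch of what such a further step could look like. Everything in it is an assumption made for illustration: fake_images stands in for image_generator and yields tiny nested lists instead of real images, and flip_horizontal, invert, augment and batches are hypothetical helpers, not OpenCV functions. The point is only that generators can be chained, so each image is transformed and grouped into a batch at the moment it is needed.

```python
import itertools
import random

def fake_images(count):
    """Stand-in for image_generator: yields tiny 2x2 'images' as nested lists."""
    for i in range(count):
        yield [[i, i + 1], [i + 2, i + 3]]

def flip_horizontal(image):
    """Mirror the image left to right by reversing every row."""
    return [row[::-1] for row in image]

def invert(image):
    """Invert pixel values, assuming an 8-bit range of 0..255."""
    return [[255 - pixel for pixel in row] for row in image]

def augment(images, rng):
    """Lazily apply one randomly chosen transformation to each image."""
    for image in images:
        transform = rng.choice([flip_horizontal, invert])
        yield transform(image)

def batches(items, batch_size):
    """Group any iterable into lists of at most batch_size items."""
    iterator = iter(items)
    while True:
        batch = list(itertools.islice(iterator, batch_size))
        if not batch:
            return  # source exhausted: end this generator as well
        yield batch

rng = random.Random(0)  # seeded so that repeated runs pick the same transformations
pipeline = batches(augment(fake_images(5), rng), batch_size=2)

batch_sizes = [len(batch) for batch in pipeline]
print(batch_sizes)
```

With real data, fake_images(5) would be replaced by image_generator(IMAGE_PATH) and the two toy transformations by actual image operations. No stage needs to know how many images there are in total.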

Conclusion

As we conclude this blog on Python generators, here’s a brief takeaway: generators are your tools for battling memory bloat and keeping your code elegant. If you are working with machine learning, do not hesitate to use generator functions to load your data. They come in really handy for data processing.

Feel free to use this concept in your code and share it with me.
