When To Use Generators in Python
Today, “what are Generators in Python” and “what are Generators used for in Python” are some of the most popular Python interview questions.
Often, a Generator is considered one of the slightly more intermediate concepts in Python. If you are new to learning Python, you may not have come across Generators yet. Here’s a tip: it has something to do with the use of yield statements inside a function.
In this post, I am going to highlight some of the use cases, reasons, and advantages of using Generators in Python. In short, you should consider using Generators when dealing with large datasets with memory constraints.
Let’s dive a little bit deeper, shall we?
TL;DR
- Consider using a Generator when dealing with a huge dataset
- Consider using a Generator in scenarios where we do NOT need to iterate over the data more than once
- Generators give us lazy evaluation
- They are a great way to generate sequences in a memory-efficient manner
Why Should I Care About Using Generators
Memory constraints
To understand why you should use Generators, we have to first understand that computers have a finite amount of memory (RAM). Whenever we are storing or manipulating variables, lists, etc., all of that is stored inside our memory.
You might ask, why do computer programs store them in memory? Because it’s the fastest way for us to write and retrieve data.
Scenarios
Have you ever had to work with a list so large that you ran into a MemoryError? Perhaps you have tried reading rows from a very large Excel (or .csv) file. All I remember is that performing these tasks was painfully slow or outright impossible.
What Is a Generator Function
To put it simply, a Generator function is a special kind of function that produces a sequence of items. The point here is that the items are handed back one by one rather than all at once.
The main difference between a regular function and a Generator function lies in the use of return and yield statements, respectively.
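As a minimal sketch of that difference, the two hypothetical functions below (squares_regular and squares_generator are names made up for illustration) produce the same values, one with return and one with yield:

```python
def squares_regular(n):
    """Regular function: builds the whole list, then returns it once."""
    result = []
    for i in range(n):
        result.append(i * i)
    return result


def squares_generator(n):
    """Generator function: hands back one value per iteration via yield."""
    for i in range(n):
        yield i * i


print(squares_regular(5))          # [0, 1, 4, 9, 16]
print(list(squares_generator(5)))  # [0, 1, 4, 9, 16]
```

Calling squares_generator(5) does not return a list; it returns a Generator object that produces the values as we iterate over it.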
Generators give you lazy evaluation
You may have come across this statement. But, what does it really mean?
If you are familiar with Iterators, calling a Generator function gives you a Generator object, which is itself an Iterator.
Behind the scenes, Generators don’t compute the value of each item when they are created. Rather, they compute it only when we ask for it. This is what people mean when they say Generators give you lazy evaluation.
As a result, Generators allow us to process and deal with one value at a time without having to load everything in memory first.
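Lazy evaluation can be observed directly. The sketch below uses a hypothetical countdown Generator and a log list (both invented for illustration) to record when the function body actually starts running:

```python
log = []  # records when the generator body actually executes


def countdown(start):
    """Hypothetical generator that counts down from start to 1."""
    log.append("body started")
    while start > 0:
        yield start
        start -= 1


gen = countdown(3)               # creates the generator; the body has not run
started_before_next = list(log)  # [] because nothing has executed yet

first = next(gen)                # runs the body up to the first yield -> 3
second = next(gen)               # resumes after the yield -> 2
rest = list(gen)                 # consumes what is left -> [1]

print(started_before_next, log, first, second, rest)
```

Each next call resumes the function right after the previous yield, so only one value is computed at a time.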
When and Where Should I Use Generators
Generators are great when you encounter problems that require you to read from a large dataset. Reading from a large dataset indirectly means our computer or server would have to allocate memory for it.
The only condition to remember is that a Generator can only be iterated once. In other words, as long as we do not need to go back to earlier values from our dataset, a Generator is a good fit.
Reading sizable CSV
A common use case for Generators is working with large files such as Excel or CSV documents. Without a Generator function, we would typically read the entire file into a list in one go.
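A minimal sketch of what such a regular function might look like is shown here. The name read_csv_from_regular_fn and the file large_dataset.csv come from this post; the body, the path parameter, and the use of the standard csv module are assumptions made for illustration:

```python
import csv


def read_csv_from_regular_fn(path="large_dataset.csv"):
    """Read every row of the CSV file into a list and return it at once.

    Assumed implementation: the whole file ends up in memory.
    """
    with open(path, newline="") as csv_file:
        return list(csv.reader(csv_file))
```

Every row is held in memory at the same time, so memory use grows with the size of the file.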
Running a regular function like read_csv_from_regular_fn on a large file may be slow or even raise a MemoryError, depending on our computer.
To produce its result, read_csv_from_regular_fn opens our CSV file and loads every row into memory at once.
This is not a good solution when working with files larger than our available memory. Alternatively, we could use a Generator function.
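A Generator-based reader could be sketched as follows. The name read_csv_from_generator_fn and the file large_dataset.csv come from this post; the body, the path parameter, and the use of the standard csv module are assumptions made for illustration:

```python
import csv


def read_csv_from_generator_fn(path="large_dataset.csv"):
    """Yield the rows of the CSV file one at a time.

    Assumed implementation: only the current row is held in memory,
    and the file stays open while rows are being consumed.
    """
    with open(path, newline="") as csv_file:
        for row in csv.reader(csv_file):
            yield row
```

A caller would then loop over read_csv_from_generator_fn() and handle each row as it arrives, instead of waiting for the whole file to load.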
In this scenario, read_csv_from_generator_fn is our Generator function. It opens our large CSV file, loops through every row, and yields one row at a time rather than all rows at once.
Here, we would not run into a MemoryError or slowness caused by memory constraints when reading data from our large_dataset.csv.
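To compare the size in bytes of what each function returns, we could do the following (note that sys.getsizeof measures only the returned object itself, not the rows it refers to):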
import sys
print(sys.getsizeof(read_csv_from_generator_fn())) # 112 bytes
print(sys.getsizeof(read_csv_from_regular_fn())) # 1624056 bytes
Iterating through a large list (array)
Another case where Generators are often used is when we intend to process values from a large list:
# Example 1
nums_list_comprehension = [i * i for i in range(100_000_000)]
sum(nums_list_comprehension) # 333333328333333350000000
Depending on your computer, you may encounter a MemoryError or at least a couple of seconds of slowness when evaluating the expression above.
Similar to a list comprehension, a Generator expression allows us to quickly create a Generator object without having to use the yield statement.
To cope with our memory constraint, we could turn the code example above into a Generator expression. Creating the Generator is almost instantaneous, because no values are computed until sum consumes them:
# Example 2
nums_generator = (i * i for i in range(100_000_000)) # <generator object <genexpr> at 0x106ecc580>
sum(nums_generator) # 333333328333333350000000
In Example 1, i * i for the entire range of 100_000_000 is evaluated and stored in memory beforehand. It returns a full list.
In Example 2, i * i is only evaluated during iteration, one value at a time. It returns a Generator object.
Remember, Generators don’t compute the value of each item when being instantiated.
The differences in memory usage are below:
import sys
print(sys.getsizeof(nums_generator)) # 112 bytes
print(sys.getsizeof(nums_list_comprehension)) # 835128600 bytes
When NOT To Use Generators
We need the previous values
A Generator can only be iterated once.
The example below shows that the Generator created in nums_generator can only be iterated once. Using sum on it a second time results in zero because the Generator is exhausted.
# Continuing from Example 2
sum(nums_generator) # 333333328333333350000000
sum(nums_generator) # 0: the Generator is already exhausted, because it can only be iterated once.
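When the values are needed more than once, two common options are to create a fresh Generator for each pass or to store the values in a list. The sketch below uses a hypothetical make_squares helper (not part of the examples above) and a small range so the results are easy to check:

```python
def make_squares(n):
    """Hypothetical factory: returns a fresh Generator on every call."""
    return (i * i for i in range(n))


squares = make_squares(5)
first_pass = sum(squares)   # 30
second_pass = sum(squares)  # 0, the Generator is exhausted

# Option 1: create a new Generator for each pass.
fresh_total = sum(make_squares(5))  # 30

# Option 2: materialize the values once if they fit in memory.
cached = list(make_squares(5))
total = sum(cached)    # 30
largest = max(cached)  # 16
```

The second option gives up the memory savings, so it only makes sense when the data comfortably fits in memory.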
Dealing with relatively small files
When dealing with relatively small files or lists, we may not want to use a Generator, as it might actually slow us down.
We can run our previous examples under cProfile to profile the performance difference between the list comprehension and the Generator expression when summing the values.
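One way to set up such a comparison is sketched below. The helper names (sum_with_list, sum_with_generator, profile_call) and the smaller range are assumptions chosen to keep the run short; the printed timings depend on the machine, so no output is shown:

```python
import cProfile
import pstats


def sum_with_list(n):
    """Sum the squares using a list comprehension (builds the full list)."""
    return sum([i * i for i in range(n)])


def sum_with_generator(n):
    """Sum the squares using a Generator expression (one value at a time)."""
    return sum(i * i for i in range(n))


def profile_call(fn, n):
    """Run fn(n) under cProfile, print the top entries, and return the result."""
    profiler = cProfile.Profile()
    result = profiler.runcall(fn, n)
    pstats.Stats(profiler).sort_stats("cumulative").print_stats(5)
    return result


n = 200_000  # assumed size, far smaller than the 100_000_000 used earlier
list_total = profile_call(sum_with_list, n)
gen_total = profile_call(sum_with_generator, n)
```

Keep in mind that cProfile records every resumption of the Generator expression as a separate call, which adds overhead of its own; the standard timeit module may give a less distorted comparison.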
From our cProfile result, we can tell that using a list comprehension is a lot faster, provided we don’t run into memory constraints.
Evidently, if memory is not an issue, we should stick with using regular functions or list comprehensions.
Conclusion
In summary, Generators are an amazing tool in Python whenever we do not need to iterate over the data more than once.
As Generators give us lazy evaluation, they are a great way to generate sequences in a memory-efficient manner. We should definitely consider using Generators when dealing with huge datasets to optimize our program.
Thank you for reading!