What is a Generator in Python, and how can you use generators to work with large datasets in a Pythonic fashion?
A Generator is a function that returns a lazy iterator. Said differently, you can iterate over the iterator, but it is lazy: it only executes its code when you iterate over it.
A simple example could be as follows.
def my_generator():
    # Do something
    yield 5
    # Do something more
    yield 8
    # Do something else
    yield 12
Then you can iterate over the generator as follows.
for item in my_generator():
    print(item)
This will print 5, 8, and 12.
At first sight, this doesn’t look very useful. But let’s understand a bit better what happens.
On the first iteration of the for-loop, Python executes the code in my_generator until it reaches the first yield.
There it pauses and hands back the value after yield.
On the next iteration, execution resumes where it left off and continues until it reaches the next yield.
Again it pauses and hands back the value after yield.
This continues until there are no more yield statements.
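You can watch this pause-and-resume behavior directly by driving the generator by hand with next(), which is exactly what the for-loop does under the hood:

```python
def my_generator():
    # Do something
    yield 5
    # Do something more
    yield 8
    # Do something else
    yield 12

gen = my_generator()  # nothing runs yet; we only get a generator object

print(next(gen))  # executes up to the first yield and prints 5
print(next(gen))  # resumes and runs to the next yield, prints 8
print(next(gen))  # prints 12
# a fourth next(gen) would raise StopIteration, which the for-loop handles for us
```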
Now why is that powerful?
Let’s explore some use-cases.
Suppose you have a pipeline of work items with a pre-processing step. Often you would combine the pre-processing with the actual processing, but your code becomes more readable and maintainable if you split them up.
Explore the example.
def pre_process_items():
    for row in open('data.txt'):
        row = row.strip()
        freq = {c: row.count(c) for c in set(row)}
        yield freq

freq = {}
for item in pre_process_items():
    for k, v in item.items():
        freq[k] = freq.get(k, 0) + v
In this case you prepare the work item in pre_process_items().
If you want to learn about Dict Comprehensions, read this guide.
This way you divide your code into one piece that prepares the data and another that processes it, which makes the code easier to understand.
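Generator stages also compose naturally into longer pipelines. A minimal sketch (strip_rows and count_chars are illustrative names, not from the article, and an in-memory list stands in for the file):

```python
def strip_rows(rows):
    # pre-processing stage: clean each raw line
    for row in rows:
        yield row.strip()

def count_chars(rows):
    # processing stage: a character-frequency dict per row
    for row in rows:
        yield {c: row.count(c) for c in set(row)}

raw = ['abba\n', 'cc\n']
# generators chain lazily: nothing is stripped or counted
# until the loop pulls the next item through the pipeline
for freq in count_chars(strip_rows(raw)):
    print(freq)
```

Each stage stays small and testable on its own, and you can add or remove stages without touching the others.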
Often you have a list of possible work items, but only a few of them actually need to be processed.
A simple example is processing a log file where we are only interested in a specific log level.
def get_warnings(log_file):
    for row in open(log_file):
        if 'WARNING' in row:
            yield row

for warning in get_warnings('log_file.txt'):
    print(warning)
This example shows how generators simplify filtering.
If you want to learn more about text processing in Python read this guide.
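For a filter this simple, the same idea can be written as a generator expression. Here an in-memory list of lines stands in for the file so the sketch is self-contained:

```python
lines = ['INFO start\n', 'WARNING low disk space\n', 'INFO done\n']

# same filter as get_warnings(), as a one-line generator expression
warnings = (row for row in lines if 'WARNING' in row)

for warning in warnings:
    print(warning, end='')
```

Like a generator function, the expression produces matches lazily, one at a time.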
A great use case is making API calls, which may require setup, filtering of the result, and possibly reformatting.
import pandas_datareader as pdr
from datetime import datetime, timedelta

def get_stocks(tickers):
    d = datetime.now() - timedelta(days=7)
    for ticker in tickers:
        data = pdr.get_data_yahoo(ticker, d)
        close_price = list(data['Close'])
        yield close_price

for prices in get_stocks(['AAPL', 'TWTR']):
    print(prices)
The advantage is that the API call is only made when you need the data (lazy loading). Say you have a list of thousands of tickers: if you had to make all the calls before you could start processing, you would be in for a long wait.
With Generators you can utilize the power of lazy-loading.
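To see the effect without touching a real API, here is a hypothetical sketch where fetch_prices simulates the API call and simply counts invocations. Consuming only the first few items of the generator triggers only that many calls:

```python
from itertools import islice

def fetch_prices(tickers):
    # stand-in for a real API call; counts how many "calls" are made
    for ticker in tickers:
        fetch_prices.calls += 1
        yield (ticker, [100.0])  # dummy close prices

fetch_prices.calls = 0

tickers = ['T%d' % i for i in range(1000)]

# pull only the first three items from the lazy generator
first_three = list(islice(fetch_prices(tickers), 3))

print(fetch_prices.calls)  # only 3 "API calls" were made, not 1000
```

With an eager list you would have paid for all 1000 calls up front; with the generator you pay only for what you consume.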
If this is something you like and you want to get started with Python, then check out my 8-hour FREE video course with full explanations, projects at each level, and guided solutions.