
I'm writing a blog post about generators in the context of screen scraping (or rather, making lots of requests to an API based on the contents of a large-ish text file), and after reading this nifty comic by Julia Evans, I want to check something.

Assume I'm on Linux or OS X.

Let's say I'm making a screen scraper with scrapy (knowing scrapy isn't essential for this question, but it might be useful context).

I have an open file like so, and I want to be able to yield a scrapy.Request for every line I pull out of a large-ish CSV file.

    with open('top-50.csv') as csvfile:

        urls = gen_urls(csvfile)

        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

gen_urls is a generator function that looks like this:

    def gen_urls(file_object):

        while True:

            # Read a line from the file, by seeking til you hit something like '\n'
            line = file_object.readline()

            # Drop out if there are no lines left to iterate through
            if not line:
                break

            # turn '1,google.com\n' to just 'google.com'
            domain = line.split(',')[1]
            trimmed_domain = domain.rstrip()

            yield "http://domain/api/{}".format(trimmed_domain)

This works, but I want to understand what's happening under the hood.

When I pass csvfile to gen_urls() like so:

    urls = gen_urls(csvfile)

My understanding is that gen_urls works by pulling out a line at a time inside the while loop with file_object.readline(), then handing it back with yield "http://domain/api/{}".format(trimmed_domain).
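
To check that, I played with a tiny made-up generator (noisy_gen isn't part of the scraper, it's just for poking at in a REPL), and it seems to show that calling the function doesn't run any of its body:

    # A tiny made-up generator, just to check when the body actually runs
    def noisy_gen():
        print("body started")
        yield "something"

    g = noisy_gen()   # prints nothing -- none of the body has run yet
    print(type(g))    # <class 'generator'>
    print(next(g))    # only now does "body started" get printed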

Under the hood, I think the file object is a wrapper around some file descriptor, and readline() essentially seeks forwards through the file until it finds the next newline \n character. The yield then pauses the function until the next call to __next__() or the built-in next(), at which point it resumes the loop. That next() is called implicitly by the for loop in the snippet that looks like:

    for url in urls:
        yield scrapy.Request(url=url, callback=self.parse)
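
If I drive the generator by hand with next() instead of the for loop, I think it goes something like this (I'm using io.StringIO here as a stand-in for the open CSV file, just so it's easy to try at a REPL; the output strings assume the format string from gen_urls above):

    import io

    # An in-memory stand-in for the open csv file
    fake_csv = io.StringIO("1,google.com\n2,example.org\n")

    urls = gen_urls(fake_csv)
    print(next(urls))  # reads one line, yields "http://domain/api/google.com", then pauses
    print(next(urls))  # resumes after the yield, reads the next line, yields "http://domain/api/example.org"
    # A third next(urls) would get an empty string from readline(), break out of
    # the while loop, and raise StopIteration -- which is how the for loop knows to stop.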

Because we're only pulling one line at a time from the file descriptor and then 'pausing' the function with yield, we don't end up with loads of stuff in memory. And because scrapy uses an evented model, you can make a bunch of scrapy.Request objects without them all immediately sending off a bajillion HTTP requests and saturating your network. This way, scrapy is also able to do useful things like throttle how quickly they're sent, how many are sent concurrently, and so on.
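
For contrast, the non-generator version would be something like this (list_urls is made up just for comparison), which builds the entire list in memory before a single Request gets created:

    def list_urls(file_object):
        # Same transformation as gen_urls, but every URL is held in a list at once
        urls = []
        for line in file_object:
            domain = line.split(',')[1]
            urls.append("http://domain/api/{}".format(domain.rstrip()))
        return urls

With a 50-line CSV that difference is trivial, but with millions of lines the generator version is the one that keeps memory flat.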

Is this about right?

I'm mainly looking for a mental model that helps me think about using generators in Python and explain them to other people, rather than all the gory details. I've been using them for ages without thinking through what's happening, and I figured asking here might shed some light.

Chris Adams