
Here is the relevant code I have. It uses a generator to get the words from the file. However, the words are first stored in a variable before being passed to a function. Is this correct?

Does this take advantage of the generator functionality?

def do_something(words):
    new_list = {}
    for word in words:
        # do stuff to each word
        # then add to new_list
        new_list[word] = word  # placeholder so the loop body is valid Python
    return new_list

def generate_words(input_file):
    for line in input_file:
        for word in line.split(' '):
            # do stuff to word
            yield word

if __name__ == '__main__':
    with open("in.txt") as input_file:
        words = generate_words(input_file)
        do_something(words)

Thank you

The Puma
  • @jamlak I was thinking that since do_something has to wait for the words variable to be "generated", it is slower than adding the words to the collection inside do_something() as they are yielded. Does that make sense? Am I missing something? – The Puma Apr 19 '13 at 06:07
  • That's not a list, it's a dictionary. `do_something` doesn't have to wait for the words variable to be generated, it is generating them one at a time during the function. – jamylak Apr 19 '13 at 06:11

3 Answers


When you write `words = generate_words(input_file)`, you are simply binding `words` to the newly created generator object; none of the generator's code has run yet. When `do_something` iterates over `words`, that is when the generator is actually stepped through. So the answer is yes, you are taking advantage of generators.
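One way to see this laziness for yourself (a hypothetical demo, not part of the question's code): put a `print` inside the generator and note that nothing is printed until the first iteration.

def generate_words(input_file):
    for line in input_file:
        for word in line.split(' '):
            print('yielding:', word)  # runs only once iteration begins
            yield word

with open("in.txt") as input_file:
    words = generate_words(input_file)  # nothing printed yet: no line has been read
    first_word = next(words)            # now the first line is read and one word is yielded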

jamylak

The code looks fine. What is stored in `words` is a fresh generator, prepared to run the code in `generate_words`; that code will only actually run when the `for word in words:` loop is triggered. If you want to know more, this SO question has a whole heap of information.
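If you want to inspect that behaviour directly, the standard library can report a generator's state (a minimal sketch; the toy `count_up` generator is made up for illustration):

import inspect

def count_up():
    yield 1
    yield 2

gen = count_up()
print(inspect.getgeneratorstate(gen))  # GEN_CREATED: the body has not started running
next(gen)
print(inspect.getgeneratorstate(gen))  # GEN_SUSPENDED: paused at the first yield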


There is no advantage to using generators in the given example. The main purpose of generators is to reduce memory usage.

In the code:

for line in input_file:

`line` has already been read from the file and has consumed memory. The split operation then creates a new list, consuming memory one more time.

So all you have to do is iterate through the list's items.

Using a generator here just creates a generator object that yields items from an already existing list. It is completely useless.
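To make that point concrete, here is a sketch of what this answer is describing (the `process` function is hypothetical): each call to `line.split(' ')` materializes a small list regardless, so the generator only re-yields items that already exist in memory.

def process(word):
    # hypothetical per-word processing
    return word.upper()

with open("in.txt") as input_file:
    for line in input_file:              # lines are still read one at a time
        words_in_line = line.split(' ')  # this list exists whether or not a generator is used
        for word in words_in_line:
            process(word)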

emcpow2
  • You seem to disagree with all the other posters. Can you explain further? – The Puma Apr 20 '13 at 17:13
  • 1
    I disagree. Loading the file line-by-line uses much less memory than loading the entire file into memory (unless the entire file *is* one line). – Blender Apr 22 '13 at 09:13
  • 2
    The point is that we are reading each `line` one by one... I agree that `for word in line.split(' '): yield word` requires a list from `line.split` to be loaded into memory but when dealing with individual lines, this is a **very quick** operation since that list is **tiny**. `yield`ing from that list is not a costly operation, and you can't shun generators for that. Imagine loading every line in the file into memory... and then performing operations on it without generators, that would be very memory inefficient and slow as well for huge files – jamylak Apr 22 '13 at 09:14
  • Thanks for the comments. I know about that but wanted more discussion – emcpow2 Apr 22 '13 at 10:07