
I wrote a class that reads a txt file. The file is composed of blocks of non-empty lines (let's call them "sections"), separated by an empty line:

line1.1
line1.2
line1.3

line2.1
line2.2

My first implementation was to read the whole file and return a list of lists, that is a list of sections, where each section is a list of lines. This was obviously terrible memory-wise.

So I re-implemented it as a generator of lists, that is at every cycle my class reads a whole section in memory as a list and yields it.
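
A minimal sketch of that second version, assuming a plain function called `iter_sections` (a hypothetical name; the real class also validates each line, which is left out here):

```python
import io


def iter_sections(file_handle):
    """Yield each block of non-empty lines as a list of lines."""
    section = []
    for line in file_handle:
        line = line.rstrip("\n")
        if line:
            section.append(line)
        elif section:
            # An empty line closes the current section.
            yield section
            section = []
    if section:
        # The last section may not be followed by an empty line.
        yield section


# Hypothetical sample input, matching the layout shown above.
sample = io.StringIO("line1.1\nline1.2\nline1.3\n\nline2.1\nline2.2\n")
sections = list(iter_sections(sample))
```

Only one section is held in memory at a time, which is exactly why a very large section is still a problem.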

This is better, but it's still problematic in case of large sections. So I wonder if I can reimplement it as a generator of generators? The problem is that this class is very generic, and it should be able to satisfy both of these use cases:

  1. read a very big file, containing very big sections, and cycle through it only once. A generator of generators is perfect for this.
  2. read a smallish file into memory to be cycled over multiple times. A generator of lists works fine, because the user can just invoke

    list(MyClass(file_handle))

However, a generator of generators would NOT work in case 2, as the inner objects would not be transformed to lists.

Is there anything more elegant than implementing an explicit to_list() method, that would transform the generator of generators into a list of lists?

crusaderky
  • Have you tried working with readline? That way only a single line is read at a time, delimited by a newline. This keeps memory use small, unless the lines themselves are huge. – Vivek Sep 26 '13 at 16:20
  • @Vivek My lines are very complex, and from each of them I generate an object that validates the line and whose status depends on the previous lines as well. Exposing the internal formatting of the file to the user is not an option. – crusaderky Sep 26 '13 at 16:24
  • can you give a sample input line... – Vivek Sep 26 '13 at 16:26
  • 1
    What exactly is the question? How to write the generator of generators, or how to create the list of lists for small files, assuming you have the generator? For the latter case: What about `[list(section()) for section in MyClass(file_handle)]`? – tobias_k Sep 26 '13 at 16:28
  • @tobias_k The question is how to create the list of lists from the generator of generators in a way that is reasonably transparent and elegant for the user. Your example is how I would implement the explicit to_list() method I mentioned, but I was wondering whether there is anything that doesn't require the user to call an explicit to_list() method. In other words, I want to keep the library from crashing mysteriously as soon as a distracted user does list(MyClass(file_handle)). – crusaderky Sep 26 '13 at 16:34

2 Answers

Python 2:

map(list, generator_of_generators)

Python 3:

list(map(list, generator_of_generators))

or for both:

[list(gen) for gen in generator_of_generators]

Since the generated objects are generator functions, not mere generators, you'd want to do

[list(gen()) for gen in generator_of_generator_functions]

If that doesn't work I have no idea what you're asking. Also, why would it return a generator function and not a generator itself?


Since you said in the comments that you want to prevent list(generator_of_generator_functions) from crashing mysteriously, the answer depends on what you really want.

  • It is not possible to override the behaviour of list in this way: either you store the sub-generator elements or you don't.

  • If you really do get a crash, I recommend exhausting the sub-generator with the main generator loop every time the main generator iterates. This is standard practice and exactly what itertools.groupby does, a stdlib generator-of-generators.

For example:

def metagen():
    def innergen():
        yield 1
        yield 2
        yield 3

    for i in range(3):
        r = innergen()
        yield r

        for _ in r: pass
  • Or use a dark, secret hack method that I'll show in a mo' (I need to write it), but don't do it!
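
For the blank-line-separated file in the question, the second option above can be sketched with itertools.groupby itself. This is only an illustration: `iter_sections` is a hypothetical name, not part of the asker's class, and the per-line validation is omitted.

```python
from itertools import groupby


def iter_sections(lines):
    """Yield each section as a lazy iterator over its lines."""
    stripped = (line.rstrip("\n") for line in lines)
    # Consecutive lines are grouped by whether they are non-empty.
    for non_empty, group in groupby(stripped, key=bool):
        if non_empty:
            yield group


# Hypothetical sample input: two sections separated by an empty line.
text = ["a1\n", "a2\n", "\n", "b1\n", "b2\n", "b3\n"]

# Streaming use: consume each section before asking for the next one.
streamed = [list(section) for section in iter_sections(text)]

# Partial consumption is safe: advancing groupby skips the unread remainder.
firsts = [next(section) for section in iter_sections(text)]
```

As the itertools documentation notes, a group is no longer visible once the groupby object has been advanced, so calling list() on the outer generator first still does not give usable inner sections. This covers the single-pass use case only.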

As promised, the hack (for Python 3, this time 'round):

from collections import UserList


def objectitemcaller(key):
    def inner(*args, **kwargs):
        try:
            return getattr(object, key)(*args, **kwargs)
        except AttributeError:
            return NotImplemented
    return inner


class Listable(UserList):
    def __init__(self, iterator):
        self.iterator = iterator
        self.iterated = False

    def __iter__(self):
        return self

    def __next__(self):
        self.iterated = True
        return next(self.iterator)

    def _to_list_hack(self):
        self.data = list(self)
        del self.iterated
        del self.iterator
        self.__class__ = UserList

for key in UserList.__dict__.keys() - Listable.__dict__.keys():
    if key not in ["__class__", "__dict__", "__module__", "__subclasshook__"]:
        setattr(Listable, key, objectitemcaller(key))


def metagen():
    def innergen():
        yield 1
        yield 2
        yield 3

    for i in range(3):
        r = Listable(innergen())
        yield r

        if not r.iterated:
            r._to_list_hack()

        else:
            for item in r: pass

for item in metagen():
    print(item)
    print(list(item))
#>>> <Listable object at 0x7f46e4a4b850>
#>>> [1, 2, 3]
#>>> <Listable object at 0x7f46e4a4b950>
#>>> [1, 2, 3]
#>>> <Listable object at 0x7f46e4a4b990>
#>>> [1, 2, 3]

list(metagen())
#>>> [[1, 2, 3], [1, 2, 3], [1, 2, 3]]

It's so bad I don't want to even explain it.

The key is that you have a wrapper that can detect whether it has been iterated, and if not you run a _to_list_hack that, I kid you not, changes the __class__ attribute.

Because of conflicting layouts we have to use the UserList class and shadow all of its methods, which is just another layer of crud.

Basically, please don't use this hack. You can enjoy it as humour, though.

Veedrac

A rather pragmatic way would be to tell the "generator of generators" upon creation whether to generate generators or lists. While this is not as convenient as having list magically know what to do, it still seems to be more comfortable than having a special to_list function.

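def gengen(n, listmode=False):
    def gen(i):
        # i is passed in explicitly so that each generator keeps its own
        # value even if it is consumed after the outer loop has moved on
        for k in range(i + 1):
            yield k

    for i in range(n):
        yield list(gen(i)) if listmode else gen(i)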

Depending on the listmode parameter, this can either be used to generate generators or lists.

# Python 3 print calls
for gg in gengen(5, False):
    print(gg, list(gg))
print(list(gengen(5, True)))

tobias_k