
Example for context:

Does using * to unpack input pull everything into memory? I'm hoping not, but I just want to confirm my understanding.

input = (x for x in ((1, 'abc'), (2, 'def'))) # generator expression
unzipped = zip(*input) # Does *input get completely unpacked or stay memory efficient?
first_items = next(unzipped)
print(first_items)
# >> (1, 2)
aksg87
  • `((1, 'abc'), (2, 'def'))` is not a generator expression, but a tuple (of tuples). – Manfred Mar 07 '22 at 19:04
  • Assuming `input` were a generator, yes, as soon as you unpack, you are consuming the entire generator. It won't yield the individual items lazily as arguments to `zip`, since you are explicitly creating a new iterator via `zip`, which needs arguments as soon as you instantiate it. – Paul M. Mar 07 '22 at 19:09
  • @Manfred good call! posted that late at night -- updated so it's indeed a generator expression I think – aksg87 Mar 08 '22 at 20:02
  • @PaulM. - Thanks for the response. I was hoping it would lazily evaluate the expression. – aksg87 Mar 08 '22 at 20:03
  • @aksg87: Side-note: `input` in your example has to hold a reference to the eagerly produced `tuple` of `tuple`s, so the generator expression doesn't actually save any memory (it actually costs some; if you used `input = ((1, 'abc'), (2, 'def'))` unpacking it would pass the raw `tuple` in, while unpacking the genexpr means making a *new* `tuple` that's a shallow copy of the original). I get this is a toy example, but I figured I should be clear that generator expressions aren't magic; if they loop over an eagerly produced iterable, that iterable exists for (at least) the life of the genexpr. – ShadowRanger Mar 08 '22 at 20:27
  • Yes, it absolutely does. `func(*input)` creates a *tuple* out of the iterable `input` – juanpa.arrivillaga Mar 08 '22 at 20:51
  • @juanpa.arrivillaga: Note: Technically, it requires a *logical* `tuple`, but it needn't make a *real* `tuple`. Modern CPython's vectorcall protocol means sometimes it just makes a flat C array of the arguments rather than actually making a `tuple`. But yes, from a user's point of view, the distinction is immaterial, the arguments *must* be eagerly realized *before* `func` receives them. – ShadowRanger Mar 08 '22 at 21:06

1 Answer


Unpacking eagerly unpacks the top level of the iterable in question, so in your case, yes, it will run the generator expression to completion before zip is actually invoked, then perform the equivalent of zip((1, 'abc'), (2, 'def')). If the iterables inside the generator are themselves lazy iterators, though, zip won't pre-read them at all, which is usually the more important savings. For example, if input is defined with:

input = (open(name) for name in ('file1', 'file2'))

then while:

unzipped = zip(*input)

does eagerly open both files (so you may as well have used a listcomp; the genexpr didn't really save anything), it doesn't read a single line from either of them. When you then do:

first_items = next(unzipped)

it will read exactly one line from each, but it doesn't read the rest of the file until you ask for more items (technically, under the hood, file objects do block reads, so it will read more than just the line it returns, but that's an implementation detail; it won't slurp the whole of a 10 GB file just to give you the first line).
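
To see the eager-unpack / lazy-read split without touching the filesystem, here's a minimal sketch (make_source and the toy data are invented for illustration); each "source" prints when it is created and when it is read from:

def make_source(name, items):
    print(f'opening {name}')  # runs when the outer genexpr is unpacked
    def gen():
        for item in items:
            print(f'reading {item!r} from {name}')  # runs only when zip asks for a value
            yield item
    return gen()

sources = (make_source(n, data) for n, data in (('a', (1, 2)), ('b', (10, 20))))

unzipped = zip(*sources)      # prints "opening a" and "opening b", but nothing is read yet
first_items = next(unzipped)  # now prints one "reading ..." line per source
print(first_items)            # (1, 10)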

This is the nature of * unpacking; the receiving function needs to populate its arguments at the moment it is called. If you define:

def foo(a, b):
    print(b)
    print(a)

it would be very strange if a caller could do foo(*iterator), have the iterator raise an exception while producing the value for a, and only see that exception when print(b) runs (at which point the call would have to advance the iterator twice to lazily populate b). No one would have the foggiest idea what went wrong. And literally every function would have to deal with the fact that simply loading its arguments (without doing anything with them) might raise an exception. Not pretty.
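
To make that concrete, here's a small sketch (the boom generator is made up for illustration) showing that, because unpacking is eager, the failure surfaces at the call itself, before foo's body ever runs:

def foo(a, b):
    print(b)
    print(a)

def boom():
    yield 1
    raise ValueError('could not produce the second argument')

foo(*boom())  # raises ValueError right here, while the arguments are being built;
              # foo never starts, so there's no mystery about where the failure came from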

When it's reasonable to handle lazy iterators, just accept the iterator directly (it isn't reasonable for zip: the very first output needs to read from all of the arguments anyway, so at best you'd delay realizing the arguments from construction to the first time you extract a value, which saves nothing unless you build a zip object and then discard it unused). Or do both; itertools' chain allows both an eager:

for item in chain(iter1, iter2):

and a lazy:

for item in chain.from_iterable(iter_of_iters):

calling technique, precisely because it didn't want to force people with an iter_of_iters to realize all of the iterators in memory before it chained a single value from the first one (which is what for item in chain(*iter_of_iters): would require).
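
As a rough illustration of that difference (the noisy helper and the names are invented), from_iterable realizes each inner iterator only when chain actually reaches it, while the * form realizes all of them before chain even exists:

from itertools import chain

def noisy(name):
    print(f'realizing {name}')
    return iter((f'{name}-1', f'{name}-2'))

# Lazy: only 'a' is realized to produce the first item; 'b' and 'c' stay untouched for now
lazy = chain.from_iterable(noisy(n) for n in ('a', 'b', 'c'))
print(next(lazy))  # prints "realizing a", then a-1

# Eager: unpacking realizes 'a', 'b' and 'c' immediately, before any value is chained
eager = chain(*(noisy(n) for n in ('a', 'b', 'c')))
print(next(eager))  # a-1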

ShadowRanger
  • Good answer, but I'm not sure I follow why this is possible/reasonable for `chain` but not `zip`. Why would it be unreasonable to have some `zip_from_iterable`? – juanpa.arrivillaga Mar 08 '22 at 20:55
  • @juanpa.arrivillaga: `zip`, on its very first use, must have access to *all* the iterables provided. `next(zip(itera, iterb, iterc))` reads an `a`, `b` and `c` from all three inputs; it'll need all three almost immediately. `chain` doesn't need to; `next(chain(itera, iterb, iterc))` just reads the first value from `itera`, and doesn't touch `iterb` and `iterc` until it's finished producing all the values of `itera`. So if you're, say, `chain`ing 5000 files, and your open file limit is 1000, you gain meaningful benefits by not even opening file `n+1` until `n` is consumed. – ShadowRanger Mar 08 '22 at 21:00
  • With `zip_from_iterable`, all you'd save is the `*`; sure, I suppose there might be a *tiny* benefit to avoiding unpacking and letting `zip_from_iterable` iterate the iterable itself, but internally, it would have to store all the iterators to satisfy the very first `next()` anyway, so the savings would be a micro-optimization at best (instead of a Python `tuple`, it might store a plain C array of args); in the 5000 files case, it would need them all `open` to produce the first 5000-tuple (containing the first line of each of the files), so you'd blow your open file limit immediately anyway. – ShadowRanger Mar 08 '22 at 21:03
  • It seems obvious now. Thanks for elaborating – juanpa.arrivillaga Mar 08 '22 at 21:08
  • This makes a lot of sense @ShadowRanger - Thanks! – aksg87 Mar 09 '22 at 22:54