
I want to understand how heapq.merge() works with infinite generators. Consider this example:

>>> from heapq import merge
>>> from itertools import count
>>> m = merge(count(0, 2), count(1, 2))
>>> for _ in range(10):
...     print(next(m))
...
0
1
2
3
4
5
6
7
8
9 

The docs state that it does not pull the data into memory all at once. But how does it consume each of the infinite generators?

planetp
  • The inputs are assumed to be sorted, so it just has to check the first element of each and yield the lower one, kind of like merge sort. – tobias_k Jul 08 '20 at 11:20
  • BTW, are you asking "how is this possible" or "how is it implemented"? – tobias_k Jul 08 '20 at 11:21
  • `check the first element of each` - you can't do it with generators, can you? Yes, I'm interested in the implementation. – planetp Jul 08 '20 at 11:30
  • You cannot "peek" the first element, but you can "pop" / yield it and remember it for later in case it is not the smallest. – tobias_k Jul 08 '20 at 11:41
  • Can you clarify your question? What do you mean by "how does it consume each of the infinite generators"? Your question *already* consumes an infinite generator, so you seem to know how to do this. Are you asking how the algorithm to select the next item from all generators works? – MisterMiyagi Jul 08 '20 at 11:44

1 Answer


A very simple implementation of such a function could look like the following. Note that, for the sake of simplicity, it does not handle special (and not-so-special) cases such as empty or exhausted iterables, so it only suits non-empty, infinite inputs like the ones in the question.

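import heapq

def merge(*iterables):
    # next() needs iterators, so convert lists and other iterables first
    iterators = [iter(it) for it in iterables]
    # seed the heap with (first value, index of its source) pairs
    heap = [(next(it), i) for i, it in enumerate(iterators)]
    heapq.heapify(heap)
    while heap:
        val, i = heapq.heappop(heap)
        yield val
        # refill from the source that just supplied the smallest value
        heapq.heappush(heap, (next(iterators[i]), i))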

It works like this:

  • take the first element from each sorted iterable and put it on a heap, together with that iterable's index in the argument list
  • pop the smallest element from the heap and yield it
  • push the next element from the iterable with that index onto the heap, then repeat from the second step
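
The steps above can be extended to cope with empty and exhausted inputs. The following is a minimal sketch under that assumption, not the CPython implementation; the name lazy_merge is made up for illustration. It catches StopIteration whenever an input runs dry (on Python 3.7+, an uncaught StopIteration inside a generator is turned into a RuntimeError), so the heap simply shrinks until nothing is left.

```python
import heapq
from itertools import count, islice


def lazy_merge(*iterables):
    """Lazily merge sorted iterables (illustrative sketch, hypothetical name)."""
    iterators = [iter(it) for it in iterables]
    heap = []
    # Seed the heap with the first item of every non-empty input.
    for i, it in enumerate(iterators):
        try:
            heap.append((next(it), i))
        except StopIteration:
            pass  # empty input: it contributes nothing
    heapq.heapify(heap)
    while heap:
        val, i = heapq.heappop(heap)
        yield val
        # Refill from the iterator that just supplied the smallest value.
        try:
            heapq.heappush(heap, (next(iterators[i]), i))
        except StopIteration:
            pass  # this input is exhausted, so the heap shrinks by one


# Infinite inputs: only as many items are pulled as are requested.
print(list(islice(lazy_merge(count(0, 2), count(1, 2)), 10)))
# Finite inputs, one of them empty: the generator terminates on its own.
print(list(lazy_merge([1, 4, 7], [], [2, 3, 9])))
```

At any moment the heap holds at most one pending item per input, which is why memory use stays constant even when the inputs are infinite.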

The actual implementation is a bit more involved, but seems to work roughly along the same lines. You can find your local copy of the source via heapq.__file__, which on my system is /usr/lib/python3.6/heapq.py, and check for yourself.
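
As a small convenience, the source can also be pulled up from within Python. This sketch assumes a standard CPython install where heapq.merge is written in pure Python and its source file is available on disk; the variable name source is just for illustration.

```python
import heapq
import inspect

# Path of the heapq module on this system.
print(heapq.__file__)

# Source text of heapq.merge, read from the file above.
source = inspect.getsource(heapq.merge)
print(source)
```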

tobias_k