How can you feed an iterable to multiple consumers in constant space?
TLDR
Write an implementation which passes the following test in CONSTANT SPACE, while treating min, max and sum as black boxes.
def testit(implementation, N):
    assert implementation(range(N), min, max, sum) == (0, N-1, N*(N-1)//2)
Discussion
We love iterators because they let us process streams of data lazily, which allows huge amounts of data to be handled in CONSTANT SPACE.
def source_summary(source, summary):
    return summary(source)

N = 10 ** 8
print(source_summary(range(N), min))
print(source_summary(range(N), max))
print(source_summary(range(N), sum))
Each line took a few seconds to execute, but used very little memory. However, it did require 3 separate traversals of the source. So this will not work if your source is a network connection, data-acquisition hardware, etc., unless you cache all the data somewhere, violating the CONSTANT SPACE requirement.
Here's a version which demonstrates this problem:
def source_summaries(source, *summaries):
    from itertools import tee
    return tuple(map(source_summary,
                     tee(source, len(summaries)),
                     summaries))

testit(source_summaries, N)
print('OK')
The test passes, but tee had to keep a copy of all the data, so the space usage goes up from O(1) to O(N).
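One quick way to see the growth (a rough check of my own using tracemalloc, not part of the test):

from itertools import tee
import tracemalloc

tracemalloc.start()
a, b = tee(range(10 ** 6), 2)
sum(a)   # exhausting one copy forces tee to buffer every item for the other
_, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
print(f'peak traced memory: {peak / 1e6:.1f} MB')   # grows roughly linearly with N

The same buffering happens inside source_summaries above, because each summary finishes its whole traversal before the next one starts.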
How can you obtain the results in a single traversal with constant memory?
It is, of course, possible to pass the test given at the top, with O(1) space usage, by cheating: using knowledge of the specific iterator-consumers that the test uses. But that is not the point: source_summaries should work with any iterator-consumers, such as set, collections.Counter, ''.join, including any and all that may be written in the future. The implementation must treat them as black boxes.
To be clear: the only knowledge available about the consumers is that each one consumes one iterable and returns one result. Using any other knowledge about the consumer is cheating.
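To make the forbidden shortcut concrete, here is the kind of cheat I mean (the name cheating_summaries is just for illustration): it passes the test in constant space only because it hard-codes knowledge of min, max and sum, and never even calls the summaries it is given.

def cheating_summaries(source, *summaries):
    # Cheats: ignores the summaries and hard-codes the min/max/sum logic.
    lo = hi = None
    total = 0
    for item in source:
        if lo is None or item < lo:
            lo = item
        if hi is None or item > hi:
            hi = item
        total += item
    return (lo, hi, total)

testit(cheating_summaries, 10)   # passes, but only for these specific consumers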
Ideas
[EDIT: I have posted an implementation of this idea as an answer]
I can imagine a solution (which I really don't like) that uses

- preemptive threading
- a custom iterator linking the consumer to the source

Let's call the custom iterator link.
- For each consumer, run

      result = consumer(<link instance for this thread>)
      <link instance for this thread>.set_result(result)

  on a separate thread.
- On the main thread, something along the lines of

      for item in source:
          for l in links:
              l.push(item)
      for l in links:
          l.stop()
      for thread in threads:
          thread.join()
      return tuple(l.get_result() for l in links)
link.__next__ blocks until the link instance receives

- .push(item), in which case it returns the item, or
- .stop(), in which case it raises StopIteration.
The data races look like a nightmare. You'd need a queue for the pushes, and probably a sentinel object would need to be placed in the queue by link.stop() ... and a bunch of other things I'm overlooking.
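To make the idea more concrete, here is a rough sketch of that queue-plus-sentinel scheme (the names Link, _STOP and source_summaries_threaded are placeholders of mine; this is an outline of the idea, not a finished implementation):

import threading
import queue

_STOP = object()          # sentinel that link.stop() places in the queue

class Link:
    """Iterator handed to one consumer; the main thread feeds it via push()."""

    def __init__(self):
        # maxsize=1 keeps per-consumer buffering constant, at the cost of
        # lock-step hand-off between the producer and each consumer.
        self._queue = queue.Queue(maxsize=1)
        self._result = None

    def __iter__(self):
        return self

    def __next__(self):
        item = self._queue.get()      # blocks until push() or stop()
        if item is _STOP:
            raise StopIteration
        return item

    def push(self, item):
        self._queue.put(item)

    def stop(self):
        self._queue.put(_STOP)

    def set_result(self, result):
        self._result = result

    def get_result(self):
        return self._result


def source_summaries_threaded(source, *summaries):
    # Assumes each consumer exhausts its iterator; otherwise push() deadlocks.
    links = [Link() for _ in summaries]
    threads = [threading.Thread(target=lambda l=l, s=s: l.set_result(s(l)))
               for l, s in zip(links, summaries)]
    for t in threads:
        t.start()
    for item in source:
        for l in links:
            l.push(item)
    for l in links:
        l.stop()
    for t in threads:
        t.join()
    return tuple(l.get_result() for l in links)

The bounded queue is what keeps the space constant: with an unbounded queue, a slow consumer's backlog could grow back to O(N).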
I would prefer to use cooperative threading, but consumer(link) seems to be unavoidably un-cooperative.
Do you have any less messy suggestions?