
How can you feed an iterable to multiple consumers in constant space?

TLDR

Write an implementation which passes the following test in CONSTANT SPACE, while treating min, max and sum as black boxes.

def testit(implementation, N):
    assert implementation(range(N), min, max, sum) == (0, N-1, N*(N-1)//2)

Discussion

We love iterators because they let us process streams of data lazily, allowing the treatment of huge amounts of data in CONSTANT SPACE.

def source_summary(source, summary):
    return summary(source)

N = 10 ** 8
print(source_summary(range(N), min))
print(source_summary(range(N), max))
print(source_summary(range(N), sum))

Each line took a few seconds to execute, but used very little memory. However, it did require three separate traversals of the source. So this will not work if your source is a network connection, data acquisition hardware, etc., unless you cache all the data somewhere, losing the CONSTANT SPACE requirement.

Here's a version which demonstrates this problem:

def source_summaries(source, *summaries):
    from itertools import tee
    return tuple(map(source_summary, tee(source, len(summaries)),
                                     summaries))

testit(source_summaries, N)
print('OK')

The test passes, but tee had to keep a copy of all the data, so the space usage goes up from O(1) to O(N).
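
To make the leak visible, here is a minimal sketch (the use of tracemalloc and the smaller 10**6 source are incidental choices for this illustration): consuming one tee branch to completion forces tee to buffer every item on behalf of the lagging branch.

from itertools import tee
import tracemalloc

a, b = tee(range(10**6), 2)
tracemalloc.start()
sum(a)                                   # `a` races ahead: every item is buffered for `b`
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
print(f'peak memory while b lags: ~{peak / 1e6:.0f} MB')  # grows with N: O(N), not O(1)
sum(b)                                   # items are released as `b` consumes them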

How can you obtain the results in a single traversal with constant memory?

It is, of course, possible to pass the test given at the top, with O(1) space usage, by cheating: using knowledge of the specific iterator-consumers that the test uses. But that is not the point: source_summaries should work with arbitrary iterator-consumers, such as set, collections.Counter, ''.join, including any and all that may be written in the future. The implementation must treat them as black boxes.

To be clear: the only knowledge available about the consumers is that each one consumes one iterable and returns one result. Using any other knowledge about the consumer is cheating.
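
For concreteness, here is a sketch of such a cheat (the name source_summaries_CHEATING and its body are made up for illustration): it passes the test in O(1) space, but only by ignoring the consumers it is given and hard-coding knowledge of min, max and sum.

def source_summaries_CHEATING(source, *consumers):
    # Cheating: `consumers` is ignored, and the knowledge that the test
    # uses min, max and sum is baked in.
    it = iter(source)
    lo = hi = total = next(it)
    for item in it:
        if item < lo: lo = item
        if item > hi: hi = item
        total += item
    return lo, hi, total

testit(source_summaries_CHEATING, 10 ** 6)
print('OK')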

Ideas

[EDIT: I have posted an implementation of this idea as an answer]

I can imagine a solution (which I really don't like) that uses

  • preemptive threading

  • a custom iterator linking the consumer to the source

Let's call the custom iterator link.

  • For each consumer, run
result = consumer(<link instance for this thread>)
<link instance for this thread>.set_result(result)

on a separate thread.

  • On the main thread, something along the lines of
for item in source:
    for l in links:
        l.push(item)

for l in links:
    l.stop()

for thread in threads:
    thread.join()

return tuple(l.get_result() for l in links)
  • link.__next__ blocks until the link instance receives

    • .push(item) in which case it returns the item
    • .stop() in which case it raises StopIteration
  • The data races look like a nightmare. You'd need a queue for the pushes, and probably a sentinel object would need to be placed in the queue by link.stop() ... and a bunch of other things I'm overlooking.

I would prefer to use cooperative threading, but consumer(link) seems to be unavoidably un-cooperative.

Do you have any less messy suggestions?

jacg
  • How "black box" do these functions have to be? Would it be to compute intermediate results like in a `reduce` call? That way, instead of computing `sum(some_list)` you could initialize `tmp = 0` and then in each iteration do `tmp = sum(tmp, current_value)`. You can do this for all three operations ( `min`, `max`, `sum`) simultaneously and will need only one pass over the elements. The only problem is to pick a meaningful initial value for `tmp` for each of the three operations. – Daniel Junglas Apr 08 '20 at 12:30
  • @DanielJunglas Completely black box. Using `reduce` on an equivalent binary function requires consumer-specific knowledge. As such, it falls under the 'cheating' that I mentioned in the question. I want to provide (something like) this as a library utility, which users can call with whatever consumers they want, including ones that haven't been invented today, so the *only* thing I can know about the consumer is that it consumes an iterable to produce a result. *Anything* beyond that is cheating. – jacg Apr 08 '20 at 12:48
  • Are you doing this for the sake of the exercise or for a real world library? In the latter case, I think it would make sense to extend your function so that it takes an initializer for `tmp` as argument. If you look at the builtin `sum()` function then this is exactly what that function does. This is how you can use that function to sum up numbers or concatenate lists with the same implementation. Anyway, these are only my two cents. – Daniel Junglas Apr 08 '20 at 15:28
  • @DanielJunglas The interface of an *arbitrary* consumer of iterables is `consumer(iterable)`: *nothing else*. It doesn't matter that `sum`, or `max`, or any other specific one you have in mind offers more: the library can only rely on the *lowest common denominator*. This holds both in exercises and in the real world: it is a fundamental property of what 'interface' means! – jacg Apr 08 '20 at 17:04
  • @DanielJunglas I think I might understand where some of your confusion comes from: perhaps you think that *all* consumers are overloaded like `max` and `min`: `max((1,2)) == max(1,2)` [note: fewer parentheses in the second case]. `max` is unusual in this respect: even `sum` (which you used in your first example) cannot be used in this way: `sum((1,2)) == 3` but `sum(1,2)` is an error. So your example of `tmp = sum(tmp, current_value)` is also an error. The vast majority fall into this category: `list`, `tuple`, `set`, `dict`, `collections.Counter`, `enumerate`, `partial(map, fn)`, etc. etc. – jacg Apr 08 '20 at 22:00
  • Thanks a lot for the explanation. My example for `sum` was not written correctly. It should have read `sum([current_value], tmp)` or `sum([current_value, tmp])`, the latter being the same as `sum((current_value, tmp))`. And you are right, there are many potential consumers that don't fit the pattern of using an initial value. – Daniel Junglas Apr 09 '20 at 05:12
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/211262/discussion-between-daniel-junglas-and-jacg). – Daniel Junglas Apr 09 '20 at 05:12

2 Answers


Here is an alternative implementation of your idea. It uses cooperative multi-threading. As you suggested, the key point is to use multi-threading and to have the iterator's __next__ method block until all threads have consumed the current item.

In addition, the iterator contains an (optional) buffer of constant size. With this buffer we can read the source in chunks and avoid a lot of the locking/synchronization.

My implementation also handles the case in which some consumers stop iterating before reaching the end of the iterator.

import threading

class BufferedMultiIter:
    def __init__(self, source, n, bufsize = 1):
        '''`source` is an iterator or iterable,
        `n` is the number of threads that will interact with this iterator,
        `bufsize` is the size of the internal buffer. The iterator will read
        and buffer elements from `source` in chunks of `bufsize`. The bigger
        the buffer is, the better the performance but also the bigger the
        (constant) space requirement.
        '''
        self._source = iter(source)
        self._n = n
        # Condition variable for synchronization
        self._cond = threading.Condition()
        # Buffered values
        bufsize = max(bufsize, 1)
        self._buffer = [None] * bufsize
        self._buffered = 0
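        # Per-thread position of the next element to be read from the buffer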
        self._next = threading.local()
        # State variables to implement the "wait for buffer to get refilled"
        # protocol
        self._serial = 0
        self._waiting = 0

        # True if we reached the end of the source
        self._stop = False
        # Was the thread killed (for error handling)?
        self._killed = False

    def _fill_buffer(self):
        '''Refill the internal buffer.'''
        self._buffered = 0
        while self._buffered < len(self._buffer):
            try:
                self._buffer[self._buffered] = next(self._source)
                self._buffered += 1
            except StopIteration:
                self._stop = True
                break
        # Explicitly clear the unused part of the buffer to release
        # references as early as possible
        for i in range(self._buffered, len(self._buffer)):
            self._buffer[i] = None
        self._waiting = 0
        self._serial += 1

    def register_thread(self):
        '''Register a thread.

        Each thread that wants to access this iterator must first register
        with the iterator. It is an error to register the same thread more
        than once. It is an error to access this iterator with a thread that
        was not registered (with the exception of calling `kill`). It is an
        error to register more threads than the number that was passed to the
        constructor.
        '''
        self._next.i = 0

    def unregister_thread(self):
        '''Unregister a thread from this iterator.

        This should be called when a thread is done using the iterator.
        It catches the case in which a consumer does not consume all the
        elements from the iterator but exits early.
        '''
        assert hasattr(self._next, 'i')
        delattr(self._next, 'i')
        with self._cond:
            assert self._n > 0
            self._n -= 1
            if self._waiting == self._n:
                self._fill_buffer()
            self._cond.notify_all()

    def kill(self):
        '''Forcibly kill this iterator.

        This will wake up all threads currently blocked in `__next__` and
        will have them raise a `StopIteration`.
        This function should be called in case of error to terminate all
        threads as fast as possible.
        '''
        with self._cond:
            self._killed = True
            self._stop = True
            self._cond.notify_all()

    def __iter__(self): return self
    def __next__(self):
        if self._next.i == self._buffered:
            # We read everything from the buffer.
            # Wait until all other threads have also consumed the buffer
            # completely and then refill it.
            with self._cond:
                old = self._serial
                self._waiting += 1
                if self._waiting == self._n:
                    self._fill_buffer()
                    self._cond.notify_all()
                else:
                    # Wait until the serial number changes. A change in
                    # serial number indicates that another thread has filled
                    # the buffer
                    while self._serial == old and not self._killed:
                        self._cond.wait()
            # Start at beginning of newly filled buffer
            self._next.i = 0

        if self._killed:
            raise StopIteration
        k = self._next.i
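        # The source is exhausted and this thread has fully consumed the
        # last (partial) buffer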
        if k == self._buffered and self._stop:
            raise StopIteration
        value = self._buffer[k]
        self._next.i = k + 1
        return value

class NotAll:
    '''A consumer that does not consume all the elements from the source.'''
    def __init__(self, limit):
        self._limit = limit
        self._consumed = 0
    def __call__(self, it):
        last = None
        for k in it:
            last = k
            self._consumed += 1
            if self._consumed >= self._limit:
                break
        return last

def multi_iter(iterable, *consumers, **kwargs):
    '''Iterate using multiple consumers.

    Each value in `iterable` is presented to each of the `consumers`.
    The function returns a tuple with the results of all `consumers`.

    There is an optional `bufsize` argument. This controls the internal
    buffer size. The bigger the buffer, the better the performance, but also
    the bigger the (constant) space requirement of the operation.

    NOTE: This will spawn a new thread for each consumer! The iteration is
    multi-threaded and happens in parallel for each element.
    '''
    n = len(consumers)
    it = BufferedMultiIter(iterable, n, kwargs.get('bufsize', 1))
    threads = list() # List with **running** threads
    result = [None] * n
    def thread_func(i, c):
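        # Runs on its own thread: register with the iterator, run the
        # consumer to completion, record its result, then unregister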
        it.register_thread()
        result[i] = c(it)
        it.unregister_thread()
    try:
        for c in consumers:
            t = threading.Thread(target = thread_func, args = (len(threads), c))
            t.start()
            threads.append(t)
    except:
        # Here we should forcibly kill all the threads, but there is no
        # t.kill() function or similar. So the best we can do is stop the
        # iterator.
        it.kill()
    finally:
        while len(threads) > 0:
            t = threads.pop(-1)
            t.join()
    return tuple(result)

from time import time
N = 10 ** 7
start1 = time()
res1 = (min(range(N)), max(range(N)), sum(range(N)), NotAll(1)(range(N)),
        NotAll(1000)(range(N)))
stop1 = time()
print('5 iterators: %s %.2f' % (str(res1), stop1 - start1))

for p in range(5):
    start2 = time()
    res2 = multi_iter(range(N), min, max, sum, NotAll(1), NotAll(1000),
                      bufsize = 2**p)
    stop2 = time()
    print('multi_iter%d: %s %.2f' % (p, str(res2), stop2 - start2))

The timings are again horrible, but you can see how using a constant-size buffer improves things significantly:

5 iterators: (0, 9999999, 49999995000000, 0, 999) 0.71
multi_iter0: (0, 9999999, 49999995000000, 0, 999) 342.36
multi_iter1: (0, 9999999, 49999995000000, 0, 999) 264.71
multi_iter2: (0, 9999999, 49999995000000, 0, 999) 151.06
multi_iter3: (0, 9999999, 49999995000000, 0, 999) 95.79
multi_iter4: (0, 9999999, 49999995000000, 0, 999) 72.79

Maybe this can serve as a source of ideas for a good implementation.

Daniel Junglas
  • Thanks! You got me really excited when you mentioned *cooperative* threading ... but I can only find preemptive :-( [You're still using the `threading` module: that's preemptive; and I can't find any yields or asyncio or anything else that looks like it might implement cooperative multitasking.] I like the thread deregistration idea to solve the early termination problem. Thanks for the buffer timings, though I'm surprised by how much slower it is than mine ... without profiling, I guess there's more blocking on the CV than there was on the queues. But I'll leave that for another day. – jacg Apr 10 '20 at 00:14
  • Some measurements: With a trivial source and consumers (`range`, `min`, `max`, `sum`) but implementation slightly slowed down by instrumentation, my queues max out at about 6500 elements each, asymptotically approaching it as the source length grows. So for the maximum buffer size of 4 that you timed, there is probably going to be a *lot* of blocking. If I inject `sleep(0.001)` (1 millisecond) per item into the source, my implementation is 1% slower than plain iteration, and the queue length almost never exceeds 1; 0.1 ms delay -> 20% slowdown, queue length < 10. – jacg Apr 10 '20 at 15:00
  • I guess it depends on the exact definition of "cooperative". In my code a thread always runs until the full buffer is read. At that point it is logically blocked and voluntarily relinquishes the CPU (cooperative). So at the "Python layer" it is never preempted. Anyway, it is not what you were looking for. But there is something else I noticed about your queuing approach: what if you have more threads than CPUs and there is one thread that has lower priority than all others? That thread would only run after all others are complete and the queue for this thread would buffer the whole sequence? – Daniel Junglas Apr 14 '20 at 09:39
  • Where *exactly* is it that a thread "voluntarily relinquishes the CPU"? Are you confusing releasing CV with relinquishing CPU? `threading` will *preemptively* rip the CPU from under the thread which has locked a CV, other threads which need that lock will do nothing with the CPU time they are given and progress stalls for a while. I don't see any cooperative multitasking here, which, in Python, requires `yield` or `await` (`threading` is *preemptive*!) Number of CPUs is irrelevant: `threading` uses only 1 CPU (because GIL; cf. `multiprocessing`). Do thread priorities exist in `threading`? – jacg Apr 14 '20 at 17:19
  • Sorry, I was not clear about what I meant to say when I mentioned thread priorities. My point is: your queue implementation does not satisfy the "constant space" requirement. The queue for a thread may become arbitrarily large. In the worst case it has to buffer all elements from the input sequence. Arguing with thread priorities was just one example how to force this behavior. But you can actually force that in an easier way: create a consumer that does a `time.sleep(60)` before processing the first element. The queue for the thread with this consumer will buffer the full sequence. – Daniel Junglas Apr 15 '20 at 07:36
  • Yes, the queuing solution is far from ideal. Your example of one slower consumer causing the buffer to be filled, illustrates the benefits of cooperative multitasking: if each consumer would yield control after consuming one item (as opposed to the scheduler grabbing it whenever it feels like it), and the tasks were run on a round-robin schedule, the buffer size would be exactly 1, and the CPU would get 100% utilization: each task gets exactly the proportion it needs. `threading` will keep sharing the CPU 'fairly' between the busy task and the blocked ones, which is a waste of CPU time. – jacg Apr 15 '20 at 10:36
  • Are you sure that the scheduler assigns time to threads blocked in a condition variable? I am no expert in Python threading but I am sure that is not how things are implemented at the OS level. A thread that is blocked in a lock or condition variable is not in a runnable state and will not run unless explicitly woken up. I have no idea how much of this is carried over through Python. Note that you can emulate round-robin by having one condition variable/semaphore per thread. After a thread starts, it signals the condition for the next. So only one thread is ever active. – Daniel Junglas Apr 15 '20 at 12:05

Here is an implementation of the preemptive threading solution outlined in the original question.

[EDIT: There is a serious problem with this implementation. [EDIT, now fixed, using a solution inspired by Daniel Junglas.]

Consumers which do not iterate through the whole iterable will cause a space leak in the queue inside Link. For example:


def exceeds_10(iterable):
    for item in iterable:
        if item > 10:
            return True
    return False

if you use this as one of the consumers and use the source range(10**6), it will stop removing items from the queue inside Link after the first 11 items, leaving approximately 10**6 items to be accumulated in the queue!

]


class Link:

    def __init__(self, queue):
        self.queue = queue

    def __iter__(self):
        return self

    def __next__(self):
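        # Block until the producer pushes an item or signals the end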
        item = self.queue.get()
        if item is FINISHED:
            raise StopIteration
        return item

    def put(self, item):
        self.queue.put(item)

    def stop(self):
        self.queue.put(FINISHED)

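    # Called from the consumer's thread once the consumer has returned:
    # swapping in ClosedLink turns put/stop into no-ops, so the producer can
    # keep going without the queue accumulating unconsumed items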
    def consumer_not_listening_any_more(self):
        self.__class__ = ClosedLink


class ClosedLink:

    def put(self, _): pass
    def stop(self)  : pass


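# Sentinel: Link.stop() puts this on the queue to make __next__ raise
# StopIteration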
class FINISHED: pass


def make_thread(link, consumer, future):
    from threading import Thread
    return Thread(target = lambda: on_thread(link, consumer, future))

def on_thread(link, consumer, future):
    future.set_result(consumer(link))
    link.consumer_not_listening_any_more()

def source_summaries_PREEMPTIVE_THREAD(source, *consumers):
    from queue     import SimpleQueue as Queue
    from asyncio   import Future

    links   = tuple(Link(Queue()) for _ in consumers)
    futures = tuple(     Future() for _ in consumers)
    threads = tuple(map(make_thread, links, consumers, futures))

    for thread in threads:
        thread.start()

    for item in source:
        for link in links:
            link.put(item)

    for link in links:
        link.stop()

    for t in threads:
        t.join()

    return tuple(f.result() for f in futures)

It works, but (unsurprisingly) with a horrible degradation in performance:

def time(thunk):
    from time import time
    start = time()
    thunk()
    stop  = time()
    return stop - start

N = 10 ** 7
t = time(lambda: testit(source_summaries, N))
print(f'old: {N} in {t:5.1f} s')

t = time(lambda: testit(source_summaries_PREEMPTIVE_THREAD, N))
print(f'new: {N} in {t:5.1f} s')

giving

old: 10000000 in   1.2 s
new: 10000000 in  30.1 s

So, even though this is a theoretical solution, it is not a practical one[*].

Consequently, I think that this approach is a dead end, unless there's a way of persuading the consumer to yield cooperatively (as opposed to forcing it to yield preemptively) in

def on_thread(link, consumer, future):
    future.set_result(consumer(link))

... but that seems fundamentally impossible. Would love to be proven wrong.

[*] This is actually a bit harsh: the test does absolutely nothing with trivial data; if this were part of a larger computation which performed heavy calculations on the elements, then this approach could be genuinely useful.

jacg