
I have some code that mimics a REST API call (see below).

For every key in each item from the generator, it needs to run a REST call. So in my example, a record could be

{"a": 2, "b": 36, "c": 77}

I need to run a REST call for every key (a, b, and c) individually, then output the results (the call just negates the number):

{"a": 2, "a_neg": -2, "b": 36, "b_neg": -36, "c": 77, "c_neg": -77}

My current code works for one key, but with multiple keys it repeats the items (so I'm getting triple the results for 3 keys).

There is also some funky race condition. I guess I could keep only the last record, but I'm not good with threads and I'm concerned about thread safety and other advanced pitfalls.

Here is an example output:

{'a': 89, 'a_neg': -89, 'b': 69, 'c': 38}
{'a': 89, 'a_neg': -89, 'b': 69, 'c': 38, 'c_neg': -38}
{'a': 89, 'a_neg': -89, 'b': 69, 'b_neg': -69, 'c': 38, 'c_neg': -38}
{'a': 90, 'a_neg': -90, 'b': 43, 'c': 16}
{'a': 90, 'a_neg': -90, 'b': 43, 'c': 16, 'c_neg': -16}
{'a': 90, 'a_neg': -90, 'b': 43, 'b_neg': -43, 'c': 16, 'c_neg': -16}
{'a': 91, 'a_neg': -91, 'b': 49, 'b_neg': -49, 'c': 77, 'c_neg': -77}
{'a': 91, 'a_neg': -91, 'b': 49, 'b_neg': -49, 'c': 77, 'c_neg': -77}
{'a': 91, 'a_neg': -91, 'b': 49, 'b_neg': -49, 'c': 77, 'c_neg': -77}

Finally, here is my source code (you can run it yourself):

#!/usr/bin/env python

from concurrent.futures import ThreadPoolExecutor
from time import sleep
from pprint import pprint
import random

def records():
    # simulates records generator
    for i in range(100):
        yield {"a": i, "b": random.randint(0,100), "c": random.randint(0,100)}

def stream(records):
    threads = 8
    pool = ThreadPoolExecutor(threads)

    def rest_api_lookup(record_dict):
        # simulates REST call :)
        sleep(0.1)
        key = record_dict["key"]
        record = record_dict["record"]

        record[key + "_neg"] = -record[key]

        return record

    def thread(records):
        chunk = []
        for record in records:
            for key in record:
                chunk.append(pool.submit(rest_api_lookup, {"record": record, "key": key}))

            if len(chunk) == threads:
                yield chunk
                chunk = []

        if chunk:
            yield chunk

    def unchunk(chunk_gen):
        """Flattens a generator of Future chunks into a generator of Future results."""
        for chunk in chunk_gen:
            for f in chunk:
                yield f.result() # get result from Future

    # Now iterate over all results in same order as records
    for result in unchunk(thread(records)):
        #yield result
        pprint(result)

stream(records())
hobbes3
  • Try iterating over the dict keys in `rest_api_lookup`, not in `thread`. The requests for each key will execute sequentially, but you'll still run the stream of records concurrently at least (see the sketch after these comments). – dnswlt Feb 15 '18 at 22:09
  • But in reality the REST API isn't capable of doing 3 inputs in one call. That's why I still need to do a call per key. – hobbes3 Feb 15 '18 at 22:10
  • That's what I'm saying. Loop over the keys, make one rest call after the other, per key. – dnswlt Feb 15 '18 at 22:11
  • That would slow things down if I have a lot of keys (since it won't be concurrent per item). Is it safe to somehow dedup in `unchunk`? – hobbes3 Feb 15 '18 at 22:18
  • Is it safe to do `for i, result in enumerate(unchunk(thread(records))): if (i + 1) % 3 == 0: pprint(result)` inside `unchunk()`? Or do I need to make sure all the threads have finished first? – hobbes3 Feb 16 '18 at 01:22
  • No, it's not safe. How many keys do you expect to be in the dict? If the list of records is long, and the number of keys per record rather small, then try what I proposed. If it's the other way around, the code should be restructured. Did you already measure that it's "slow"? – dnswlt Feb 16 '18 at 06:35
  • In a realistic use case, it will be hundreds, maybe thousands of records, and only a few keys. Your proposal of a simple loop over the keys will be expensive for every key. For example, 2 keys will take twice as long, 3 keys will take three times as long, etc. And if it takes about a minute to do 1000 records for a single key, then I don't want to wait another minute if there are 2 keys. – hobbes3 Feb 16 '18 at 07:11
  • Look, if you have N threads and K*N chunks, it does not really matter whether a thread processes all keys sequentially or any other items of the chunk. As long as each thread is busy, you'll be fine. An alternative would be asyncio, but you should really measure whether you need this. A couple of thousand items doesn't sound like much at all. – dnswlt Feb 16 '18 at 07:43
  • OK, I finally get what you're saying... Sorry, I didn't understand until I tested your idea against my (bad) idea of using modulo to dedup in `unchunk()`. You're saying that as long as there are far more records than keys, it's OK: each thread's task takes longer, but it cancels out since I need fewer threads anyway. Thanks for your help! – hobbes3 Feb 18 '18 at 04:42
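
For context, here is a minimal sketch of the restructuring dnswlt proposes in the comments (my reading of the discussion; the function name is mine): loop over the keys inside the worker, so one record's calls run one after another while the records themselves still run concurrently:

from time import sleep

def rest_api_lookup_all_keys(record):
    # one pool task per record; the per-key "REST calls" run sequentially inside it
    for key in list(record):
        sleep(0.1)  # simulates one REST call per key
        record[key + "_neg"] = -record[key]
    return record

Each record then becomes a single future (pool.submit(rest_api_lookup_all_keys, record)), so nothing is ever yielded half-finished; the trade-off discussed above is that the keys of one record are no longer processed concurrently.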

1 Answer


The 1st issue here is that you're looping over the keys of a record that grows while the worker threads add `_neg` keys to it, so iterate over a snapshot instead:

for key in list(record):  # make a copy of the keys!
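
To see why, a minimal single-threaded sketch (an illustration, not the question's code): in Python 3, adding keys to a dict while iterating over it raises an error, and in the question's code the worker threads add the `_neg` keys to `record` while `thread()` may still be looping over it, which is also the likely source of the "funky race condition":

d = {"a": 1}
for k in d:
    d[k + "_neg"] = -d[k]
# RuntimeError: dictionary changed size during iteration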

I think the 2nd issue is that you have 3 keys and 8 threads: len(chunk) grows in steps of 3 (3, 6, 9, ...) while threads is 8, so the following condition is never met and everything piles up into one giant final chunk:

        if len(chunk) == threads:  # try len(chunk) >= threads
            yield chunk
            chunk = []

The last issue is that you yield incomplete records before all of their threads have finished. Here is a possible fix:

def unchunk(chunk_gen):
    """Flattens a generator of Future chunks into a generator of completed records."""
    old_res = None  # track the current record across chunk boundaries
    for chunk in chunk_gen:
        for f in chunk:
            res = f.result()  # get result from Future
            if old_res is not None and res is not old_res:
                yield old_res  # a new record has started, so the previous one is complete
            old_res = res
    if old_res is not None:
        yield old_res  # don't lose the final record
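
Putting the three fixes together, a runnable sketch of the whole script (same structure as the question's code; the deduplication relies on object identity, since all futures for one record return the same dict):

#!/usr/bin/env python

from concurrent.futures import ThreadPoolExecutor
from time import sleep
from pprint import pprint
import random

def records():
    # simulates records generator
    for i in range(100):
        yield {"a": i, "b": random.randint(0, 100), "c": random.randint(0, 100)}

def stream(records):
    threads = 8
    pool = ThreadPoolExecutor(threads)

    def rest_api_lookup(record_dict):
        # simulates REST call :)
        sleep(0.1)
        key = record_dict["key"]
        record = record_dict["record"]
        record[key + "_neg"] = -record[key]
        return record

    def thread(records):
        chunk = []
        for record in records:
            for key in list(record):  # fix 1: iterate over a snapshot of the keys
                chunk.append(pool.submit(rest_api_lookup, {"record": record, "key": key}))
            if len(chunk) >= threads:  # fix 2: >= instead of ==
                yield chunk
                chunk = []
        if chunk:
            yield chunk

    def unchunk(chunk_gen):
        old_res = None  # fix 3: dedupe across chunk boundaries
        for chunk in chunk_gen:
            for f in chunk:
                res = f.result()
                if old_res is not None and res is not old_res:
                    yield old_res
                old_res = res
        if old_res is not None:
            yield old_res

    for result in unchunk(thread(records)):
        pprint(result)

stream(records())

With these changes each record should be printed exactly once, with all of its `_neg` keys filled in.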
Yoav Glazner
  • Hmm, although `record` grows, `key` never seems to reach the `*_neg` (output) keys, but I'll use `list(record)` just to be safe anyway. I'll also make it `len(chunk) >=` instead of `==` to be safe, but `chunk` grows one by one so this wasn't a problem. – hobbes3 Feb 16 '18 at 07:20
  • @hobbes3 - I've updated my answer after running your code; it seems to run fine now. – Yoav Glazner Feb 16 '18 at 10:16