
How can I execute a function on a CPU core, and get a callback when it has completed?


Context

I'm receiving a stream:

symbols = ['ABC', 'DFG', ...]  # 52 of these

handlers = { symbol: Handler(symbol) for symbol in symbols }

async for symbol, payload in lines:  # 600M of these
    handlers[symbol].feed(payload)

I need to make use of multiple CPU cores to speed it up.

handlers['ABC'] (for example) holds state, but its state is disjoint from that of handlers['DFG'].

Basically, I can't have two cores simultaneously operating on, say, handlers['ABC'].


My approach so far

I have come up with the following solution, but it's part pseudocode, as I can't see how to implement it.

NCORES = 4
symbol_curr_active_on_core = [None]*NCORES

NO_CORES_FREE = -1
def first_free_core():
    for i, symbol in enumerate(symbol_curr_active_on_core):
        if not symbol:
            return i
    return NO_CORES_FREE

for symbol, payload in lines:
    # wait for avail core to handle it

    while True:
        sleep(0.001)
        if first_free_core() == NO_CORES_FREE:
            continue
        if symbol in symbol_curr_active_on_core:
            continue
        core = first_free_core()
        symbol_curr_active_on_core[core] = symbol

        cores[core].execute(
            handlers[symbol].feed(payload),
            on_complete=lambda core_index: \
                symbol_curr_active_on_core[core_index] = None
        )
        break  # move on to the next (symbol, payload) once this one has been scheduled

So my question is specifically: How to convert that last statement into working Python code?

        cores[core].execute(
            handlers[symbol].feed(payload),
            on_complete=lambda core_index: \
                symbol_curr_active_on_core[core_index] = None
        )

PS More generally, is my approach optimal?

P i
  • I can understand why you do not want two separate processes working on the same symbol. But why can't two different processes, processing their distinct sets of symbols, be scheduled to run on the same core, assuming that these processes are isolated from one another? – Booboo Jul 24 '21 at 11:47
  • If I partition my symbols between processes, I lose efficiency through variance in execution-times. But that's what I've done now, and it works a treat! – P i Jul 24 '21 at 12:40
  • If you have 4 processes and each is ready to run, i.e. not waiting for I/O to complete for example, and you have at least 4 physical cores *not running other work*, then they will all run on 4 different cores in parallel (this is all a big *if*). BUT a given process is not guaranteed to always run on the same core when it is dispatched. As far as I know, there is no way in Python to specify a CPU core affinity specifying that a given process can only run on a specific core. And it would be self-defeating performance-wise for you to specify such an affinity if you could. – Booboo Jul 24 '21 at 12:52
  • But it sounds like you don't even require that the same process always processes the same symbol. Did I get that right? – Booboo Jul 24 '21 at 12:55

2 Answers


The following approach should be feasible assuming:

  1. Your Handler class can be "pickled" (a quick round-trip check is sketched after this list), and
  2. The Handler class does not carry so much state that serializing it to and from each worker invocation becomes prohibitively expensive.
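For assumption 1, a quick sanity check is to round-trip an instance through pickle. This sketch uses the toy Handler from the code below; substitute your real class:

import pickle

h = Handler('ABC')
h.feed(1)
restored = pickle.loads(pickle.dumps(h))  # raises if Handler cannot be pickled
assert restored.symbol == 'ABC' and restored.counter == 1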

The main process creates a handlers dictionary where the key is one of the 52 symbols and the value is a dictionary with two keys: 'handler' whose value is the handler for the symbol and 'processing' whose value is either True or False according to whether a process is currently processing one or more payloads for that symbol.

Each process in the pool is initialized with another queue_dict dictionary whose key is one of the 52 symbols and whose value is a multiprocessing.Queue instance that will hold payload instances to be processed for that symbol.

The main process iterates over the input lines to get the next symbol/payload pair. The payload is enqueued onto the appropriate queue for the current symbol. The handlers dictionary is then consulted: the processing flag for the current symbol indicates whether a task has already been submitted to the pool to run that symbol's handler. If the flag is True, nothing further needs to be done. Otherwise, the flag is set to True and apply_async is invoked, passing the handler for that symbol as an argument.

A count of enqueued tasks (i.e. payloads) is maintained and is incremented every time the main process writes a payload to one of the 52 handler queues. The worker function specified as the argument to apply_async takes its handler argument and from that deduces the queue that requires processing. For every payload it finds on the queue, it invokes the handler's feed method. It then returns a tuple consisting of the updated handler and the number of payload messages removed from the queue. The callback function for apply_async (1) updates the handler in the handlers dictionary, (2) resets the processing flag for the symbol to False, and (3) decrements the count of enqueued tasks by the number of payloads that were removed.

There is one race to be aware of. When the main process enqueues a payload and sees that the processing flag for that symbol is already True, it does not submit a new task via apply_async. But there is a small window where the worker handling that symbol has already drained its queue and is about to return (or has already returned) and the callback simply has not yet reset the processing flag to False. In that scenario the payload sits unprocessed on the queue until the next payload for that symbol is read from the input; if no further input lines arrive for that symbol, the payload would remain unprocessed after all tasks complete. A non-zero count of enqueued tasks tells us this has happened. So rather than implementing a complicated multiprocessing synchronization protocol, it is simpler to detect the situation and handle it by creating a new pool and submitting one final task per symbol to drain each of the 52 queues.

from multiprocessing import Pool, Queue
import time
from queue import Empty
from threading import Lock

# This class needs to be Pickle-able:
class Handler:
    def __init__(self, symbol):
        self.symbol = symbol
        self.counter = 0

    def feed(self, payload):
        # For testing just increment counter by payload:
        self.counter += payload


def init_pool(the_queue_dict):
    global queue_dict
    queue_dict = the_queue_dict


def worker(handler):
    symbol = handler.symbol
    q = queue_dict[symbol]
    tasks_removed = 0
    while True:
        try:
            payload = q.get_nowait()
            handler.feed(payload)
            tasks_removed += 1
        except Empty:
            break
    # return updated handler:
    return handler, tasks_removed

def callback_result(result):
    global queued_tasks
    global lock

    handler, tasks_removed = result
    # Store the updated handler and then mark this symbol as no longer being processed:
    d = handlers[handler.symbol]
    # The order of the next two statements matters:
    d['handler'] = handler
    d['processing'] = False
    with lock:
        queued_tasks -= tasks_removed

def main():
    global handlers
    global lock
    global queued_tasks

    symbols = [
        'A','B','C','D','E','F','G','H','I','J','K','L','M','AA','BB','CC','DD','EE','FF','GG','HH','II','JJ','KK','LL','MM',
        'a','b','c','d','e','f','g','h','i','j','k','l','m','aa','bb','cc','dd','ee','ff','gg','hh','ii','jj','kk','ll','mm'
    ]

    queue_dict = {symbol: Queue() for symbol in symbols}

    handlers = {symbol: {'processing': False, 'handler': Handler(symbol)} for symbol in symbols}

    lines = [
        ('A',1),('B',1),('C',1),('D',1),('E',1),('F',1),('G',1),('H',1),('I',1),('J',1),('K',1),('L',1),('M',1),
        ('AA',1),('BB',1),('CC',1),('DD',1),('EE',1),('FF',1),('GG',1),('HH',1),('II',1),('JJ',1),('KK',1),('LL',1),('MM',1),
        ('a',1),('b',1),('c',1),('d',1),('e',1),('f',1),('g',1),('h',1),('i',1),('j',1),('k',1),('l',1),('m',1),
        ('aa',1),('bb',1),('cc',1),('dd',1),('ee',1),('ff',1),('gg',1),('hh',1),('ii',1),('jj',1),('kk',1),('ll',1),('mm',1)
    ]


    def get_lines():
        # Emulate 52_000 lines:
        for _ in range(10_000):
            for line in lines:
                yield line

    POOL_SIZE = 4

    queued_tasks = 0
    lock = Lock()

    # Create pool of POOL_SIZE processes:
    pool = Pool(POOL_SIZE, initializer=init_pool, initargs=(queue_dict,))
    for symbol, payload in get_lines():
        # Put some limit on memory utilization:
        while queued_tasks > 10_000:
            time.sleep(.001)
        d = handlers[symbol]
        q = queue_dict[symbol]
        q.put(payload)
        with lock:
            queued_tasks += 1
        if not d['processing']:
            d['processing'] = True
            handler = d['handler']
            pool.apply_async(worker, args=(handler,), callback=callback_result)
    # Wait for all tasks to complete
    pool.close()
    pool.join()

    if queued_tasks:
        # Re-create pool:
        pool = Pool(POOL_SIZE, initializer=init_pool, initargs=(queue_dict,))
        for d in handlers.values():
            handler = d['handler']
            d['processing'] = True
            pool.apply_async(worker, args=(handler,), callback=callback_result)
        pool.close()
        pool.join()
        assert queued_tasks == 0

    # Print results:
    for d in handlers.values():
        handler = d['handler']
        print(handler.symbol, handler.counter)


if __name__ == "__main__":
    main()

Prints:

A 10000
B 10000
C 10000
D 10000
E 10000
F 10000
G 10000
H 10000
I 10000
J 10000
K 10000
L 10000
M 10000
AA 10000
BB 10000
CC 10000
DD 10000
EE 10000
FF 10000
GG 10000
HH 10000
II 10000
JJ 10000
KK 10000
LL 10000
MM 10000
a 10000
b 10000
c 10000
d 10000
e 10000
f 10000
g 10000
h 10000
i 10000
j 10000
k 10000
l 10000
m 10000
aa 10000
bb 10000
cc 10000
dd 10000
ee 10000
ff 10000
gg 10000
hh 10000
ii 10000
jj 10000
kk 10000
ll 10000
mm 10000

Booboo

This is far from the only (or probably even the "best") approach, but based on my comment on your other post, here's an example of having specific child processes handle specific "symbols".

from multiprocessing import Process, Queue
from queue import Empty
from math import ceil

class STOPFLAG: pass

class Handler:
    def __init__(self, symbol):
        self.counter = 0 #maintain some state for each "Handler"
        self.symbol = symbol

    def feed(self, payload):
        self.counter += payload
        return self.counter

class Worker(Process):
    def __init__(self, out_q):
        self.handlers = {}
        self.in_q = Queue()
        self.out_q = out_q
        super().__init__()

    def run(self):
        while True:
            try:
                item = self.in_q.get(timeout=1)  # wait up to 1s for the next [symbol, payload] pair
            except Empty:
                pass  # put break here if you always expect items to be available and a timeout "shouldn't" happen
            else:
                if isinstance(item, STOPFLAG):
                    # pass back the handlers with their now modified state
                    self.out_q.put(self.handlers)
                    break
                else:
                    symbol, payload = item
                    self.handlers[symbol].feed(payload)
def main():
    n_workers = 4
    # Just 8 for testing:
    symbols = ['ABC', 'DEF', 'GHI', 'JKL', 'MNO', 'PQR', 'STU', 'VWX']

    workers = []
    out_q = Queue()
    for i in range(n_workers):
        workers.append(Worker(out_q))
    symbol_worker_mapping = {}
    for i, symbol in enumerate(symbols):
        workers[i%n_workers].handlers[symbol] = Handler(symbol)
        symbol_worker_mapping[symbol] = i%n_workers

    for worker in workers: worker.start() #start processes

    # Just a few for testing:
    lines = [
        ('ABC', 1),
        ('DEF', 1),
        ('GHI', 1),
        ('JKL', 1),
        ('MNO', 1),
        ('PQR', 1),
        ('STU', 1),
        ('VWX', 1),
        ('ABC', 1),
        ('DEF', 1),
        ('GHI', 1),
        ('JKL', 1),
        ('MNO', 1),
        ('PQR', 1),
        ('STU', 1),
        ('VWX', 1),
    ]
    #putting this loop in a thread could allow results to be collected while inputs are still being fed in.
    for symbol, payload in lines: #feed in tasks
        worker = workers[symbol_worker_mapping[symbol]] #select the correct worker
        worker.in_q.put([symbol, payload]) #pass the inputs

    results = [] #results are handler dicts from each worker
    for worker in workers:
        worker.in_q.put(STOPFLAG()) #Send stop signal to each worker
        results.append(out_q.get()) #get results (may be out of order)

    for worker in workers: worker.join() #cleanup
    for result in results:
        for symbol, handler in result.items():
            print(symbol, handler.counter)


if __name__ == "__main__":
    main()

Each child process handles a subset of "symbols", and each gets its own input queue. This is different from a normal pool, where each child is identical and they all share a single input queue, with the next available child taking the next input. They all then put results onto a shared output queue back to the main process.
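As the comment in the feeding loop notes, that loop could also be moved onto a thread, so the main process is free to block on out_q while payloads are still being queued. A minimal sketch of that variant, reusing Worker, STOPFLAG, workers, symbol_worker_mapping, out_q and lines from the code above:

from threading import Thread

def feed(workers, symbol_worker_mapping, lines):
    for symbol, payload in lines:
        workers[symbol_worker_mapping[symbol]].in_q.put([symbol, payload])
    for worker in workers:
        worker.in_q.put(STOPFLAG())  # signal "no more input" once everything has been queued

# inside main(), after the workers have been started:
feeder = Thread(target=feed, args=(workers, symbol_worker_mapping, lines))
feeder.start()
results = [out_q.get() for _ in workers]  # the main thread collects results as workers finish
feeder.join()
for worker in workers:
    worker.join()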

An entirely different solution might be to hold all the state in the main process, maintain a lock for each symbol, and hold that lock from the moment the necessary state is sent to the worker until the results are received and the state in the main process has been updated.
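A minimal, self-contained sketch of that lock-per-symbol idea (the names work and make_callback are made up for illustration, and Handler and lines are toy stand-ins for the question's objects):

from multiprocessing import Pool
from threading import Lock

class Handler:
    def __init__(self, symbol):
        self.symbol = symbol
        self.counter = 0

    def feed(self, payload):
        self.counter += payload

def work(handler, payload):
    # Runs in a worker process on a pickled copy of the handler.
    handler.feed(payload)
    return handler  # ship the updated state back to the main process

def main():
    symbols = ['ABC', 'DEF']
    lines = [('ABC', 1), ('DEF', 1), ('ABC', 1)]   # stand-in for the real stream
    handlers = {s: Handler(s) for s in symbols}    # all state lives in the main process
    locks = {s: Lock() for s in symbols}           # one lock per symbol

    def make_callback(symbol):
        def callback(updated_handler):
            handlers[symbol] = updated_handler     # runs in the pool's result-handler thread
            locks[symbol].release()                # this symbol may now be dispatched again
        return callback

    with Pool(4) as pool:
        for symbol, payload in lines:
            locks[symbol].acquire()                # blocks if this symbol is already in flight
            pool.apply_async(work, (handlers[symbol], payload),
                             callback=make_callback(symbol))
        pool.close()
        pool.join()

    for h in handlers.values():
        print(h.symbol, h.counter)

if __name__ == "__main__":
    main()

Note that the acquire() call stalls the whole feeding loop whenever a symbol is already in flight, which is one reason the queue-based designs above may scale better.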

Aaron
  • Is it guaranteed that each process uses a different core? – P i Jul 23 '21 at 20:48
  • Processes will often jump around between cores at the discretion of the OS scheduler. Python does not have an easy way of telling the OS to keep a process on a particular physical core, but that usually does not matter, as the OS tries to manage the context switching relatively efficiently. – Aaron Jul 26 '21 at 12:58