The following approach should be feasible assuming:
- Your `Handler` class can be pickled, and
- The `Handler` class does not carry so much state that serializing it to and from each worker invocation becomes prohibitively expensive.
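You can verify the first assumption up front by round-tripping an instance through `pickle` before committing to this design. The `Handler` below is just a minimal stand-in for your actual class:

```python
import pickle

# Minimal stand-in for the real Handler class, used only to
# demonstrate the round-trip check:
class Handler:
    def __init__(self, symbol):
        self.symbol = symbol
        self.counter = 0

    def feed(self, payload):
        self.counter += payload

h = Handler('A')
h.feed(3)
# If pickling raises here (e.g. the handler holds an open file,
# socket, or lock), this whole approach needs rethinking:
h2 = pickle.loads(pickle.dumps(h))
assert (h2.symbol, h2.counter) == ('A', 3)
```

If your real class fails this check, you would either have to slim down its state or implement `__getstate__`/`__setstate__` to control what gets serialized.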
The main process creates a `handlers` dictionary where the key is one of the 52 symbols and the value is a dictionary with two keys: `'handler'`, whose value is the handler for the symbol, and `'processing'`, whose value is either `True` or `False` according to whether a process is currently processing one or more payloads for that symbol.
Each process in the pool is initialized with another dictionary, `queue_dict`, whose key is one of the 52 symbols and whose value is a `multiprocessing.Queue` instance that will hold payload instances to be processed for that symbol.
The main process iterates over each line of the input to get the next symbol/payload pair. The payload is enqueued onto the appropriate queue for the current symbol. The `handlers` dictionary is then consulted to determine, by inspecting the `processing` flag for the current symbol, whether a task has already been submitted to the processing pool to run the symbol-specific handler. If this flag is `True`, nothing further need be done. Otherwise, the `processing` flag is set to `True` and `apply_async` is invoked, passing the handler for this symbol as an argument.
A count of enqueued tasks (i.e. payloads) is maintained and is incremented every time the main process writes a payload to one of the 52 handler queues. The worker function specified as the argument to `apply_async` takes its handler argument and from that deduces the queue that requires processing. For every payload it finds on the queue, it invokes the handler's `feed` method. It then returns a tuple consisting of the updated handler and a count of the number of payload messages that were removed from the queue. The callback function for the `apply_async` method (1) updates the handler in the `handlers` dictionary, (2) resets the `processing` flag for the appropriate symbol to `False`, and (3) decrements the count of enqueued tasks by the number of payload messages that were removed. Note that the callback runs in a thread of the main process, which is why the count is protected by a `threading.Lock`.
When the main process, after enqueuing a payload, checks whether a process is currently running a handler for this symbol, sees that the `processing` flag is `True`, and on that basis does not submit a new task via `apply_async`, there is a small window in which that worker has already finished processing all of the payloads on its queue and is about to return, or has already returned and the callback function has simply not yet set the `processing` flag back to `False`. In that scenario the payload will sit unprocessed on the queue until the next payload for that symbol is read from the input and processed. But if there are no further input lines for that symbol, then when all tasks have completed we will be left with unprocessed payloads. We will, however, also have a non-zero count of enqueued tasks, which tells us we are in this situation. So rather than implementing a complicated multiprocessing synchronization protocol, it is simpler to detect this situation and handle it by creating a new pool and submitting one task per handler, which drains each of the 52 queues.
from multiprocessing import Pool, Queue
import time
from queue import Empty
from threading import Lock

# This class needs to be pickle-able:
class Handler:
    def __init__(self, symbol):
        self.symbol = symbol
        self.counter = 0

    def feed(self, payload):
        # For testing, just increment the counter by the payload:
        self.counter += payload

def init_pool(the_queue_dict):
    global queue_dict

    queue_dict = the_queue_dict

def worker(handler):
    symbol = handler.symbol
    q = queue_dict[symbol]
    tasks_removed = 0
    while True:
        try:
            payload = q.get_nowait()
            handler.feed(payload)
            tasks_removed += 1
        except Empty:
            break
    # Return the updated handler:
    return handler, tasks_removed

def callback_result(result):
    global queued_tasks
    global lock

    handler, tasks_removed = result
    # Show we are done processing this symbol by updating the handler state:
    d = handlers[handler.symbol]
    # The order of the next two statements matters:
    d['handler'] = handler
    d['processing'] = False
    with lock:
        queued_tasks -= tasks_removed

def main():
    global handlers
    global lock
    global queued_tasks

    symbols = [
        'A','B','C','D','E','F','G','H','I','J','K','L','M','AA','BB','CC','DD','EE','FF','GG','HH','II','JJ','KK','LL','MM',
        'a','b','c','d','e','f','g','h','i','j','k','l','m','aa','bb','cc','dd','ee','ff','gg','hh','ii','jj','kk','ll','mm'
    ]
    queue_dict = {symbol: Queue() for symbol in symbols}
    handlers = {symbol: {'processing': False, 'handler': Handler(symbol)} for symbol in symbols}
    lines = [
        ('A',1),('B',1),('C',1),('D',1),('E',1),('F',1),('G',1),('H',1),('I',1),('J',1),('K',1),('L',1),('M',1),
        ('AA',1),('BB',1),('CC',1),('DD',1),('EE',1),('FF',1),('GG',1),('HH',1),('II',1),('JJ',1),('KK',1),('LL',1),('MM',1),
        ('a',1),('b',1),('c',1),('d',1),('e',1),('f',1),('g',1),('h',1),('i',1),('j',1),('k',1),('l',1),('m',1),
        ('aa',1),('bb',1),('cc',1),('dd',1),('ee',1),('ff',1),('gg',1),('hh',1),('ii',1),('jj',1),('kk',1),('ll',1),('mm',1)
    ]

    def get_lines():
        # Emulate 52_000 lines:
        for _ in range(10_000):
            for line in lines:
                yield line

    POOL_SIZE = 4

    queued_tasks = 0
    lock = Lock()
    # Create a pool of POOL_SIZE processes:
    pool = Pool(POOL_SIZE, initializer=init_pool, initargs=(queue_dict,))
    for symbol, payload in get_lines():
        # Put some limit on memory utilization:
        while queued_tasks > 10_000:
            time.sleep(.001)
        d = handlers[symbol]
        q = queue_dict[symbol]
        q.put(payload)
        with lock:
            queued_tasks += 1
        if not d['processing']:
            d['processing'] = True
            handler = d['handler']
            pool.apply_async(worker, args=(handler,), callback=callback_result)
    # Wait for all tasks to complete:
    pool.close()
    pool.join()
    if queued_tasks:
        # Re-create the pool and drain any queues with leftover payloads:
        pool = Pool(POOL_SIZE, initializer=init_pool, initargs=(queue_dict,))
        for d in handlers.values():
            handler = d['handler']
            d['processing'] = True
            pool.apply_async(worker, args=(handler,), callback=callback_result)
        pool.close()
        pool.join()
    assert queued_tasks == 0

    # Print the results:
    for d in handlers.values():
        handler = d['handler']
        print(handler.symbol, handler.counter)

if __name__ == "__main__":
    main()
Prints:
A 10000
B 10000
C 10000
D 10000
E 10000
F 10000
G 10000
H 10000
I 10000
J 10000
K 10000
L 10000
M 10000
AA 10000
BB 10000
CC 10000
DD 10000
EE 10000
FF 10000
GG 10000
HH 10000
II 10000
JJ 10000
KK 10000
LL 10000
MM 10000
a 10000
b 10000
c 10000
d 10000
e 10000
f 10000
g 10000
h 10000
i 10000
j 10000
k 10000
l 10000
m 10000
aa 10000
bb 10000
cc 10000
dd 10000
ee 10000
ff 10000
gg 10000
hh 10000
ii 10000
jj 10000
kk 10000
ll 10000
mm 10000