
I have a large number of CPU-bound tasks that I want to run in parallel. Most of those tasks will return similar results, and I only need to store the unique results and count the non-unique ones.

Here's how it is currently designed: I use two managed dictionaries - one for the results and another one for the result counters. My tasks check those dictionaries using the unique result keys of the results they found, and then either write into both dictionaries or just increase the counter for a non-unique result (if I have to write, I acquire the lock and check again to avoid inconsistency).
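As a concrete illustration, here is a minimal sketch of that pattern; `compute` is a hypothetical stand-in for the real task, and the dictionaries and lock come from a `multiprocessing.Manager`:

    import multiprocessing as mp

    def compute(task):
        # hypothetical stand-in for the real CPU-bound computation
        return task % 10, task * task

    def work(task, results, counts, lock):
        key, value = compute(task)
        if key not in results:            # fast path: read without the lock
            with lock:
                if key not in results:    # re-check under the lock (double-checked locking)
                    results[key] = value
                    counts[key] = 1
                    return
        with lock:                        # a proxy-dict `+=` is read-modify-write,
            counts[key] += 1              # so the increment itself also needs the lock

    if __name__ == '__main__':
        manager = mp.Manager()
        results, counts = manager.dict(), manager.dict()
        lock = manager.Lock()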

What I am concerned about: Pool.map actually returns a result list, so even though I never save a reference to it, the results will pile up in memory until they are garbage collected. And even though that list would hold millions of plain Nones (I process the actual results in a different manner, so all my tasks just return None), I cannot rely on specific garbage-collector behavior, and the program might eventually run out of memory. I still want to keep the nice features of the pool but leave out this built-in result handling. Is my understanding correct, and is my concern valid? If so, are there any alternatives?
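To make the concern concrete: `Pool.map` always builds and returns a complete result list, one slot per input, even when every slot is just `None`:

    from multiprocessing import Pool

    def task(n):
        return None          # the real results are handled elsewhere

    if __name__ == '__main__':
        with Pool(4) as pool:
            out = pool.map(task, range(1000))
        print(len(out))      # 1000 -- one list slot per task was still allocated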

Also, now that I laid it out on paper it looks really clumsy :) Do you see a better way to design such a thing?

Thanks!

Anton K
  • Any reason you can't just use Pool's [`apply_async`](https://docs.python.org/2/library/multiprocessing.html#multiprocessing.pool.multiprocessing.Pool.apply_async) in a loop to spawn your processes? Even better if you store/discard the results back in the main process via the result callback. – Linuxios May 09 '17 at 19:14
  • I have a generator object that generates inputs for my pool. The map call asks the generator for the next input only when a worker becomes free, so I basically generate inputs on demand. If I used apply_async in a loop, I assume it would immediately generate a large number of pending tasks for my pool, which would consume all my memory. Or am I wrong? – Anton K May 09 '17 at 19:45
  • Aha. I think you might have to just do this manually. You can still use multiprocessing to launch the processes and manage the interprocess communication, but you might have to write the subprocesses yourself so that each one processes a value and, when finished, asks the generator for a new value to compute. – Linuxios May 09 '17 at 19:47
  • `multiprocessing` is a great library, but I'm not sure it's that flexible. Sorry I couldn't give a more satisfying answer! – Linuxios May 09 '17 at 19:47
  • That's what I was afraid of - having to do it manually :) No problem, thanks for your comments! – Anton K May 09 '17 at 19:54
  • `None` is a singleton, so you're just assigning millions of labels, not creating a billion results. This is unlikely to cause you to run out of memory before garbage collection. Try profiling it and see if your imagined problem is a real one. – Efron Licht May 09 '17 at 20:12
  • Profiling won't prove that it will always work. I agree it is highly unlikely though. – Anton K May 09 '17 at 21:37

1 Answer


Question: I still want to keep the nice features of the pool

Remove `return result` from `multiprocessing.Pool`:

  1. Copy `class MapResult` from `multiprocessing.pool` and inherit from `mp.pool.ApplyResult`.
    Add, replace, or comment out the following (each `...` stands for copied stdlib code that stays unchanged):

    import multiprocessing as mp
    from multiprocessing.pool import Pool
    
    class MapResult(mp.pool.ApplyResult):
        def __init__(self, cache, chunksize, length, callback, error_callback):
            super().__init__(cache, callback, error_callback=error_callback)
            ...
            # do not preallocate one result slot per task:
            #self._value = [None] * length
            self._value = None
            ...
        def _set(self, i, success_result):
            ...
            if success:
                # drop the worker's result instead of storing it:
                #self._value[i*self._chunksize:(i+1)*self._chunksize] = result
                pass
    
  2. Create your own `class myPool(Pool)` that inherits from `multiprocessing.Pool`.
    Copy `def _map_async(...)` from `multiprocessing.Pool`; the copied body then creates the `MapResult` from step 1 instead of the stdlib one.
    Add, replace, or comment out the following (a usage sketch follows below):

    class myPool(Pool):
        def __init__(self, processes=1):
            super().__init__(processes=processes)
    
        def _map_async(self, func, iterable, mapper, chunksize=None, callback=None,
                error_callback=None):
            ...
            # RUN is a module-level constant in multiprocessing.pool,
            # so it has to be qualified in the copied code:
            #if self._state != RUN:
            if self._state != mp.pool.RUN:
            ...
            # do not hand the MapResult back to the caller:
            #return result
    

Tested with Python: 3.4.2
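
For completeness, here is a minimal sketch of how the two classes above might be driven. `work` and `task_gen` are hypothetical placeholders, and the elided `...` parts are assumed to be filled in from the stdlib source. Since the modified `_map_async` no longer returns a `MapResult`, the stdlib `Pool.map` (which calls `.get()` on that return value) can no longer be used; instead, call `_map_async` directly and wait for completion with `close()`/`join()`:

    def work(task):
        # hypothetical task: record its result via the managed dicts, return nothing
        return None

    def task_gen():
        # hypothetical input generator
        yield from range(10 ** 6)

    if __name__ == '__main__':
        pool = myPool(processes=4)
        # mp.pool.mapstar is the module-level helper that the stdlib
        # map()/map_async() wrappers pass in as the mapper argument
        pool._map_async(work, task_gen(), mp.pool.mapstar, chunksize=100)
        pool.close()   # no more tasks will be submitted
        pool.join()    # block until the workers have drained the queue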

stovfl
  • Ability to generate inputs on demand (the map call will ask my generator for the next input only when a worker becomes free), reuse of processes, and in general it is just convenient to use, except for that one little thing with the results. Sorry, I didn't mean I implemented my own managed dictionary; I used the OOB one. Edited the post accordingly. – Anton K May 09 '17 at 21:33
  • Sorry again, OOB = out of the box. Just the Manager.dict() from multiprocessing :) – Anton K May 09 '17 at 21:48
  • @TTT: Updated my answer with a how-to for removing `return result` from `Pool`. – stovfl May 19 '17 at 10:55