I have a large number of CPU-bounded tasks that I want to run in parallel. Most of those tasks will return similar results and I only need to store unique results and count non-unique ones.
Here's how it is currently designed: I use two managed dictionaries - one for results and another one for result counters. My tasks are checking those dictionaries using unique result keys for the results they found and either write into both dictionaries or only increase the counters for non-unique results (if I have to write I acquire the lock and check again to avoid inconsistency).
What I am concerned about: since Pool.map should actually return result object, even though I do not save a reference to it, results will still pile up in memory until they are garbage collected. Even though I will have millions of just None's there (since I am processing results in a different manner and all my tasks just return None) I can not rely on specific garbage collector behavior so the program might eventually run out of memory. I still want to keep nice features of the pool but leave out this built-in result handling. Is my understanding correct and is my concern valid? If so, are there any alternatives?
Also, now that I laid it out on paper it looks really clumsy :) Do you see a better way to design such thing?
Thanks!