
I have a Python script running that starts the same function in multiple processes. The function creates and processes two counters (c1 and c2). The results of all the c1 counters from the forked processes should be merged together, and the same goes for the results of all the c2 counters returned by the different forks.

My (pseudo)code looks like this:

import multiprocessing
from collections import Counter

def countIt(cfg):
    c1 = Counter()
    c2 = Counter()
    # do some things and fill the counters by counting words in a text, e.g.
    # c1 = Counter({'apple': 3, 'banana': 0})
    # c2 = Counter({'blue': 3, 'green': 0})

    return c1, c2

if __name__ == '__main__':
    cP1 = Counter()
    cP2 = Counter()
    cfg = "myConfig"
    p = multiprocessing.Pool(4)  # creating 4 forks
    c1, c2 = p.map(countIt, cfg)[:2]
    # 1.) This will only work with [:2], which seems to be no good idea
    # 2.) at this point c1 and c2 are tuples of counters, not single
    # counters anymore, so the following will not work:
    cP1 + c1
    cP2 + c2

Following the example above, I need a result like:

cP1 = Counter({'apple': 25, 'banana': 247, 'orange': 24})
cP2 = Counter({'red': 11, 'blue': 56, 'green': 3})

So my question: how can I count things inside a forked process in order to aggregate each counter (all c1 and all c2) in the parent process?

  • @mattm that will not work, because `sum()` will not return a counter?! The following error occurs: `TypeError: unsupported operand type(s) for +: 'int' and 'Counter'` – The Bndr Sep 30 '15 at 13:04
    At a minimum this line is definitely an error: `c1, c2 = p.map(countIt,cfg)[:2]`. You can see how to deal with the results in swenzel's answer. – KobeJohn Sep 30 '15 at 14:02

1 Answer


You need to "unzip" your result, for example with a for-each loop. You will receive a list of tuples, where each tuple is a pair of counters: (c1, c2).
With your current solution you actually mix them up: you assigned [(c1a, c2a), (c1b, c2b)] to c1, c2, meaning that c1 contains (c1a, c2a) and c2 contains (c1b, c2b).
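For illustration, here is roughly what `p.map` gives you back, and one way to separate the pairs with `zip` (the sample values below are invented, not from the question):

from collections import Counter

# roughly what p.map(countIt, configs) returns: one (c1, c2) pair per config
results = [(Counter({'apple': 3}), Counter({'blue': 3})),
           (Counter({'apple': 2}), Counter({'green': 1}))]

# zip(*results) transposes the list of pairs into two groups of counters
all_c1, all_c2 = zip(*results)

# note the empty Counter() as start value: sum()'s default start of 0
# cannot be added to a Counter (this is the TypeError from the comments)
cP1 = sum(all_c1, Counter())   # Counter({'apple': 5})
cP2 = sum(all_c2, Counter())   # Counter({'blue': 3, 'green': 1})

With `imap_unordered` you can aggregate the pairs incrementally instead, without materializing the whole result list first.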

Try this:

import multiprocessing
from contextlib import closing
from collections import Counter

if __name__ == '__main__':
    cP1 = Counter()
    cP2 = Counter()

    # I hope you have an actual list of configs here, otherwise map
    # will call `countIt` with the single characters of the string 'myConfig'
    cfg = "myConfig"

    # `contextlib.closing` makes sure the pool is closed after we're done.
    # In python3, Pool is itself a context manager and you don't need to
    # surround it with `closing` in order to be able to use it in the `with`
    # construct.
    # This approach, however, is compatible with both python2 and python3.
    with closing(multiprocessing.Pool(4)) as p:
        # Just counting, no need to order the results.
        # This might actually be a bit faster.
        for c1, c2 in p.imap_unordered(countIt, cfg):
            cP1 += c1
            cP2 += c2
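For reference, a Python-3-only sketch of the same loop (reusing `countIt` from the question; the config list here is a placeholder). In Python 3, `Pool` can be used directly in a `with` statement; note that on exit it calls `terminate()` rather than `close()`, so consume the results inside the block:

import multiprocessing
from collections import Counter

if __name__ == '__main__':
    cP1, cP2 = Counter(), Counter()
    configs = ['cfgA', 'cfgB', 'cfgC', 'cfgD']  # placeholder list of configs

    # Python 3: Pool implements the context manager protocol itself
    with multiprocessing.Pool(4) as p:
        for c1, c2 in p.imap_unordered(countIt, configs):
            cP1 += c1
            cP2 += c2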
  • Not OP but thanks for improving the code with the closing context manager. I hadn't seen it before, maybe because I haven't used mp in py3 yet. – KobeJohn Sep 30 '15 at 14:01
  • Thank you, that works. Just a few minutes ago I found out that python builds a list of all the results from all the forks, like `[(counter(),counter()), (counter(),counter()), ...]`. So your answer fits exactly that point. Thank you. Using `closing` is absolutely new to me, but interesting! :-) – The Bndr Sep 30 '15 at 14:18