
I have written a multiprocessing counter and compared it against the native collections.Counter.

Why is my Multiprocessing Counter slower than collections.Counter?

[multi-count.py]:

from collections import Counter
from multiprocessing import Process, Manager, Lock
import random
import time

class MultiProcCounter(object):
    def __init__(self):
        # Shared dict served by a Manager process; the Lock serializes updates.
        self.dictionary = Manager().dict()
        self.lock = Lock()

    def increment(self, item):
        # get + set on the proxy is not atomic, so guard the read-modify-write.
        with self.lock:
            self.dictionary[item] = self.dictionary.get(item, 0) + 1

def func(counter, item):
    counter.increment(item)

def multiproc_count(inputs):
    # One Process per input item, all sharing the same counter.
    counter = MultiProcCounter()
    procs = [Process(target=func, args=(counter, _in)) for _in in inputs]
    for p in procs: p.start()
    for p in procs: p.join()
    return counter.dictionary

inputs = [random.randint(1, 10) for _ in range(1000)]
start = time.time()
print(multiproc_count(inputs))
print(time.time() - start)
start = time.time()
print(Counter(inputs))
print(time.time() - start)

[out]:

{1: 88, 2: 95, 3: 99, 4: 98, 5: 102, 6: 111, 7: 99, 8: 103, 9: 97, 10: 108}
4.128664016723633
Counter({6: 111, 10: 108, 8: 103, 5: 102, 3: 99, 7: 99, 4: 98, 9: 97, 2: 95, 1: 88})
0.0006728172302246094

I ran it with Python 3:

$ ulimit -n 2048
$ python3 multi-count.py
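
To get a rough sense of where the time goes, one can time the same number of processes that do no counting at all. This is a minimal sketch (the noop helper is illustrative and not part of the script above); it isolates the start/join overhead from the actual counting and the Manager round-trips:

from multiprocessing import Process
import time

def noop(_):
    # Do nothing; we only measure process start/join overhead.
    pass

start = time.time()
procs = [Process(target=noop, args=(i,)) for i in range(1000)]
for p in procs: p.start()
for p in procs: p.join()
print(time.time() - start)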

To make the task harder, I increased the input size to 10,000 and I get an OSError (the tracebacks from the parent and a child process are interleaved on stderr):

  File "multi-count.py", line 29, in <module>
    print (multiproc_count(inputs))
  File "multi-count.py", line 23, in multiproc_count
Process Process-2043:
    for p in procs: p.start()
  File "/usr/local/Cellar/python3/3.5.2_3/Frameworks/Python.framework/Versions/3.5/lib/python3.5/multiprocessing/process.py", line 105, in start
Traceback (most recent call last):
    self._popen = self._Popen(self)
  File "/usr/local/Cellar/python3/3.5.2_3/Frameworks/Python.framework/Versions/3.5/lib/python3.5/multiprocessing/context.py", line 212, in _Popen
  File "/usr/local/Cellar/python3/3.5.2_3/Frameworks/Python.framework/Versions/3.5/lib/python3.5/multiprocessing/managers.py", line 709, in _callmethod
AttributeError: 'ForkAwareLocal' object has no attribute 'connection'

During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/usr/local/Cellar/python3/3.5.2_3/Frameworks/Python.framework/Versions/3.5/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
  File "/usr/local/Cellar/python3/3.5.2_3/Frameworks/Python.framework/Versions/3.5/lib/python3.5/multiprocessing/process.py", line 93, in run
  File "bpe-multi.py", line 18, in func
  File "bpe-multi.py", line 15, in increment
  File "<string>", line 2, in get
  File "/usr/local/Cellar/python3/3.5.2_3/Frameworks/Python.framework/Versions/3.5/lib/python3.5/multiprocessing/managers.py", line 713, in _callmethod
  File "/usr/local/Cellar/python3/3.5.2_3/Frameworks/Python.framework/Versions/3.5/lib/python3.5/multiprocessing/managers.py", line 700, in _connect
  File "/usr/local/Cellar/python3/3.5.2_3/Frameworks/Python.framework/Versions/3.5/lib/python3.5/multiprocessing/connection.py", line 487, in Client
  File "/usr/local/Cellar/python3/3.5.2_3/Frameworks/Python.framework/Versions/3.5/lib/python3.5/multiprocessing/connection.py", line 612, in SocketClient
  File "/usr/local/Cellar/python3/3.5.2_3/Frameworks/Python.framework/Versions/3.5/lib/python3.5/socket.py", line 134, in __init__
OSError: [Errno 24] Too many open files
    return _default_context.get_context().Process._Popen(process_obj)
  File "/usr/local/Cellar/python3/3.5.2_3/Frameworks/Python.framework/Versions/3.5/lib/python3.5/multiprocessing/context.py", line 267, in _Popen
    return Popen(process_obj)
  File "/usr/local/Cellar/python3/3.5.2_3/Frameworks/Python.framework/Versions/3.5/lib/python3.5/multiprocessing/popen_fork.py", line 20, in __init__
    self._launch(process_obj)
  File "/usr/local/Cellar/python3/3.5.2_3/Frameworks/Python.framework/Versions/3.5/lib/python3.5/multiprocessing/popen_fork.py", line 66, in _launch
    parent_r, child_w = os.pipe()
OSError: [Errno 24] Too many open files

And I couldn't raise the ulimit any further on my laptop:

$ ulimit -n 4096
-bash: ulimit: open files: cannot modify limit: Operation not permitted
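
Each child process needs at least one pipe (the os.pipe() call in the traceback) and each Manager proxy opens its own socket connection, so launching roughly 10,000 processes at once exhausts a 2048 open-file limit. A small sketch to check the current limit from inside Python (this assumes a Unix-like system, since the resource module is not available on Windows):

import resource

# Soft and hard limits on open file descriptors for the current process.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(soft, hard)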

Using multiprocessing.Pool:

from multiprocessing import Manager, Pool
import random


def func(counter, x):
    # Note: get + set on the manager dict is not atomic and there is no lock here,
    # so concurrent updates can be lost (the counts below do not sum to 10000).
    counter[x] = counter.get(x, 0) + 1


inputs = [random.randint(1,10) for _ in range(10000)]

manager = Manager()
counter = manager.dict()

pool = Pool(4)
for x in inputs:
    # One task per input item; each update round-trips to the manager process.
    pool.apply_async(func, [counter, x])
pool.close()
pool.join()

print(counter)

[out]:

$ time python multi-count.py 
{1: 978, 2: 978, 3: 997, 4: 982, 5: 958, 6: 1033, 7: 1044, 8: 1008, 9: 1007, 10: 1004}

real    0m16.187s
user    0m18.817s
sys 0m14.055s

With native collections.Counter:

$ time python3 -c 'import random; from collections import Counter; inputs = [random.randint(1,10) for _ in range(10000)]; print (Counter(inputs))'
Counter({6: 1067, 4: 1048, 3: 1021, 5: 1010, 9: 992, 7: 985, 8: 983, 1: 969, 2: 964, 10: 961})

real    0m0.099s
user    0m0.059s
sys 0m0.018s

$ time python3 -c 'import random; from collections import Counter; inputs = [random.randint(1,10) for _ in range(100000)]; print (Counter(inputs))'
Counter({9: 10159, 10: 10114, 8: 10046, 3: 10028, 7: 9998, 6: 9994, 2: 9982, 4: 9951, 1: 9898, 5: 9830})

real    0m0.236s
user    0m0.206s
sys 0m0.016s
alvas
  • Usually when the question is "why is my multiprocessing version of X slower than a single-process version", the answer is "because your task is too easy and so the overhead of creating processes outweighs the speed gain". – BrenBarn Nov 02 '16 at 06:25
  • You're creating a separate process for each individual input! Don't do that. Look at `multiprocessing.Pool`. – BrenBarn Nov 02 '16 at 06:41
  • Ah! But that's tricky with `Pool` and a shared `Manager` dict =( . Any idea how would the lock work with `Pool` and `Manager().dict()`? – alvas Nov 02 '16 at 07:04
  • Have you looked at examples like [this](http://stackoverflow.com/questions/11937895/python-multiprocessing-manager-initiates-process-spawn-loop)? – BrenBarn Nov 02 '16 at 07:19
  • Thanks for the link, looking at that. But the example shows a list appending that doesn't have the complication of a dict that needs to be accessed, locked and updated. BTW, still looking closer at the link you posted. – alvas Nov 02 '16 at 07:28
  • It sounds like maybe you should rethink your question (or ask a new one). It seems like what you're really asking is more about how to structure such code. In your example, it's unclear why you need the Manager. Why not just use a Pool where each process counts the data it sees, and then you take the resulting counts and add them, in a map-reduce sort of style? – BrenBarn Nov 02 '16 at 07:41
  • Updated with the pool, but i think i'm doing it wrongly with sequential loop through the inputs =( – alvas Nov 02 '16 at 07:44
  • Your Pool example looks fine, but it's still an inefficient way to do multiprocessing for this task. For counting, it would make more sense to split the `inputs` into roughly equal-sized chunks and just send one chunk to each process. Calling your function for each individual number in `inputs`, your program will spend a lot of time just passing the ints among the processes. Also `collections.Counter` is basically a dict and is going to be pretty fast, so it will be hard to beat its performance just by distributing it over processes unless you have a much larger input set. – BrenBarn Nov 02 '16 at 20:26
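
Following BrenBarn's suggestion in the comments above, here is a minimal map-reduce style sketch: split the inputs into one chunk per worker, count each chunk locally with collections.Counter, and merge the partial counts in the parent. The chunk size, worker count and count_chunk helper are illustrative, not from the original post:

from collections import Counter
from functools import reduce
from multiprocessing import Pool
import random

def count_chunk(chunk):
    # Each worker counts its own chunk; no shared dict, no locks.
    return Counter(chunk)

if __name__ == '__main__':
    inputs = [random.randint(1, 10) for _ in range(100000)]
    n_workers = 4
    chunk_size = -(-len(inputs) // n_workers)  # ceiling division
    chunks = [inputs[i:i + chunk_size] for i in range(0, len(inputs), chunk_size)]
    with Pool(n_workers) as pool:
        partial_counts = pool.map(count_chunk, chunks)
    # Merge the per-worker Counters in the parent (the "reduce" step).
    print(reduce(lambda a, b: a + b, partial_counts, Counter()))

Only the chunks and the small per-chunk Counters cross process boundaries, so the per-item round-trips to the manager disappear.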

0 Answers