10

After spending a lot of time trying to wrap my head around multiprocessing, I came up with the following code as a benchmark test:

Example 1:

from multiprocessing import Process

class Alter(Process):
    def __init__(self, word):
        Process.__init__(self)
        self.word = word
        self.word2 = ''

    def run(self):
        # Alter string + test processing speed
        for i in range(80000):
            self.word2 = self.word2 + self.word

if __name__=='__main__':
    # Send a string to be altered
    thread1 = Alter('foo')
    thread2 = Alter('bar')
    thread1.start()
    thread2.start()

    # wait for both to finish

    thread1.join()
    thread2.join()

    print(thread1.word2)
    print(thread2.word2)

This completes in 2 seconds (half the time of the multithreaded equivalent). Out of curiosity, I decided to run this next:

Example 2:

word2 = 'foo'
word3 = 'bar'

word = 'foo'
for i in range(80000):
    word2 = word2 + word

word = 'bar'
for i in range(80000):
    word3 = word3 + word

print(word2)
print(word3)

To my horror this ran in less than half a second!

What is going on here? I expected multiprocessing to run faster. Shouldn't Example 1 complete in half of Example 2's time, given that it is just Example 2 split across two processes?

Update:

After considering Chris's feedback, I have included the actual code that consumes the most processing time and that led me to consider multiprocessing:

self.ListVar = [[13379+ strings], [13379+ strings],
                [13379+ strings], [13379+ strings]]

for b in range(len(self.ListVar)):
    self.list1 = []
    self.temp = []
    for n in range(len(self.ListVar[b])):
        if not self.ListVar[b][n] in self.temp:
            self.list1.insert(n, self.ListVar[b][n] + '(' +
                              str(self.ListVar[b].count(self.ListVar[b][n])) +
                              ')')
            self.temp.insert(0, self.ListVar[b][n])

    self.ListVar[b] = list(self.list1)
Rhys

4 Answers

14

Multiprocessing could be useful for what you're doing, but not in the way you're thinking of using it. Since you're basically doing the same computation on every member of a list, you could use the multiprocessing.Pool.map method to perform that computation on the list members in parallel.

Here is an example that shows your code's performance using a single process and using multiprocessing.Pool.map:

from multiprocessing import Pool
from random import choice
from string import printable
from time import time

def build_test_list():
    # Builds a test list consisting of 5 sublists of 10000 strings each.
    # each string is 20 characters long
    testlist = [[], [], [], [], []]
    for sublist in testlist:
        for _ in xrange(10000):
            sublist.append(''.join(choice(printable) for _ in xrange(20)))
    return testlist

def process_list(l):
    # the time-consuming code
    result = []
    tmp = []
    for n in range(len(l)):
        if l[n] not in tmp:
            result.insert(n, l[n]+' ('+str(l.count(l[n]))+')')
            tmp.insert(0, l[n])
    return result

def single(l):
    # process the test list elements using a single process
    results = []
    for sublist in l:
        results.append(process_list(sublist))
    return results

def multi(l):
    # process the test list elements in parallel
    pool = Pool()
    results = pool.map(process_list, l)
    return results

print "Building the test list..."
testlist = build_test_list()

print "Processing the test list using a single process..."
starttime = time()
singleresults = single(testlist)
singletime = time() - starttime

print "Processing the test list using multiple processes..."
starttime = time()
multiresults = multi(testlist)
multitime = time() - starttime

# make sure they both return the same thing
assert singleresults == multiresults

print "Single process: {0:.2f}sec".format(singletime)
print "Multiple processes: {0:.2f}sec".format(multitime)

Output:

Building the test list...
Processing the test list using a single process...
Processing the test list using multiple processes...
Single process: 34.73sec
Multiple processes: 24.97sec
mdeous
  • I couldn't decide who to give the points to :S Yours and David's answers are both very good. I thought I'd give him the points because he has fewer, but I'm sure I will be using this code in the future. Thanks, I've learnt a lot – Rhys Jan 08 '12 at 09:48
  • @Rhys no problem ;) as long as this has been useful, I'm happy. (You can give points to multiple answers, but you can only choose one as THE answer.) – mdeous Jan 08 '12 at 09:55
  • This is a great idea in general, and something every novice to `multiprocessing` should be exposed to as soon as possible… but it doesn't actually address his problem. He already had half the work being done on each process; unless the tasks are wildly variable in time required (which they aren't in either your example or his), adding a pool just adds a bit of extra overhead. It may still be worth doing for readability and organization, but not for performance reasons. (Again, that's not true for all problems, just ones like this.) – abarnert Aug 14 '14 at 17:31
12

This example is too small to benefit from multiprocessing.

There's a LOT of overhead when starting a new process. If there were heavy processing involved, that overhead would be negligible. But your example really isn't all that intensive, so you're bound to notice it.

You'd probably notice a bigger difference with real threads; too bad Python (well, CPython) has issues with CPU-bound threading.
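
To put a rough number on that overhead, here is a minimal sketch that times starting and joining a process which does no work at all, against the same no-op called inline (the exact figures vary a lot by platform and start method):

import time
from multiprocessing import Process

def do_nothing():
    pass  # no real work, so everything measured below is pure overhead

if __name__ == '__main__':
    t0 = time.time()
    p = Process(target=do_nothing)
    p.start()
    p.join()
    print("start + join of one idle process: %.4f sec" % (time.time() - t0))

    t0 = time.time()
    do_nothing()
    print("same call made inline:            %.6f sec" % (time.time() - t0))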

Chris Eberle
  • What would you consider 'heavy processing'? I have increased the range to 100000 for both examples. Example 1 finishes in 17 sec! Example 2 still finishes in 0 sec. I tried to go higher in the range() but Example 1 literally did not return after 10 minutes – Rhys Jan 08 '12 at 05:09
  • @Rhys well for one thing you've got yourself an example that just eats and eats memory, that's bound to cause problems. Real CPU-bound processing code would be like, I dunno, matrix decomposition or something. – Chris Eberle Jan 08 '12 at 05:11
  • I'm testing this for the following application: taking a list of strings (about 17000 of them), asking whether each has any duplicate entries, and if so appending the number of duplicates in brackets to that string entry ... should I use multiprocessing for this? – Rhys Jan 08 '12 at 05:24
  • Rhys: perhaps you should post a snippet of your actual code? There might be other performance optimizations we could suggest. – David Robinson Jan 08 '12 at 05:34
  • Done, the main source of the slowdown has been added to my main question under Update – Rhys Jan 08 '12 at 06:38
  • What I don't understand is: you say there is a 'LOT of overhead when starting a new process'. How does this account for the fact that the multiprocessing version slows down INCREMENTALLY as I increase the range()? Judging from what you say, there should be a one-off time cost per subprocess, not a dynamic one – Rhys Jan 08 '12 at 07:23
  • 2
    @Rhys let me give you the single most important piece of advice ever given to me: when it comes to optimization, measure, measure, and measure again. Unless you start running profilers to see exactly where the bottleneck is happening, it's all speculation. Processes do have higher overhead than threads. This is a fact. However I can't say with 100% confidence how this will impact your particular code. – Chris Eberle Jan 08 '12 at 07:47
  • I agree. As a side thought: if running in the main thread is the fastest result I can get, then it seems logical that if I created two subprocess.Popen calls to two binary files and piped half the calculations to each, there would be a one-off cost to open each binary, but the calculation would be done within the 'main thread' of each binary, hence halving the processing time of doing it in one main thread. I will try this and report back. Maybe speculation will lead to a favorable result – Rhys Jan 08 '12 at 07:59
  • "You'd probably notice a bigger difference with real threads". Why? The initial cost of spinning up the processes may be much higher than threads on some platforms (mainly Windows), and the cost of sending data to the processes will be higher on most platforms, but as far as actually doing the work, they will run at least as fast, with the same amount of overhead, as threads. So threads are only faster when the tasks are way too small to be parallelizing in the first place (as with the OP's question). – abarnert Aug 10 '14 at 23:13
  • @abarnert you hit the nail on the head. It's that startup penalty that matters here, not the raw processing time. Given how small the example is, far more time will be spent initializing and shutting down the processes than actually doing the computation. It would be quicker to just do the computation in a serial manner. In this case. – Chris Eberle Aug 14 '14 at 03:01
  • @Chris: Agreed; my point was that when the overhead is too high for child processes, it's almost always too high for threads too; your answer, at least as I read it, implies that threads would be useful for cases like this if not for the GIL. – abarnert Aug 14 '14 at 17:25
12

ETA: Now that you've posted your code, I can tell you there is a simple way to do what you're doing MUCH faster (>100 times faster).

I see that what you're doing is adding a frequency in parentheses to each item in a list of strings. Instead of counting all the elements each time (which, as you can confirm using cProfile, is by far the largest bottleneck in your code), you can just create a dictionary that maps each element to its frequency. That way, you only have to go through the list twice: once to create the frequency dictionary, and once to use it to add the frequencies.
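
The core of that two-pass idea can be sketched in a few lines with collections.Counter (just an illustration; the full benchmark below uses a defaultdict instead):

from collections import Counter

def annotate_counts(strings):
    # pass 1: count every string; pass 2: annotate the first occurrence of each
    freq = Counter(strings)
    seen = set()
    out = []
    for s in strings:
        if s not in seen:
            out.append('%s(%d)' % (s, freq[s]))
            seen.add(s)
    return out

print(annotate_counts(['betty', 'harry', 'sam', 'sam']))
# ['betty(1)', 'harry(1)', 'sam(2)']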

Here I'll show my new method, time it, and compare it to the old method using a generated test case. The test case even shows the new result to be exactly identical to the old one. Note: All you really need to pay attention to below is the new_method.

import random
import time
import collections
import cProfile

LIST_LEN = 14000

def timefunc(f):
    t = time.time()
    f()
    return time.time() - t


def random_string(length=3):
    """Return a random string of given length"""
    return "".join([chr(random.randint(65, 90)) for i in range(length)])


class Profiler:
    def __init__(self):
        self.original = [[random_string() for i in range(LIST_LEN)]
                            for j in range(4)]

    def old_method(self):
        self.ListVar = self.original[:]
        for b in range(len(self.ListVar)):
            self.list1 = []
            self.temp = []
            for n in range(len(self.ListVar[b])):
                if not self.ListVar[b][n] in self.temp:
                    self.list1.insert(n, self.ListVar[b][n] + '(' +
                                      str(self.ListVar[b].count(self.ListVar[b][n])) +
                                      ')')
                    self.temp.insert(0, self.ListVar[b][n])

            self.ListVar[b] = list(self.list1)
        return self.ListVar

    def new_method(self):
        self.ListVar = self.original[:]
        for i, inner_lst in enumerate(self.ListVar):
            freq_dict = collections.defaultdict(int)
            # create frequency dictionary
            for e in inner_lst:
                freq_dict[e] += 1
            temp = set()
            ret = []
            for e in inner_lst:
                if e not in temp:
                    ret.append(e + '(' + str(freq_dict[e]) + ')')
                    temp.add(e)
            self.ListVar[i] = ret
        return self.ListVar

    def time_and_confirm(self):
        """
        Time the old and new methods, and confirm they return the same value
        """
        time_a = time.time()
        l1 = self.old_method()
        time_b = time.time()
        l2 = self.new_method()
        time_c = time.time()

        # confirm that the two are the same
        assert l1 == l2, "The old and new methods don't return the same value"

        return time_b - time_a, time_c - time_b

p = Profiler()
print p.time_and_confirm()

When I run this, it gets times of (15.963812112808228, 0.05961179733276367), meaning it's about 250 times faster, though this advantage depends on both how long the lists are and the frequency distribution within each list. I'm sure you'll agree that with this speed advantage, you probably won't need to use multiprocessing :)

(My original answer is left below for posterity.)

ETA: By the way, it is worth noting that this algorithm is roughly linear in the length of the lists, while the code you used is quadratic. This means its advantage grows with the number of elements. For example, if you increase the length of each list to 1000000, it takes only 5 seconds to run. Based on extrapolation, the old code would take over a day :)
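
If you want to see those growth rates directly, here is a stripped-down sketch (it skips the de-duplication step, so it only illustrates the scaling, not the full method). Doubling the input should roughly quadruple the count()-based time while only roughly doubling the Counter-based time:

import random
import string
import time
from collections import Counter

def quadratic(strings):
    # list.count() rescans the whole list for every element: O(n^2)
    return [s + '(%d)' % strings.count(s) for s in strings]

def linear(strings):
    # one pass to count, one pass to annotate: O(n)
    freq = Counter(strings)
    return [s + '(%d)' % freq[s] for s in strings]

for n in (5000, 10000):
    data = [''.join(random.choice(string.ascii_uppercase) for _ in range(3))
            for _ in range(n)]
    for fn in (quadratic, linear):
        t0 = time.time()
        fn(data)
        print('%-9s n=%-6d %.3f sec' % (fn.__name__, n, time.time() - t0))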


It depends on the operation you are performing. For example:

import time
NUM_RANGE = 100000000

from multiprocessing import Process

def timefunc(f):
    t = time.time()
    f()
    return time.time() - t

def multi():
    class MultiProcess(Process):
        def __init__(self):
            Process.__init__(self)

        def run(self):
            # Alter string + test processing speed
            for i in xrange(NUM_RANGE):
                a = 20 * 20

    thread1 = MultiProcess()
    thread2 = MultiProcess()
    thread1.start()
    thread2.start()
    thread1.join()
    thread2.join()

def single():
    for i in xrange(NUM_RANGE):
        a = 20 * 20

    for i in xrange(NUM_RANGE):
        a = 20 * 20

print timefunc(multi) / timefunc(single)

On my machine, the multiprocess version takes only ~60% of the time of the single-process one.

David Robinson
  • Hey David, thanks a lot for the great code. I'll accept this answer. One thing though, perhaps I wasn't clear enough in the question: the bracketed count of strings should only count those strings in each list. For instance, [['betty', 'harry', 'sam', 'sam'], ['gary', 'larry', 'fed', 'sam'] ...] --- should return --- [['betty(1)', 'harry(1)', 'sam(2)', 'sam(2)'], ['gary(1)', 'larry(1)', 'fed(1)', 'sam(1)'] ...]. Currently when I pdb.set_trace() and print e.g. ListVar[0] and find an entry with '(2)' or '(3)' and search for the corresponding string inside ListVar[0] ... there is no other – Rhys Jan 08 '12 at 10:00
  • 1
    In both my code and yours, it does count only the strings in each list (not in the overall, nested list). Notice that the frequency dictionary is recreated for each inner_lst. Also, you show "sam(2)" as appearing twice in your example here, but the way you wrote the code, where it checks the temp array for ones that already exist, it would appear only once: [['betty(1)', 'harry(1)', 'sam(2)'], ['gary(1)', 'larry(1)', 'fed(1)', 'sam(1)']]. Both my method and yours return exactly that. – David Robinson Jan 08 '12 at 10:06
  • OK thanks, you're right about the temp thing, there should only be 1 sam(2). I will recheck that it is counting correctly when I get back home – Rhys Jan 08 '12 at 19:23
  • Yeah, you are right. All duplicates are deleted, so I wouldn't find one, which is exactly what I needed. Thanks a bunch – Rhys Jan 10 '12 at 09:00
0

This thread has been very useful!

Just a quick observation about the second piece of code provided by David Robinson above (answered Jan 8 '12 at 5:34), which was the code most suitable to my current needs.

In my case I had previous records of the running times of a target function without multiprocessing. When I used his code to implement a multiprocessing version, his timefunc(multi) didn't reflect the actual time of multi; rather, it appeared to reflect only the time spent in the parent.

What I did was externalise the timing, and the time that I got looked more like what I expected:

start = time.time()
multi()  # or single(), whichever is being measured
elapsed = (time.time() - start) / number_of_workers  # number_of_workers: however many workers were started
print(elapsed)

In my case, on a dual-core machine, the total time taken by 'x' workers running the target function was about twice as fast as running a simple for-loop over the target function with 'x' iterations.

I am new to multiprocessing, though, so please treat this observation with caution.
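
For what it's worth, the timeit module (mentioned in the comment below) gives more reliable numbers than hand-rolled time.time() deltas. Here is a self-contained sketch, where work() is just a stand-in for whatever the real target function is:

import timeit
from multiprocessing import Process

def work():
    # stand-in for the real CPU-bound target function
    total = 0
    for i in range(10 ** 6):
        total += i * i

def single():
    # run the workload twice in the current process
    work()
    work()

def multi():
    # run the same two workloads in two child processes
    p1 = Process(target=work)
    p2 = Process(target=work)
    p1.start()
    p2.start()
    p1.join()
    p2.join()

if __name__ == '__main__':
    # number=1 because each call is already slow; take the best of 3 repeats
    multi_time = min(timeit.repeat(multi, number=1, repeat=3))
    single_time = min(timeit.repeat(single, number=1, repeat=3))
    print('multi:  %.3f sec' % multi_time)
    print('single: %.3f sec' % single_time)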

PradyJord
  • 1
    You really shouldn't be timing this way; see the [`timeit`](https://docs.python.org/3/library/timeit.html) library for the right way to measure wall-clock time taken by your code, but briefly: `elapsed = min(timeit.repeat(multi, number=100, repeat=3))` will make sure to use the right clock function, take care of things you didn't think of like disabling the GC cycle detector, run your code 100 times, repeat the test 3 times and take the lowest value so you can be sure there were no externalities interfering with the timing, etc. – abarnert Aug 14 '14 at 17:34