
My attempt to speed up one of my applications using multiprocessing resulted in lower performance. I am sure it is a design flaw, but that is the point of the discussion: how to better approach this problem in order to take advantage of multiprocessing.

My current results on a 1.4 GHz Atom:

  1. SP Version = 19 seconds
  2. MP Version = 24 seconds

Both versions of the code can be copied and pasted for review, and the dataset at the bottom can be pasted as well. (I decided against using xrange in order to illustrate the problem.) A rough timing sketch follows the data.

First the SP version:

*PASTE DATA HERE*    

def calc():
    for i, valD1 in enumerate(D1):
        for i, valD2 in enumerate(D2):
            for i, valD3 in enumerate(D3):  
                for i, valD4 in enumerate(D4):
                    for i, valD5 in enumerate(D5):
                        for i, valD6 in enumerate(D6):
                            for i, valD7 in enumerate(D7):
                                sol1=float(valD1[1]+valD2[1]+valD3[1]+valD4[1]+valD5[1]+valD6[1]+valD7[1])
                                sol2=float(valD1[2]+valD2[2]+valD3[2]+valD4[2]+valD5[2]+valD6[2]+valD7[2])
    return None

print(calc())

Now the MP version:

import multiprocessing
import itertools

*PASTE DATA HERE*

def calculate(vals):
    sol1=float(valD1[0]+valD2[0]+valD3[0]+valD4[0]+valD5[0]+valD6[0]+valD7[0])
    sol2=float(valD1[1]+valD2[1]+valD3[1]+valD4[1]+valD5[1]+valD6[1]+valD7[1])
    return none

def process():
    pool = multiprocessing.Pool(processes=4)
    prod = itertools.product(([x[1],x[2]] for x in D1),
                             ([x[1],x[2]] for x in D2),
                             ([x[1],x[2]] for x in D3),
                             ([x[1],x[2]] for x in D4),
                             ([x[1],x[2]] for x in D5),
                             ([x[1],x[2]] for x in D6),
                             ([x[1],x[2]] for x in D7))
    result = pool.imap(calculate, prod, chunksize=2500)
    pool.close()
    pool.join()
    return result

if __name__ == "__main__":    
    print(process())

And the data for both:

D1 = [['A',7,4],['B',3,7],['C',6,1],['D',12,6],['E',4,8],['F',8,7],['G',11,3],['AX',11,7],['AX',11,2],['AX',11,4],['AX',11,4]]
D2 = [['A',7,4],['B',3,7],['C',6,1],['D',12,6],['E',4,8],['F',8,7],['G',11,3],['AX',11,7],['AX',11,2],['AX',11,4],['AX',11,4]]
D3 = [['A',7,4],['B',3,7],['C',6,1],['D',12,6],['E',4,8],['F',8,7],['G',11,3],['AX',11,7],['AX',11,2],['AX',11,4],['AX',11,4]]
D4 = [['A',7,4],['B',3,7],['C',6,1],['D',12,6],['E',4,8],['F',8,7],['G',11,3],['AX',11,7],['AX',11,2],['AX',11,4],['AX',11,4]]
D5 = [['A',7,4],['B',3,7],['C',6,1],['D',12,6],['E',4,8],['F',8,7],['G',11,3],['AX',11,7],['AX',11,2],['AX',11,4],['AX',11,4]]
D6 = [['A',7,4],['B',3,7],['C',6,1],['D',12,6],['E',4,8],['F',8,7],['G',11,3],['AX',11,7],['AX',11,2],['AX',11,4],['AX',11,4]]
D7 = [['A',7,4],['B',3,7],['C',6,1],['D',12,6],['E',4,8],['F',8,7],['G',11,3],['AX',11,7],['AX',11,2],['AX',11,4],['AX',11,4]]
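
For reference, here is a minimal sketch of how both versions can be timed (it assumes calculate() is first fixed to unpack the tuple it receives and to return None, as discussed in the comments below). The object returned by pool.imap is lazy, so it has to be drained for the measurement to include the workers actually finishing:

import time

def timed(label, fn):
    # Rough wall-clock timing for an SP vs. MP comparison.
    start = time.time()
    out = fn()
    # pool.imap returns a lazy iterator; drain it so the timing
    # includes the workers completing all of their chunks.
    if hasattr(out, '__iter__'):
        for _ in out:
            pass
    print("%s: %.2f seconds" % (label, time.time() - start))

if __name__ == "__main__":
    timed("SP version", calc)
    timed("MP version", process)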

And now the theory:

Since there is so little actual work per item (just summing seven ints), the job is dominated by moving CPU-bound data between processes, and the interprocess communication creates too much overhead for multiprocessing to be effective. This seems like a situation where I really need the ability to multithread, so at this point I am looking for suggestions before I try this in a different language because of the GIL.
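
To put a rough number on that: with the sample data above, each list has 11 entries, so the product contains 11**7 = 19,487,171 tuples, and every one of them has to be pickled, queued to a worker, and have its tiny result queued back. A stripped-down sketch along the lines of the MCVE suggested in the comments below (the noop worker and the item count are made up for illustration) can compare pure dispatch overhead against the actual summing:

import time
import multiprocessing

def noop(item):
    # Do no work at all; the measured time is then almost entirely
    # pickling/queueing (IPC) overhead.
    return None

def add_seven(item):
    # Roughly the real per-item work: summing seven small ints.
    return sum(item)

if __name__ == "__main__":
    N = 1000000  # stand-in for the ~19.5 million product tuples
    items = [(1, 2, 3, 4, 5, 6, 7)] * N

    pool = multiprocessing.Pool(processes=4)
    for name, func in (("noop", noop), ("add_seven", add_seven)):
        start = time.time()
        for _ in pool.imap(func, items, chunksize=10000):
            pass
        print("%s: %.2f seconds" % (name, time.time() - start))
    pool.close()
    pool.join()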

********Debugging

File "calc.py", line 309, in <module>
    smart_calc()
  File "calc.py", line 290, in smart_calc
    results = pool.map(func, chunk_list)
  File "/usr/local/lib/python2.7/multiprocessing/pool.py", line 250, in map
    return self.map_async(func, iterable, chunksize).get()
  File "/usr/local/lib/python2.7/multiprocessing/pool.py", line 554, in get
    raise self._value
TypeError: sequence index must be integer, not 'slice'

In this case, total_len = 108 and CHUNKS is set to 2. When CHUNKS is reduced to 1, it works.
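
The TypeError above is re-raised in the parent, so the child-side stack trace is lost; dano asks for it in the comments below. A picklable wrapper along these lines (smart_calc_debug is a made-up name; smart_calc is the function from the answer below) would print the real traceback from inside the worker:

import traceback

def smart_calc_debug(valD1, valD2, valD3, valD4, valD5, valD6, valD7, slices):
    # Same call as smart_calc from the answer, but prints the child-side
    # traceback so the failing line is visible instead of only the
    # TypeError re-raised from pool.py in the parent.
    try:
        return smart_calc(valD1, valD2, valD3, valD4, valD5, valD6, valD7, slices)
    except Exception:
        traceback.print_exc()
        raise

# Then, in smart_process():
#     func = functools.partial(smart_calc_debug, D1, D2, D3, D4, D5, D6, D7)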

nodoze
  • possible duplicate of [Python, using multiprocess is slower than not using it](http://stackoverflow.com/questions/8775475/python-using-multiprocess-is-slower-than-not-using-it) – ali_m Aug 10 '14 at 23:05
  • Have you tried using a larger chunksize? If the amount of work you're doing in each job is lower than the cost of running a job, multiprocessing is going to hurt more than it helps. – abarnert Aug 10 '14 at 23:06
  • More importantly, these functions don't seem to do the same thing. Try it with much smaller data sets and print out what you're working on in the inner loop/in each call to `calculate`. As a side benefit, even if that doesn't give you the answer, creating those smaller data sets will allow you to post an [MCVE](http://stackoverflow.com/help/mcve) that other people can debug instead of just guessing. – abarnert Aug 10 '14 at 23:08
  • Have you considered a [pandas or numpy](http://scipy.org/) solution? – wwii Aug 10 '14 at 23:13
  • @ali_m: The non-accepted answer in that question does probably solve (part of) his problem, but the accepted answer doesn't, so I'm not sure it's useful to close as a dup. But definitely worth having as a linked related question, at least. – abarnert Aug 10 '14 at 23:14
  • @abarnert chunksize could possibly help; I will have to learn how to implement it. As to your other comment, itertools.product gives me the iterations I need to accomplish the same thing as the nested for-loops. – nodoze Aug 10 '14 at 23:15
  • `map` uses a sensible default for `chunksize`: `if chunksize is None: chunksize, extra = divmod(len(iterable), len(self._pool) * 4)`. I guess you could try tweaking it though. – dano Aug 10 '14 at 23:16
  • @nodoze: I know what `product` does, but clearly `int(valD1[10][0])` in your MP program is not the same thing as `valD1[10]` in your SP program. Maybe you have some other difference somewhere else that counteracts that, but if so, you haven't explained it, which is why I suspect you're not actually doing the same work in the two cases. – abarnert Aug 10 '14 at 23:17
  • @wwii I have considered it, but I didn't believe the CSVs to be the bottleneck. Can you show me otherwise? – nodoze Aug 10 '14 at 23:19
  • @abarnert I think that's a bug in his example code, caused by reducing the original code down some. [He was passing a list of items to the worker previously.](http://stackoverflow.com/questions/25226376/python-no-output-when-using-pool-map-async) . – dano Aug 10 '14 at 23:19
  • @dano: When the iterable is an iterator, as in this case, it can only possibly do that by first doing `list(it)`, which could be adding more startup overhead allocating and page swapping than he saves. – abarnert Aug 10 '14 at 23:20
  • @abarnert That's exactly what `map` does: `if not hasattr(iterable, '__len__'): iterable = list(iterable)` – dano Aug 10 '14 at 23:20
  • @dano: If he was passing a list of items to the worker, but he's passing single values to the single-process version, that's just another reason this is an unequal test and therefore doesn't show anything. At any rate, this is exactly why he should reduce the problem to an MCVE, rather than to something that looks kinda the same but can't actually be tested to see if it really is the same. – abarnert Aug 10 '14 at 23:21
  • Actually, that gives me an idea: Use `imap(calculate, prod, chunksize=)`. That will avoid the `list(iterable)` overhead, while still allowing you to specify a large chunk size. – dano Aug 10 '14 at 23:21
  • @abarnert If I do absolutely nothing in calculate() except return None, it still takes 16 seconds. If you know of a different way to send the iterables in a multiprocessing structure, please let me know. I fully expect I am doing something wrong :) – nodoze Aug 10 '14 at 23:22
  • @nodoze: Well then, that's a great MCVE. Why don't you write some stripped-down code that generates dead-simple data and demonstrates your 16 seconds of overhead for doing absolutely nothing with it? Then we can test that and see what your problem is—too many small jobs, swapping trying to create the huge list in the first place, or whatever—instead of trying to guess at what might be happening based on what you might have in your actual data and code. – abarnert Aug 10 '14 at 23:24
  • @nodoze You can probably create a good MCVE just by creating a 800,000 item (or however long your iterable is) iterable via `xrange`, and passing that to a worker process via `map`. – dano Aug 10 '14 at 23:25
  • @dano: Nice, and it also gives him back an element at a time instead of all of the elements in a list, just in case the problem in his real code is that he's returning big arrays rather than just an int. – abarnert Aug 10 '14 at 23:26
  • @dano: This seems to be Python 3, given the `print` as a function. (Yes, he could be using a `__future__` statement, or just writing parens around the single argument to `print` for forward compatibility, but Python 3 is the more obvious guess.) If so `range`, not `xrange`. – abarnert Aug 10 '14 at 23:27
  • @abarnert 2.7.5; never assume I am doing things correctly :) I'll try dano's suggestion and I will create an MCVE. – nodoze Aug 10 '14 at 23:31
  • @nodoze, I think that the `sum` operations you're doing in the worker processes just aren't expensive enough to offset the IPC overhead, no matter what approach (`map`/`imap`) you take. All you're doing in the workers is adding seven numbers - that's cheap. To get a speedup from `multiprocessing`, you need to reduce the amount of IPC and increase the amount of work happening in the workers. – dano Aug 11 '14 at 00:03
  • @dano After finding the sweet spot for chunksize, imap improved performance dramatically; the 16-second example is now 6. I am wondering if this method will now scale better than the SP version as I increase the data? – nodoze Aug 11 '14 at 00:09
  • @nodoze Make sure you're including the cost of actually waiting for all the workers to complete when you use `imap`. You have to iterate over the object returned by the `imap` call to ensure all the workers are done. When I try both `imap` and `map` this way, `imap` is only slightly faster, and still much slower than the synchronous version. – dano Aug 11 '14 at 00:16
  • @dano What if I somehow chained n sums together to create more work? I would have to somehow prepare a pile of iterables for the workers? Does that even make sense? – nodoze Aug 11 '14 at 00:20
  • @dano In my SP example with the nested for-loops, what is an appropriate piece of work/calculation to use as a comparison to the return None in my MP example? – nodoze Aug 11 '14 at 00:31
  • One big question here: How many cores do you have? Or, more specifically, exactly which Atom do you have? On a 4-core i7, I get 19.578s for the SP version, and 6.875s for the MP version (and 6.335s if I switch it to 8 processes, so surprisingly hyperthreading even helps a tiny bit here). So, you are definitely getting a benefit from multiprocessing… but maybe not enough benefit to offset the costs on, say, a 2-core machine with shared cache and a narrow pipeline? – abarnert Aug 11 '14 at 04:51
  • Also, typo in your code: `return none` raises a `NameError`. You wanted `return None`—or just nothing, because a function that falls off the end returns `None` anyway. And raising an exception to get caught by the MP machinery does seem to make a small difference; the 6.875s came down to 6.441s. – abarnert Aug 11 '14 at 04:53

1 Answer


OK, I think I've figured out how to actually get a speed boost from multiprocessing. Since your actual source lists aren't very long, it's reasonable to pass them in their entirety to the worker processes. So, if each worker process has copies of the same source lists, ideally we'd want each of them to iterate over a different piece of the product in parallel and just sum up that unique slice. Because we know the size of the input lists, we can accurately determine how long itertools.product(D1, D2, ...) will be, which means we can also accurately determine how big each chunk should be to evenly distribute the work. So, we can provide each worker with the specific range of the itertools.product iterator that it should iterate over and sum:

import math
import itertools
import multiprocessing
import functools

def smart_calc(valD1, valD2, valD3, valD4, valD5, valD6, valD7, slices):
    # Build an iterator over the entire data set
    prod = itertools.product(([x[1],x[2]] for x in valD1), 
                             ([x[1],x[2]] for x in valD2), 
                             ([x[1],x[2]] for x in valD3), 
                             ([x[1],x[2]] for x in valD4), 
                             ([x[1],x[2]] for x in valD5), 
                             ([x[1],x[2]] for x in valD6), 
                             ([x[1],x[2]] for x in valD7))

    # But only iterate over our unique slice
    for subD1, subD2, subD3, subD4, subD5, subD6, subD7 in itertools.islice(prod, slices[0], slices[1]):
        sol1=float(subD1[0]+subD2[0]+subD3[0]+subD4[0]+subD5[0]+subD6[0]+subD7[0])
        sol2=float(subD1[1]+subD2[1]+subD3[1]+subD4[1]+subD5[1]+subD6[1]+subD7[1])
    return None

def smart_process():
    CHUNKS = multiprocessing.cpu_count()  # Number of pieces to break the list into.
    total_len = len(D1) ** 7  # The total length of itertools.product()
    # Figure out how big each chunk should be. Got this from 
    # multiprocessing.map()
    chunksize, extra = divmod(total_len, CHUNKS)
    if extra:
        chunksize += 1

    # Build a list that has the low index and high index for each
    # slice of the list. Each process will iterate over a unique
    # slice
    low = 0 
    high = chunksize
    chunk_list = []
    for _ in range(CHUNKS):
        chunk_list.append((low, high))
        low += chunksize
        high += chunksize

    pool = multiprocessing.Pool(processes=CHUNKS)
    # Use partial so we can pass all the lists to each worker
    # while using map (which only allows one arg to be passed)
    func = functools.partial(smart_calc, D1, D2, D3, D4, D5, D6, D7) 
    result = pool.map(func, chunk_list)
    pool.close()
    pool.join()
    return result

Results:

sequential: 13.9547419548
mp: 4.0270690918

Success! Now, you do have to actually combine the results after you have them, which will add additional overhead to your real program. It might end up making this approach slower than the sequential version again, but it really depends on what you actually want to do with the data.
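
As a sketch of what that combining step could look like, suppose (purely as an example) the goal were to keep only the 20 smallest (sol1, sol2) pairs: each worker can reduce its own slice before returning, so only short lists cross the process boundary. The name smart_calc_topn and the "20 smallest" criterion are stand-ins for whatever filtering and sorting the real program needs:

import heapq
import itertools

def smart_calc_topn(valD1, valD2, valD3, valD4, valD5, valD6, valD7, keep, slices):
    # Same slicing scheme as smart_calc, but each worker returns its
    # `keep` smallest (sol1, sol2) pairs instead of discarding everything.
    prod = itertools.product(([x[1],x[2]] for x in valD1),
                             ([x[1],x[2]] for x in valD2),
                             ([x[1],x[2]] for x in valD3),
                             ([x[1],x[2]] for x in valD4),
                             ([x[1],x[2]] for x in valD5),
                             ([x[1],x[2]] for x in valD6),
                             ([x[1],x[2]] for x in valD7))
    sols = ((sum(t[0] for t in combo), sum(t[1] for t in combo))
            for combo in itertools.islice(prod, slices[0], slices[1]))
    return heapq.nsmallest(keep, sols)

# In smart_process(), the only changes would be along these lines:
#     func = functools.partial(smart_calc_topn, D1, D2, D3, D4, D5, D6, D7, 20)
#     per_worker = pool.map(func, chunk_list)
#     best = heapq.nsmallest(20, itertools.chain.from_iterable(per_worker))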

dano
  • Very nice. How many 'computed' slices would remain? Four? If the number of remaining slices is low (under 20?), then working with them afterwards would be reasonable. There would be a few remaining housekeeping items, like removing all solutions above a certain value (seems easily done at the worker level) and then sorting in descending order - which prompted my original question about using a shared 'global' value (cur_best)... which caused crazy performance lags, leading us to where we are today. Lots to think about, but DEFINITELY headway. – nodoze Aug 12 '14 at 01:41
  • @nodoze We're breaking the work into `cpu_count()` slices. So, assuming your pool has `cpu_count()` workers, each worker would return a single (very large) slice. Now, if you're only interested in a subset of the results, you can pre-sort the slices in each worker and throw away a bunch of the results, which will reduce the size of the list being returned. At that point merging/sorting the returned lists should be fairly straightforward. – dano Aug 12 '14 at 02:44
  • I finally had some time to test and it seems to be working well. I was looking at a line in 'smart_process(): total_len = len(D1) ** 7' and it got me thinking... Will this work if my data lists are of different sizes? What if D1 is 8 elements long, and D5 is 20? I am trying to test my question but it is slow-going. – nodoze Aug 16 '14 at 16:35
  • @nodoze In that case you'd need to multiply the length of each list, rather than using the exponent shortcut. – dano Aug 16 '14 at 17:06
  • So... `total_len = len(D1) * len(D2) * len(D3) * len(D4) * len(D5) * len(D6) * len(D7)` should be OK? Anything else in the code? Still trying to understand the nuts and bolts. – nodoze Aug 16 '14 at 17:09
  • @nodoze Right. I don't think any other changes are required. – dano Aug 16 '14 at 17:20
  • I have noticed that sometimes, when I don't have a lot of data, I get an error: `TypeError: sequence index must be integer, not 'slice'`. When I reduce the number of CHUNKS, it begins to work again. It almost seems like there aren't enough combinations and my slice count is too low. I am wondering if there is a way to identify the needed value for CHUNKS based on the total length of my data set (i.e. `total_len = len(D1) ** 7`)? – nodoze Sep 15 '14 at 21:28
  • @nodoze How little data are we talking about? Is `total_len` ending up smaller than `CHUNKS`? Also, the full Traceback would be helpful. – dano Sep 15 '14 at 21:47
  • Info added to the bottom of my original question. – nodoze Sep 15 '14 at 21:59
  • @nodoze Hmm, the exception is happening in the child process. Can you wrap `smart_calc` in a `try`/`except` and call `traceback.print_exc()` in the `except` block? That way we can see the real stack trace. – dano Sep 16 '14 at 01:04
  • I hope you are out there. I was wondering: how could we leverage itertools.combinations as part of the solution you devised above? For example, let's say we want to iterate through the Cartesian product of `valD1-valD5` (as normal), but at that point we want the combinations of `valD6` and `valD7` before proceeding, i.e. `itertools.product of (valD1 * valD2 * valD3 * valD4 * valD5 (itertools.combinations of valD6 and valD7))`. Does this make sense? – nodoze Oct 15 '14 at 01:11
  • @nodoze I think you would just need to calculate the correct amount of chunks. I think that would be `(len(D1) ** 5) + math.factorial(len(D6) + len(D7)) / (math.factorial(combination_length) * math.factorial(len(D6)+len(D7) - combination_length))`. Assuming I've got the formula for finding the number of combinations in a given set right. Then, in the children, just iterate over `itertools.chain(itertools.product(...), itertools.combinations(itertools.chain(D6, D7), combination_length))`. – dano Oct 15 '14 at 02:25
  • I already devised a little combo counter: `def combocount(n, r): return factorial(n) / (factorial(n-r) * factorial(r))`, but where you confused me was the use of `chain`. In the children, I am building my iterator like so: `prod = itertools.product((valD1, valD2,..), itertools.combinations(valD6, combo_length))` (we're just combining valD6 for now)... and then I am iterating like so: `for subD1, subD2, subD3, subD4, subD5, subD6 in itertools.islice(prod, slices[0], slices[1]):`. Thoughts? (A sketch of the length bookkeeping for this variant follows these comments.) – nodoze Oct 15 '14 at 03:17
  • @nodoze Ah, ok, I didn't realize you were including the `combinations` result in the `itertools.product` call. What you've got seems like it should be fine, then. – dano Oct 15 '14 at 03:33
  • I wrapped it all in a chain just for giggles, and it worked fine... shows you how much I know. – nodoze Oct 15 '14 at 03:51
  • Any chance you can chat? – nodoze Feb 05 '15 at 04:13
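
Following up on that last exchange, here is a sketch of the length bookkeeping for the product-plus-combinations variant (the helper names are made up; this just packages the combocount formula from the comment above so that smart_process() can still split the work into CHUNKS slices):

from math import factorial

def ncombos(n, r):
    # Number of r-length combinations of an n-item list (combocount above).
    return factorial(n) // (factorial(n - r) * factorial(r))

def total_product_length(D1, D2, D3, D4, D5, D6, combo_length):
    # Length of itertools.product(D1, D2, D3, D4, D5,
    #                              combinations(D6, combo_length)),
    # which is the total_len that smart_process() uses to build chunk_list.
    return (len(D1) * len(D2) * len(D3) * len(D4) * len(D5)
            * ncombos(len(D6), combo_length))

# With the 11-element sample lists and combo_length=2:
# 11**5 * ncombos(11, 2) == 161051 * 55 == 8857805 product tuples to slice.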