
My task: given three lists of ints, each with an associated multiplier, determine whether the pooled elements can be rearranged into two lists (with larger multipliers).

I have code that does this. Looped over my whole data set, it takes about 15 seconds (EDIT: fixed errors):

%%cython
cdef bint my_check(
    list pattern1, 
    list pattern2, 
    list pattern3, 
    int amount1,
    int amount2,
    int amount3
):

    cdef dict all_items = dict()
    cdef int i, total_amount = amount1 + amount2 + amount3, m1, m2
    cdef bint bad_split = False

    # Pool the items together.

    for i in range(len(pattern1)):
        all_items[pattern1[i]] = all_items.get(pattern1[i],0) + amount1
    for i in range(len(pattern2)):
        all_items[pattern2[i]] = all_items.get(pattern2[i],0) + amount2
    for i in range(len(pattern3)):
        all_items[pattern3[i]] = all_items.get(pattern3[i],0) + amount3

    # Iterate through possible split points:
    for m1 in range(total_amount//2, total_amount):
        m2 = total_amount - m1
        bad_split = False  # reset for each candidate split point

        # Split items into those with quantities divisible at this point and those without
        divisible = {i:all_items[i] for i in all_items if all_items[i]%m1 == 0}
        not_divisible = {i:all_items[i] for i in all_items if all_items[i]%m1 != 0}

        # Check that all of the element amounts that are not divisible by m1 are divisible by m2.
        for i in not_divisible:
            if not_divisible[i]%m2 != 0:
                bad_split = True
                break
        # If there is an element that doesn't divide by either, try the next split value.
        if bad_split:
            continue
        
        items1 = {i:divisible[i]//m1 for i in divisible}
        items2 = {i:not_divisible[i]//m2 for i in not_divisible}

        if <some other stuff here>:
            return True

    # Tried all of the split points
    return False

Then if this returns True, I run another function to do the combination. On my data set, the my_check() function is being called > 150,000 times (and taking the bulk of the time) and the other function < 500 times, so I'm not too concerned with optimizing that one.

I'd like to parallelize this to improve the performance, but here's what I've found:

  • my first thought was to use numpy functions to take advantage of vectorization, by converting all_items to a numpy array and using np.mod() and np.logical_not() to split the items (plus other numpy functions in the last if clause), but that blows the time up by 3-4x compared to the dict comprehensions (a rough sketch of this attempt is just below this list)
  • if I switch the m1 range to a Cython prange, the compiler complains about using Python objects without the GIL. I switched the dicts to cdef'd numpy arrays, but that was even slower. I tried memoryviews, but they don't seem to be easy to manipulate: I read in another question here that slices can't be assigned to variables, so I don't know how I'd work with them, and it won't let me cdef new variables inside the for loop.
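
For reference, the numpy version of the split from the first bullet looked roughly like this (a reconstruction rather than my exact code; keys/counts are rebuilt from all_items on every call):

import numpy as np

# pooled elements and their quantities, rebuilt from all_items each call
keys = np.fromiter(all_items.keys(), dtype=np.int64)
counts = np.fromiter(all_items.values(), dtype=np.int64)

mask = np.mod(counts, m1) == 0                 # divisible at this split point
items1_vals = counts[mask] // m1
not_div_counts = counts[np.logical_not(mask)]

# the split is bad if anything left over isn't divisible by m2
bad_split = np.any(np.mod(not_div_counts, m2) != 0)

My guess is that rebuilding the arrays on every call swamps any vectorization win, given how small all_items is.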

Since each iteration runs at a different value of m1, and I terminate as soon as any of them returns True, it should be parallelizable without worrying about race conditions.

What should my approach be here? Numpy? Cython? Something else?

I'm happy to post more detailed errors from any of my attempts, but figured that posting them all would get overwhelming. I haven't been able to get profiling or line profiling working for this: I've added the relevant # cython: statements to the top of the Jupyter notebook cell (shown below), but it doesn't find anything when I run it.
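
For reference, this is what I've been putting at the top of the cell (following the Cython docs on profiling; as I understand it, line tracing also needs the binding directive and the CYTHON_TRACE macro):

%%cython
# cython: profile=True
# cython: linetrace=True
# cython: binding=True
# distutils: define_macros=CYTHON_TRACE_NOGIL=1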

EDIT: Per @DavidW's answer I've replaced the middle chunk of code with the following, which cuts the time in half:

items1 = dict()
items2 = dict()
bad_split = False

for k,v in all_items.items():
    if v % m1 == 0:
        items1[k] = v//m1
    elif v % m2 == 0:
        items2[k] = v//m2
    else:
        bad_split = True
        break

I'd still like to find some way of taking advantage of my multi-core processor if that's possible.

  • how big is `all_items`? – Aaron Apr 14 '21 at 18:54
  • I think the largest it should be in this example would be about 12 key-value pairs. In other examples it could be bigger but I'd say always <100. – Jeremy Sadler Apr 14 '21 at 19:19
  • Could you post a working gist of your code with sample inputs/outputs? It's somewhat hard to tell what it does/should do. Btw a small optimization: you can get the quotient and the remainder at the same time with [divmod](https://docs.python.org/3/library/functions.html#divmod) – gsakkis Apr 20 '21 at 09:43

1 Answer


There are definitely some improvements you can make to the loops that don't change the fundamental approach but may be faster. I haven't timed these, so it's worth doing that rather than taking my word for it.

for i in range(len(pattern1)):
    all_items[pattern1[i]] = all_items.get(pattern1[i],0) + amount1

It's generally more idiomatic to iterate by item rather than over a range, and it avoids two list lookups per iteration (sometimes that isn't true in Cython, for example when iterating over numpy arrays, but for a list it's probably true):

for pattern1_i in pattern1:
    all_items[pattern1_i] = all_items.get(pattern1_i,0) + amount1

More significantly you have two loops:

divisible = {i:all_items[i] for i in all_items if all_items[i]%m1 == 0}
not_divisible = {i:all_items[i] for i in all_items if all_items[i]%m1 != 0}

You're wasting a lot of time doing dict-lookups when you could iterate directly over both keys and values. For example

divisible = {k: v for k, v in all_items.items() if v % m1 == 0}

But you're also looping over the dictionary twice and performing the same test twice.

divisible = {}
not_divisible = {}
for k, v in all_items.items():
    if v % m1 == 0:
        divisible[k] = v
    else:
        not_divisible[k] = v

It might well be possible to translate your algorithm to something involving Numpy arrays, but it's a fairly significant change and beyond my interest here.


Addendum: I'm increasingly reluctant to recommend that people use C++ classes in Cython these days, mainly because a) it can often lead to quite awkward code, b) people tend to use it in a cargo-culty way ("it's C++ so it must be faster than Python objects"), and c) people tend to forget about the cost of converting their objects to/from C++ at the start and end of every function.

However, in this case it might actually be a good choice, since your dict objects are uniformly typed and entirely contained within a single function. The key substitution is dict -> unordered_map.

What you want to do (rough outline) is

from libcpp.unordered_map cimport unordered_map

Then type all_items, items1 and items2 as cdef unordered_map[int, int] (I think...). You do this typing outside the loop. The rest of your code then remains largely the same (you may need to find a substitute for dict.get...).
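
A very rough, untested sketch of how that might look (note that operator[] on an unordered_map value-initializes missing keys to 0, which conveniently stands in for dict.get(k, 0)):

%%cython --cplus
from libcpp.unordered_map cimport unordered_map
from libcpp.utility cimport pair

cdef bint my_check(list pattern1, list pattern2, list pattern3,
                   int amount1, int amount2, int amount3):
    cdef unordered_map[int, int] all_items, items1, items2
    cdef int pattern1_i, m1, m2
    cdef int total_amount = amount1 + amount2 + amount3
    cdef bint bad_split
    cdef pair[int, int] kv

    for pattern1_i in pattern1:
        # operator[] default-initializes missing keys to 0, so no .get() needed
        all_items[pattern1_i] = all_items[pattern1_i] + amount1
    # ... likewise for pattern2/pattern3 ...

    for m1 in range(total_amount//2, total_amount):
        m2 = total_amount - m1
        bad_split = False
        items1.clear()
        items2.clear()
        for kv in all_items:               # iterating yields pair[int, int]
            if kv.second % m1 == 0:
                items1[kv.first] = kv.second // m1
            elif kv.second % m2 == 0:
                items2[kv.first] = kv.second // m2
            else:
                bad_split = True
                break
        if bad_split:
            continue
        # if <some other stuff here>:
        #     return True

    return False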

Once you've got it working as a serial calculation, you should be able to turn your for m1 in range(total_amount//2, total_amount): into a prange loop, and assuming everything is correctly typed it should work in parallel. Obviously the if <some other stuff here> is a big unknown.

You must treat all_items as strictly read-only during the loop to avoid race conditions. However, items1 and items2 should, I hope, be correctly identified as loop-local variables by Cython.
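
As a very rough skeleton of the parallel version (untested; whether Cython correctly privatizes the C++ objects is the part most likely to need work, and the <some other stuff here> section must also be nogil-compatible):

from cython.parallel import prange

    # ... all_items built as before, then:
    for m1 in prange(total_amount//2, total_amount, nogil=True):
        m2 = total_amount - m1      # assigned inside the loop, so thread-private
        items1.clear()              # intended to be loop-local: worth verifying
        items2.clear()
        # ... per-m1 split into items1/items2 exactly as in the serial
        # version above; all_items is only ever read ...
        if <some other stuff here>:
            return True             # prange supports return in nogil mode
    return False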

Here's a fairly similar answer to use as a starting point. For future readers: please think twice about whether you really need to convert all your Python objects to C++ ones; you probably don't.

DavidW
  • Thank you! Iterating by item slowed it down, but shifting the divisible/not_divisible/items1/items2 comprehensions to a single for loop over keys and values cut the total time down from 15 seconds to 6. – Jeremy Sadler Apr 15 '21 at 08:25
  • There are a few other for loops in your code that could iterate by keys and values. You should look at them too (but the difference will probably be smaller for them, since they're a little simpler) – DavidW Apr 15 '21 at 08:45
  • @JeremySadler I've added a very-sketched out idea of how you can use C++ classes to get a bit more speed/make your algorithm parallel. Obviously there's a lot of blanks to fill it. Please try it in two steps: convert the types _then_ think about `prange`. – DavidW Apr 15 '21 at 16:47
  • Great, thank you! Using a C++ unordered_map and pair to iterate instead of a dict and a tuple cut the time down from about 6 seconds to 4.5. When I put the `prange` in, it compiled and ran, but the function stopped returning `True`, so diagnosing that is my next task. – Jeremy Sadler Apr 16 '21 at 18:21
  • Potentially you have a race condition between a thread reaching `break` and a thread reaching `True`. To me that suggests it isn't really parallel (or at least needs some careful refactoring to become parallel) – DavidW Apr 16 '21 at 19:20
  • I tried putting the `return True` statement immediately after the `for m1 in prange`, so it should *always* hit, and it's still not ever returning `True`; I tried sticking the `return` under a `with gil` and then the cell wouldn't compile. So, the behavior is a little odd. – Jeremy Sadler Apr 18 '21 at 16:20