My task: take 3 lists of ints, each with a multiplier, and see if the elements can be rearranged to make two lists (with larger multipliers). For example (made-up numbers): [1, 1] with multiplier 2, [1, 2] with multiplier 2, and [2] with multiplier 2 pool to six 1s and four 2s, which can be rearranged into [2] with multiplier 4 plus [1, 1, 1] with multiplier 2.
I have code that does this; looped over my whole data set, it takes about 15 seconds (EDIT: fixed errors):
%%cython
cdef bint my_check(
    list pattern1,
    list pattern2,
    list pattern3,
    int amount1,
    int amount2,
    int amount3
):
    cdef dict all_items = dict()
    cdef int i, total_amount = amount1 + amount2 + amount3, m1, m2
    cdef bint bad_split
    # Pool the items together, weighting each pattern by its multiplier.
    for i in range(len(pattern1)):
        all_items[pattern1[i]] = all_items.get(pattern1[i], 0) + amount1
    for i in range(len(pattern2)):
        all_items[pattern2[i]] = all_items.get(pattern2[i], 0) + amount2
    for i in range(len(pattern3)):
        all_items[pattern3[i]] = all_items.get(pattern3[i], 0) + amount3
    # Iterate through possible split points:
    for m1 in range(total_amount // 2, total_amount):
        m2 = total_amount - m1
        bad_split = False  # must be reset for every split point
        # Split items into those with quantities divisible at this split point and those without.
        divisible = {i: all_items[i] for i in all_items if all_items[i] % m1 == 0}
        not_divisible = {i: all_items[i] for i in all_items if all_items[i] % m1 != 0}
        # Check that all of the element amounts that are not divisible by m1 are divisible by m2.
        for i in not_divisible:
            if not_divisible[i] % m2 != 0:
                bad_split = True
                break
        # If there is an element that doesn't divide by either, try the next split value.
        if bad_split:
            continue
        items1 = {i: divisible[i] // m1 for i in divisible}
        items2 = {i: not_divisible[i] // m2 for i in not_divisible}
        if <some other stuff here>:
            return True
    # Tried all of the split points.
    return False
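(Because the function is cdef, it isn't callable from Python directly; I call it from other Cython code. For testing in the same cell, a thin def wrapper like this works; run_check is just a made-up name:)

def run_check(p1, p2, p3, a1, a2, a3):
    # Python-visible wrapper around the cdef function, for testing only.
    return my_check(p1, p2, p3, a1, a2, a3)

# e.g. run_check([1, 1], [1, 2], [2], 2, 2, 2) pools to {1: 6, 2: 4}
# and finds the split m1 = 4, m2 = 2 (assuming the final check passes).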
Then if this returns True, I run another function to do the combination. On my data set, the my_check() function is being called > 150,000 times (and taking the bulk of the time) and the other function < 500 times, so I'm not too concerned with optimizing that one.
I'd like to parallelize this to improve the performance, but here's what I've found:
- My first thought was to use numpy functions to take advantage of vectorization: convert all_items to a numpy array, use np.mod() and np.logical_not() to split the items, and use other numpy functions in the last if clause (rough sketch below). But that blows the time up 3-4x compared to the dict comprehensions.
- If I switch the m1 range to a Cython prange, the compiler complains about using Python objects without the GIL. I switched the dicts to cdef'd numpy arrays, but that was even slower.
- I tried using memoryviews, but they don't seem to be easily manipulated? I read in another question here that slices can't be assigned to variables, so I don't know how I'd work with them. It also won't let me cdef new variables inside the for loop.
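To make the first bullet concrete, this is roughly the shape of the numpy version I tried (reconstructed from memory, so treat it as a sketch; keys and counts are parallel arrays built once per call):

import numpy as np

# keys[i] is a distinct element, counts[i] its pooled quantity.
keys = np.fromiter(all_items.keys(), dtype=np.int64, count=len(all_items))
counts = np.fromiter(all_items.values(), dtype=np.int64, count=len(all_items))

for m1 in range(total_amount // 2, total_amount):
    m2 = total_amount - m1
    div_mask = np.mod(counts, m1) == 0
    not_div_mask = np.logical_not(div_mask)
    if np.any(counts[not_div_mask] % m2 != 0):
        continue  # some quantity fits neither multiplier: bad split
    items1 = counts[div_mask] // m1      # quantities for the first new pattern
    items2 = counts[not_div_mask] // m2  # quantities for the second new pattern
    # <some other stuff here>, operating on keys/items1/items2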
Since each value of m1 is checked independently, and I terminate as soon as any of them returns True, it should be parallelizable without worrying about race conditions.
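For what it's worth, the direction I'm picturing is something like the sketch below. It only parallelizes the divisibility screen (the <some other stuff here> part would still have to run with the GIL, for the few split points that pass), and check_splits and counts are names I've made up for illustration:

%%cython --compile-args=-fopenmp --link-args=-fopenmp
# (use /openmp instead for MSVC)
from cython.parallel import prange
cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
@cython.cdivision(True)  # C semantics for %; safe here since all quantities are positive
def check_splits(long long[::1] counts, long long total_amount):
    # counts = the values of all_items, copied once into a typed array.
    cdef long long m1, m2
    cdef Py_ssize_t j, n = counts.shape[0]
    cdef int ok, found = 0
    for m1 in prange(total_amount // 2, total_amount, nogil=True, schedule='static'):
        m2 = total_amount - m1
        ok = 1
        for j in range(n):
            # Every pooled quantity must divide by m1 or by m2.
            if counts[j] % m1 != 0 and counts[j] % m2 != 0:
                ok = 0
                break
        found |= ok  # the in-place |= makes found an OR-reduction; prange has no early exit
    return found != 0

Feeding it would be counts = np.fromiter(all_items.values(), dtype=np.longlong) (np.longlong matches C long long), and when check_splits returns True I'd re-run the sequential version to recover m1, m2 and the actual split.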
What should my approach be here? Numpy? Cython? Something else?
I'm happy to post more detailed errors from any of my attempts, but figured that posting them all would get overwhelming. I haven't been able to get profiling or line profiling working for this: I've added the relevant # cython: directives to the top of the Jupyter notebook cell, but the profiler doesn't find anything when I run it.
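For reference, these are the directives I've been adding, following the Cython profiling docs (maybe I'm missing the CYTHON_TRACE macro, which line tracing needs at compile time?):

%%cython --compile-args=-DCYTHON_TRACE_NOGIL=1
# cython: profile=True
# cython: linetrace=True
# cython: binding=True
# (profile=True enables cProfile hooks, including for cdef functions;
#  linetrace=True needs the CYTHON_TRACE / CYTHON_TRACE_NOGIL macro defined at compile time;
#  binding=True lets line_profiler attach to the functions.)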
EDIT: Per @DavidW's answer, I've replaced the middle chunk of code with the following single pass over all_items, which cuts the time in half:
items1 = dict()
items2 = dict()
bad_split = False
for k, v in all_items.items():
    if v % m1 == 0:
        items1[k] = v // m1
    elif v % m2 == 0:
        items2[k] = v // m2
    else:
        # This quantity fits neither multiplier, so the split fails.
        bad_split = True
        break
I'd still like to find some way of taking advantage of my multi-core processor if that's possible.