
I have been working on a project that, at one point, requires a highly optimized version of the algorithm used in the calculations. I would like to know which way to go: whether this can be done efficiently in Python, or whether I should implement a C++ routine that does exactly this. Here is a sample of the code that is the bottleneck:

import numpy as np
from numba import jit

from time import time

# this generates the outer product by rows of arrays 1 and 2
@jit( nopython=True )
def arrays_product( array_1, array_2 ):

    dim_x       = array_1.shape[0]
    dim_y       = array_2.shape[0]
    total_dim   = dim_x*dim_y
    total_depth = array_1.shape[1]

    result      = np.empty( ( total_dim, total_depth ), dtype=np.complex128 )
    
    counter = 0
    for i in range( dim_x ):
        for j in range( dim_y ):            
            result[ counter, : ] = np.multiply( array_1[ i, : ], array_2[ j, : ] )
            
            counter += 1
            
    return result


def create_complex_array( dim ):    
    return np.random.rand( dim  ) + np.random.rand( dim )*1j


prefactor        = np.pi
total_iterations = 40
tot_points       = 3000
dim_0            = 4000
dim              = 100   # the outer product dimension will be dim**2
total_combs      = 10000

#create some dictionary keys, composed of 12 integers each; to each key, a complex array is associated

combination_keys     = np.random.randint( low=-128, high=128, size=(total_combs, 12), dtype=np.int8 ) # high is exclusive in randint
combs_dict           = { tuple( x ): create_complex_array( tot_points )  for x in combination_keys }

comb_keys_array      = np.array( list( combs_dict.keys()  ) ) # put keys in array form

#source array used to make the calculations
the_source_array     = np.random.rand( dim_0, tot_points  ) + np.random.rand( dim_0, tot_points )*1j


#create an array of masks, from which different keys of the dictionary will be selected later
comb_masks_p = []


for k in range( total_iterations ):
    
    random_choice  = np.random.randint( low=0, high=total_combs, size=dim**2 ) # high is exclusive in randint
    comb_masks_p.append( random_choice  )
    
comb_masks  = np.array( comb_masks_p )

#this will select two different sets of elements of the_source_array
array_masks = np.random.randint( low=0, high=dim_0, size=( total_iterations, 2, dim ) ) # high is exclusive in randint


for k in range( total_iterations ):
    
    ts = time()
    
    selection_combs = comb_keys_array[  comb_masks[ k, : ]  ] 
    keys_in_tuples  = list( map( tuple, selection_combs ) ) # convert selected keys for the dictionary in list of tuples
    
    # we iterate over the subset of selected keys and generate the corresponding complex array
    # this will be slow when the total number of keys is high
    array_1         = np.array( [ combs_dict[ key ] for key in keys_in_tuples ] )    
    
    # select left and right components to generate outer product by rows
    
    mask_left       = array_masks[ k, 0, : ]
    mask_right      = array_masks[ k, 1, : ]
    array_left      = the_source_array[  mask_left, : ] 
    array_right     = the_source_array[  mask_right, : ]      
    
    # this part is also slow 
    product_array   = arrays_product( array_left, array_right )
    
    # finally, use array_1 to make a product element-wise and reduce on the first axis
    
    result          = ( prefactor*array_1*product_array ).sum( axis=0 )

    print("Iteration time: ",  time() - ts )

DESCRIPTION: We have a dictionary storing 1-D complex arrays, which on a given iteration need to be selected to build a 2-D array from them. The arrays are selected by accessing the dictionary elements through unique n-tuple keys. The NumPy array created from the dictionary always has the same dimensions, ( dim**2, tot_points ).

On top of that, we have a 2-D "source_array" from which values are selected by rows. The appropriate values on a given iteration are selected using fancy indexing or masks. This selection generates two arrays ( array_left, array_right ) that are later used to generate an outer product in the row-wise sense ( see the function "arrays_product" with the Numba decorator ). Once this outer "product_array" is created, with dimensions ( dim**2, tot_points ), "array_1" multiplies it element-wise, and a sum is carried out over axis 0.
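For reference, the row-wise outer product computed by "arrays_product" can also be written as a single broadcasting expression (a sketch; it produces the same ( dim**2, tot_points ) output, with the ( i, j ) pairs laid out in the same row-major order as the counter in the function):

product_array = ( array_left[ :, None, : ]*array_right[ None, :, : ] ).reshape( dim**2, tot_points )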

Note that in all cases the critical point is increasing the "dim" parameter, as this makes the arrays from the outer product very big, as well as the total number of keys that have to be selected from the dictionary to build "array_1".
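For example, with dim = 100 and tot_points = 3000, "product_array" and "array_1" each hold dim**2 * tot_points = 30,000,000 complex128 values, about 0.48 GB apiece, and doubling "dim" quadruples that.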

WHAT IS BEST? I believe the above code is highly inefficient for large datasets. I am looking for a way to optimize it, either in pure Python or by creating a C++ routine that will be called to perform the operations. The questions are:

  1. Can I expect much of a difference with a C++ approach? I would need to pass the dictionary as well as all the other objects, and I don't know how much of a gain to expect here. I am also open to other approaches such as Cython, but I think the dictionary object would be the big problem there, along with the other Python calls that occur during a single iteration.

  2. Using pure Python, and particularly the numpy module, is there a way to perform the contractions ( the last two steps ) in a single call to numpy's einsum or similar? I think this should be straightforward given the type of operations involved, but I have not been able to find it ( see the sketch after these questions ).

  3. There is no workaround for the external loop. The external loop has to exist because, in my main code, there is an additional step not reflected here ( which is not important for performance ).
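To make question 2 concrete, a fused version might look like this (a sketch I have not verified at scale; it assumes "array_1" keeps the same row-major ( i outer, j inner ) ordering that "arrays_product" produces, and it avoids materializing "product_array" altogether):

array_1_3d = array_1.reshape( dim, dim, tot_points )
result     = prefactor*np.einsum( 'it,jt,ijt->t', array_left, array_right, array_1_3d, optimize=True )

With optimize=True, einsum is free to contract "array_1_3d" against "array_left" first, so the largest intermediate is ( dim, tot_points ) rather than ( dim**2, tot_points ).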

I hope someone can help, thanks!

Zarathustra
  • *"Using pure Python, and particularly the numpy module"* Numpy is mostly implemented in C, especially when it comes to the workload parts. – Klaus D. Dec 10 '20 at 11:23
  • first off, `arrays_product( array_1, array_2 )` is just `np.kron(array1[:, None], array2)` – Daniel F Dec 10 '20 at 11:46
  • second off, there's way too many questions in one here and none of them are minimal. – Daniel F Dec 10 '20 at 11:58
  • @DanielF I believe the np.kron you are suggesting is not of use here if one does not want to run into memory issues... I just tried your suggestion and there is a memory blow-up. There are two main questions, one relating on how to actually make this more efficient in pure python (if possible at all); the second is how much gain is expected to happen in C++ implementation of this specific routine, by making use of the dictionary possibly as a std::map – Zarathustra Dec 10 '20 at 12:12
  • `np.kron` has the exact same output, so if you're having memory errors with one but not the other, I don't know what to tell you. – Daniel F Dec 10 '20 at 12:30
  • Also the "this" you want to make more efficient is a page-long codedump with at least 3 discrete parts to it. And then there's a `c++` question. How about this: if you want to ask about `c++`, tag it. I dare you. – Daniel F Dec 10 '20 at 12:30
  • Optimizing the last part is quite obvious. A third loop level with `for k in range(total_depth):` and directly making the reduction `result[k] += prefactor*array_11[i*dim_y+j,k]*array_1[i,k]*array_2[j,k]` (also change the preallocation of res, and pass in arr_1). This will lead to a factor of 3. But for efficient code a complete rewrite is necessary (no dicts, list comprehensions etc., just simple loops and arrays). To begin with, it would be good to have the question in the form (inputs, function without globals, output). – max9111 Dec 10 '20 at 13:38 ( a sketch of this follows below the comments )
  • @max9111 Thank you for the comment; about the complete rewrite, I also considered passing the comb_keys as a previously known array and, instead of using a dictionary, using a 2-D ndarray to hold the vectors. Then I can use indexing ( the precise indices of the keys in comb_keys_array ) to get the desired elements, but I don't see much of a difference with respect to the list-comprehension approach to building array_1. Do you mean something along those lines? – Zarathustra Dec 10 '20 at 15:02
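
Following max9111's comment above, a minimal sketch of the fused reduction (a hypothetical helper "fused_contraction" that replaces both "arrays_product" and the final element-wise product/sum, never allocating the ( dim**2, tot_points ) intermediate):

@jit( nopython=True )
def fused_contraction( array_1, array_left, array_right, prefactor ):
    dim_x = array_left.shape[0]
    dim_y = array_right.shape[0]
    depth = array_left.shape[1]
    result = np.zeros( depth, dtype=np.complex128 )
    for i in range( dim_x ):
        for j in range( dim_y ):
            row = i*dim_y + j
            for t in range( depth ):
                # accumulate directly instead of building product_array first
                result[ t ] += prefactor*array_1[ row, t ]*array_left[ i, t ]*array_right[ j, t ]
    return result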

0 Answers