
I am working on an optimization problem with scipy.optimize, aimed at computing some 3D maps. Given 3 real data volumes (vol0, vol1, vol2), my aim is to estimate maps of 3 parameters (map_0, map_1, map_2) by voxel-wise fitting of some function of the voxel intensity to its model.

So far, this is the idea I am starting from:

import numpy as np
from scipy.optimize import minimize

def objective( x, arg0, arg1 ):

    vol_model0 = someFun( x[0], arg1, ... ) # model value for vol0, which needs x[0] and arg1
    vol_model1 = someFun( x[1], arg1, ... ) # model value for vol1, which needs x[1] and arg1
    vol_model2 = someFun( x[2], arg1, ... ) # model value for vol2, which needs x[2] and arg1

    RSS = np.sum( [ np.power( ( arg0[0] - vol_model0 ), 2 )
                  + np.power( ( arg0[1] - vol_model1 ), 2 )
                  + np.power( ( arg0[2] - vol_model2 ), 2 )
                    ] )
    return RSS


arg1 = [1, 2, 3, 4]

vol0 = 5 * np.ones([100, 100, 100])
vol1 = 3 * np.ones([100, 100, 100])
vol2 = 4 * np.ones([100, 100, 100])

map_0 = np.zeros([100,100,100])
map_1 = np.zeros([100,100,100])
map_2 = np.zeros([100,100,100])

x0    = [5, 5, 5] 
bnds  = ( (1,10), (1, 10), (1, 10) )

for i0 in range(100):
    for i1 in range(100):
        for i2 in range(100):
            arg0 = [ vol0[i0, i1, i2],
                     vol1[i0, i1, i2],
                     vol2[i0, i1, i2] ]
            res = minimize( objective, x0, args = ( arg0, arg1 ), bounds = bnds )
            ( map_0[i0, i1, i2],
              map_1[i0, i1, i2],
              map_2[i0, i1, i2] ) = res.x

My question is: given this constrained optimization problem, is there a way to make the whole process faster? The nested for loops take prohibitively long to complete. Is there a way to parallelize this problem, or otherwise speed it up?

  • Would you mind stating a few hard facts: **A)** How long does one call to **`minimize( objFUN, x0, args = ( ... ), ... )`** take, in **`[s]`**? **B)** What is your target acceptable time-to-complete, in **`[s]`**? **C)** What hardware resources are available - RAM [GB]? CPU [number of cores]? Other? *Thank you.* – user3666197 May 22 '20 at 16:20
  • Hi, thanks for reaching out! A) The minimize function takes approximately 0.1 s per call. B) For the intended purpose, an acceptable time would be a tenth of the current duration. C) I have 8 GB of RAM and an Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz (8 logical CPUs). Please let me know if you need any further details. – Marco Andrea Zampini May 22 '20 at 16:43

1 Answer


The problem is more about the efficiency of the processing itself than about using any concurrent processing.

Let me propose a few steps to take even before looking into concurrency :

1 )
Given ~ 1E6 calls to minimize() will be made, trim all per-call overheads - each [us] saved per call saves another whole [s] at the finish line.

2 )
Avoid any unrelated overheads ( why make Python create and maintain a named variable that will never be re-used? You pay the costs without ever getting any benefit back ).

3 )
Do your best to boost performance and, if possible, numba-compile the code of the objective() function;
here every [us] shaved off the someFun( aFloat64, anInt64[], ... ) run-time will save you a further ~ 3 [s] times the number of minimize() re-iterations at the finish line ( someFun() is called 3 times per objective() evaluation, over 1E6 voxels )...

Worth doing, isn't it?


AN IMMEDIATE BONUS :
A cheap, low-hanging fruit :
Vectorize & harness the full power of the NumPy tools

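( For reference : the transcript below does not define aClk, arg0 or v0..v2. One minimal setup could be the following sketch, where the zmq.Stopwatch [us]-clock and the placeholder values are assumptions of this sketch, not of the original session - the exact values printed in the transcript came from the author's own data : )

import numpy as np
from zmq import Stopwatch              # a [us]-resolution clock : .start() / .stop()

aClk = Stopwatch()                     # ASSUMPTION: the [us]-stopwatch used in the transcript
arg0 = np.asarray( [ 1.0, 1.1, 1.5 ] ) # ASSUMPTION: a placeholder voxel triplet
v0 = v1 = v2 = 1.23456789              # placeholder model values, matching the cross-check below
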
>>> aClk.start(); _ = np.sum( # _______________________________ AVOID I/O on printing here by assignment into _ var
                              [ np.power( ( arg0[0] - v0 ), 2 ),
                                np.power( ( arg0[1] - v1 ), 2 ),
                                np.power( ( arg0[2] - v2 ), 2 )
                                ] #___________________________ LIST-constructor overheads
                              ); aClk.stop() # _______________.stop() the [us]-clock
225 [us]
164 [us]
188 [us]
157 [us]
175 [us]
199 [us]
201 [us]
201 [us]
171 [us]

>>> RSS = _
>>> RSS
0.16506628505715615

Cross-validate the correctness of the method :

>>> arg0_dif     = arg0.copy()
>>> vV           = np.ones( 3 ) * 1.23456789
>>> arg0_dif[:] -= vV[:]
>>> np.dot( arg0_dif, arg0_dif.T )
0.16506628505715615

+5x ~ +7x FASTER CODE :

>>> aClk.start(); _ = np.dot( arg0_dif, arg0_dif.T ); aClk.stop()
39 [us]
38 [us]
28 [us]
28 [us]
27 [us] ~ +5x ~ +7x FASTER, just by using a more efficient way to produce the RSS calculation

The Next Step :

# @numba.jit( 'f8( f8[:], f8[:], i8[:] )', nogil = True, ... ) # enable once someFun() is numba-compilable
def aMoreEfficientObjectiveFUN( x, arg0, arg1 ):
    ###############################################################################################
    # BEST TO VECTORISE someFun() into someFunVECTORISED(),
    #              so as to work straight on the whole x-vector and return a vector of model values
    # diff = arg0 - someFunVECTORISED( x, arg1, ... ) #############################################
    diff     = np.array( arg0, dtype = np.float64 ) # a fresh copy - never mutate the caller's arg0
    diff[0] -= someFun( x[0], arg1, ... )           # model value, which needs x[0] and arg1
    diff[1] -= someFun( x[1], arg1, ... )           # model value, which needs x[1] and arg1
    diff[2] -= someFun( x[2], arg1, ... )           # model value, which needs x[2] and arg1

    return  np.dot( diff, diff )                    # just this change gets ~ 5x ~ 7x FASTER processing
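
A sketch of wiring this into the per-voxel loop - note arg0 is now passed as a float64 ndarray ( not a python list ), which the in-objective copy and the np.dot() trick rely on :

arg0 = np.asarray( [ vol0[i0, i1, i2],
                     vol1[i0, i1, i2],
                     vol2[i0, i1, i2] ] )
res  = minimize( aMoreEfficientObjectiveFUN, x0, args = ( arg0, arg1 ), bounds = bnds )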

Having done this properly, the rest is easy :

If, under the revised, overhead-strict Amdahl's Law, the achievable speed-up can still justify paying all the add-on costs of :
+ SER/DES ( pickling the parameters before sending them ),
+ XFER main-to-process,
+ SER/DES ( pickling the results before sending them back ),
+ XFER process-to-main,
then multiprocessing may help you split the 1E6 fully independent calls among worker processes and re-collect the results back into a map_tensor[:,i0,i1,i2]
~ ( map_0[i0,i1,i2], map_1[i0,i1,i2], map_2[i0,i1,i2] ), as sketched below.
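
( A minimal sketch of such a split - the solve_voxel() helper, the chunksize value and the Pool defaults are illustrative assumptions, not part of the original post; a fork-based start method is assumed, so that vol0..vol2, arg1, x0 and bnds are visible inside the workers : )

import numpy as np
from multiprocessing import Pool
from scipy.optimize import minimize

def solve_voxel( aVoxelTriplet ):                   # HYPOTHETICAL per-voxel worker
    res = minimize( aMoreEfficientObjectiveFUN,     # the revised objective from above
                    x0,
                    args   = ( np.asarray( aVoxelTriplet ), arg1 ),
                    bounds = bnds
                    )
    return res.x

if __name__ == '__main__':
    voxels = np.stack( ( vol0, vol1, vol2 ), axis = -1 ).reshape( -1, 3 ) # ( 1E6, 3 )
    with Pool() as aPoolOfWORKERs:                  # pays SER/DES + XFER costs per chunk
        results = aPoolOfWORKERs.map( solve_voxel, voxels, chunksize = 1000 )
    map_tensor = np.asarray( results ).reshape( 100, 100, 100, 3 )
    map_0, map_1, map_2 = ( map_tensor[..., 0],
                            map_tensor[..., 1],
                            map_tensor[..., 2] )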

If the add-on overhead costs do not get justified, keep the work flow without distribution among processes. If you are still curious and keen to experiment, you may try running this inside thread-based concurrent processing: the numpy parts that do not need to hold the GIL-lock escape the GIL-driven re-[SERIAL]-isation, so some latency-masking from the overlaid concurrent processing may appear. Only an in-vivo test will either prove this or show that there is no additional advantage available from doing so in the python as-is ecosystem.
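
( An equally minimal thread-based sketch, re-using the illustrative solve_voxel() and voxels from the sketch above - threads share memory, so the SER/DES + XFER costs disappear, and any speed-up then depends on how much of the work actually releases the GIL : )

from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor( max_workers = 8 ) as anExecutorOfTHREADs:   # 8 logical CPUs reported above
    results = list( anExecutorOfTHREADs.map( solve_voxel, voxels ) ) # no pickling, shared memory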


Epilogue :

It would be nice & fair to give us your feedback on the runtime improvements - how far code-efficiency alone got you from the initial ~ 27 minutes ( the as-was state ):
-- after the smart np.dot() trick,
-- + after a fully vectorised objective( x, arg0, arg1 ), where someFun() can process vectors and smart-return the np.dot() by itself,
-- + after @numba.jit( ... )-decorated processing, if such results show themselves to be even more efficient.

user3666197
  • Thanks! I am implementing the changes step by step. One issue I am having is that the argument `arg0` of the objective function gets updated at each iteration of the minimize function, so I need to create a new variable within the objective function anyway. So far I haven't found a workaround to avoid this. – Marco Andrea Zampini May 25 '20 at 15:49