The problem is related more to the efficiency of the processing than to any concurrent processing.
Let me propose a few steps worth taking before going into any concurrency-related moves :
1 ) Given ~ 1E6 calls will be made, optimize all function overheads, as each [us] saved will save another whole [s] on the finish line ( a minimal per-call timing sketch follows right after this list )
2 ) Avoid any unrelated overheads ( why make python create a variable and maintain a variable name if it is never re-used? You just pay the costs without ever getting any benefit in return )
3 ) Do your best to boost the performance and, if possible, numba-compile the code of the objective() function; here every [us] shaved off the someFun( aFloat64, anInt64[], ... ) run-time will save you a further ~ 3 [s] times the number of minimize()-driven re-iterations on the finish line...
Worth doing, isn't it?
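A minimal sketch of such a per-call cost measurement ( the objective() and its arguments below are illustrative placeholders only, the timeit-based measuring is just one of many possible choices ) :

import numpy as np
import timeit

def objective( x, arg0, arg1 ):                      # placeholder only, stands in for your real objective()
    return np.sum( ( arg0 - x )**2 ) + arg1

x    = np.array( [ 1.0, 2.0, 3.0 ] )                 # illustrative inputs
arg0 = np.array( [ 1.1, 2.2, 3.3 ] )
arg1 = 0.0

nCALLs       = 10000
aPerCall_SEC = timeit.timeit( lambda: objective( x, arg0, arg1 ),
                              number = nCALLs
                              ) / nCALLs
print( "~ {0:8.1f} [us] / call".format(      1E6 * aPerCall_SEC ) )
print( "~ {0:8.1f} [s]  / 1E6 calls".format( 1E6 * aPerCall_SEC ) ) # the same number - each [us] per call ~ a whole [s] over 1E6 calls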
AN IMMEDIATE BONUS :
A cheap & low-hanging fruit : vectorise & harness the full power of the numpy tools
>>> aClk.start(); _ = np.sum( # ______________________________ AVOID I/O on printing here by assignment into the _ var
                              [ np.power( ( arg0[0] - v0 ), 2 ),
                                np.power( ( arg0[1] - v1 ), 2 ),
                                np.power( ( arg0[2] - v2 ), 2 )
                                ] # ___________________________ LIST-constructor overheads
                              ); aClk.stop() # ________________ .stop() the [us]-clock ( aClk ~ e.g. a zmq.Stopwatch() instance )
225 [us]
164 [us]
188 [us]
157 [us]
175 [us]
199 [us]
201 [us]
201 [us]
171 [us]
>>> RSS = _
>>> RSS
0.16506628505715615
Cross-validate the correctness of the method :
>>> arg0_dif = arg0.copy()
>>> vV = np.ones( 3 ) * 1.23456789 # re-creates the same ( v0, v1, v2 ) values as used above
>>> arg0_dif[:] -= vV[:]
>>> np.dot( arg0_dif, arg0_dif.T )
0.16506628505715615
+5x ~ +7x FASTER CODE :
>>> aClk.start(); _ = np.dot( arg0_dif, arg0_dif.T ); aClk.stop()
39 [us]
38 [us]
28 [us]
28 [us]
27 [us] ~ +5x ~ +7x FASTER just by using a more efficient way to produce the RSS calculation
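For an easy re-check on your own machine, the same comparison can be re-run via timeit ( the input values below are illustrative only, absolute timings are machine-dependent ) :

import numpy as np
import timeit

arg0       = np.array( [ 1.5, 1.4, 1.3 ] )           # illustrative values only
v0, v1, v2 = 1.23456789, 1.23456789, 1.23456789
arg0_dif   = arg0 - np.array( [ v0, v1, v2 ] )

slowRSS = lambda: np.sum( [ np.power( ( arg0[0] - v0 ), 2 ),
                            np.power( ( arg0[1] - v1 ), 2 ),
                            np.power( ( arg0[2] - v2 ), 2 )
                            ] )                      # list-constructor + np.power overheads
fastRSS = lambda: np.dot( arg0_dif, arg0_dif )       # one cheap dot-product

print( timeit.timeit( slowRSS, number = 10000 ) )    # [s] for 10000 calls, the slower way
print( timeit.timeit( fastRSS, number = 10000 ) )    # [s] for 10000 calls, the faster way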
The Next Step :
"""
@numba.jit( signature = 'f8[:], f8[:], i8[:]', nogil = True, ... ) """
def aMoreEfficientObjectiveFUN( x, arg0, arg1 ):
###############################################################################################
# BEST 2 VECTORISE THE someFun() into someFunVECTORISED(),
# so as to work straight with arg0-vector and return vector of differences
#rg0[:] -= someFunVECTORISED( arg0, arg1, ... ) ##########################################
arg0[0] -= someFun( arg0[0], arg1, ... ) # *model value for arg0 which needs arg0[0] and arg1*
arg0[1] -= someFun( arg0[1], arg1, ... ) # *model value for arg0 which needs arg0[0] and arg1*
arg0[2] -= someFun( arg0[2], arg1, ... ) # *model value for arg0 which needs arg0[0] and arg1*
return np.dot( arg0, arg0.T ) # just this change gets ~ 5x ~ 7x FASTER processing ............
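For completeness, a minimal, self-contained sketch of how such a vectorised, np.dot()-based objective plugs into the scipy.optimize.minimize() call ( someFunVECTORISED(), x0, arg0, arg1 and the solver choice are illustrative assumptions only, not taken from your code ) :

import numpy as np
from scipy.optimize import minimize

def someFunVECTORISED( x, arg1 ):                    # hypothetical vectorised stand-in for someFun()
    return x[0] + x[1] * arg1                        # toy model only, replace with the real one

def aVectorisedObjectiveFUN( x, arg0, arg1 ):
    diff = arg0 - someFunVECTORISED( x, arg1 )       # vector of residuals, arg0 left untouched
    return np.dot( diff, diff )                      # RSS via the cheap dot-product trick

x0   = np.zeros( 2 )                                 # illustrative initial guess
arg0 = np.array( [ 1.2, 3.4, 5.6 ] )                 # illustrative "observed" values
arg1 = np.array( [ 1.0, 2.0, 3.0 ] )                 # illustrative extra parameters

res  = minimize( aVectorisedObjectiveFUN,            # the overhead-trimmed objective
                 x0,
                 args   = ( arg0, arg1 ),            # extra parameters passed through untouched
                 method = 'Nelder-Mead'              # illustrative solver choice
                 )
print( res.x, res.fun )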
Having done this properly, the rest is easy :
In case the revised, overhead-strict Amdahl's Law can still justify paying all the add-on costs of :
+ SER/DES ( pickling the parameters before sending them )
+ XFER main-2-process
+ SER/DES ( pickling the results before sending them back )
+ XFER process-2-main,
the multiprocessing may help you split the ~ 1E6 fully independent calls and re-collect the results back into the map_tensor[:,i0,i1,i2] ~ ( map_0[i0,i1,i2], map_1[i0,i1,i2], map_2[i0,i1,i2] ), as sketched right below.
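A minimal, self-contained sketch of such a splitting via a multiprocessing.Pool ( the grid shape, the solveOneCell() worker and its dummy per-cell result are illustrative placeholders, not your actual code ) :

import itertools
import multiprocessing as mp
import numpy as np

SHAPE = ( 10, 10, 10 )                               # small demo grid; scale up toward the real ~ 1E6 cells

def solveOneCell( idx ):                             # placeholder per-cell worker
    i0, i1, i2 = idx
    # ... call minimize( objective, x0, args = ( arg0, arg1 ) ) for this cell here ...
    return idx, np.array( [ i0, i1, i2 ], dtype = np.float64 )  # dummy 3-vector standing in for the real result

if __name__ == '__main__':
    map_tensor = np.empty( ( 3, ) + SHAPE )          # ~ ( map_0, map_1, map_2 ) stacked along axis 0
    with mp.Pool() as aPool:                         # each task pays the SER/DES + XFER add-on costs
        for ( i0, i1, i2 ), vec in aPool.imap_unordered( solveOneCell,
                                                         itertools.product( *map( range, SHAPE ) ),
                                                         chunksize = 100 ):
            map_tensor[:, i0, i1, i2] = vec          # re-collect the results back in the main process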
In case the add-on overhead costs do not get justified, keep the workflow without distributing it among several processes. If indeed curious and keen to experiment, you may try to run it inside thread-based concurrent processing, as GIL-free processing ( escaping, in those numpy parts which do not need to hold the GIL-lock, the costs of the GIL-driven re-[SERIAL]-isation ) may exhibit some latency-masking in overlaid concurrent processing. An in-vivo test will either prove this or show that there is no additional advantage available from doing this in the python as-is ecosystem.
Epilogue :
It would be nice & fair to give us your feedback on the runtime improvements, partly from the code-efficiency side ( starting from the initial ~ 27 minutes of the as-was state ) :
-- after the smart np.dot() trick,
-- + after a fully vectorised objective( x, arg0, arg1 ), where someFun() can process the vectors and smart-returns the np.dot() by itself,
-- + after a @numba.jit( ... )-decorated processing, if such results show themselves to be even more efficient.