Non-trivial questions can yield but simplified answers:
Performance improvements need a lot of attention, ~ [man*decades] of it.
...
Simply put, do not expect to read a few examples and become an expert in this.
No. 1: A poor algorithm will never improve just by some (semi-)automated transformation. Smart re-factoring may gain +100% in performance inside native, plain-python code ( an example below ), yet well-crafted code, matching the close-to-silicon properties of the code-execution device, will show such efforts in another light: the very code-improvement that brought the said ~ +100% performance boost ends up with almost the same performance once transformed into a jit-compiled unit. This means that premature optimisations may turn out useless when moving into carefully engineered high-performance code. At least you have been warned.
`python` is a great tool; I love it for its almost-infinitely precise maths. Yet, carving out both ultimate precision and ultimate performance at once seems closer to Heisenberg's uncertainty principle than Computer Scientists, and all the more their fans, are keen to admit. I have just spent a long time doing both, so as to be able to compress it into a few-words paragraph.
Q4: Does it make sense to `numba.jit()` all functions?
`numba` is a great tool for a stable code-base, so let's start with it: the low-hanging fruit of automated transformation is easily cherry-picked with the `numba.jit()` tools. Benchmarking will help you shave off almost all the overheads your code need not carry.
If your code depends on code-elements that the still-evolving `numba.jit()` code-transformers cannot transcode, you are done. Having worked with `numba` since its very first releases, `{ list | dict | class | ... }` are killers for any further dreams of having the code (auto-)converted a bit nearer to the silicon. Also, all the functions referred to from there have to be `numba.jit()`-able, so almost forget about having some high-level `import`-ed code-base easily translated by `numba`, if its original code was not systematically designed with `numba` in mind. A minimal sketch of that boundary follows.
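The boundary can be sketched like this ( a minimal sketch of my own, assuming a reasonably recent `numba` release; the function names are illustrative, not from the original code-base ):

```python
import numpy as np
import numba

@numba.jit( nopython = True )            # transcodes fine: plain loops over numpy arrays
def numba_friendly_sum( aVEC ):
    result = 0.
    for i in range( aVEC.shape[0] ):
        result += aVEC[i]
    return result

def numba_hostile_sum( aDICT ):          # fed by a python dict: forget the nopython-mode
    result = 0.
    for aKEY in aDICT:
        result += aDICT[aKEY]
    return result

print( numba_friendly_sum( np.arange( 10, dtype = np.float64 ) ) )   # 45.0
```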
Q5: Should I change all formats of my data to `float64`? What disadvantages could happen? ( at the moment they are "standard" numerics )
`float32` has, besides a halved static sizing of the memory-footprint in the `[SPACE]`-domain, a few principal disadvantages:
Some modules ( typically those that inherit from FORTRAN numerical solvers and similar heritage ) auto-convert any externally passed data into their local `float64` replicas ( so both the `[SPACE]` and `[TIME]` penalties grow, outside of your domain of control ).
Better expect an increased code-execution penalty in the `[TIME]`-domain, as non-aligned cell-boundaries are expensive ( this goes deep into the assembly-level of the code and into the CPU-instruction sets and cache-hierarchies, should one wish to master all details on this level ). You may see an almost ~ 3x slower execution on `float32` in the benchmarks below.
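Both the footprint and the silent re-conversion are easy to demonstrate ( a minimal sketch of my own, not part of the benchmarks below ):

```python
import numpy as np

aPV_32 = np.arange( 100, dtype = np.float32 )
aPV_64 = np.arange( 100, dtype = np.float64 )

print( aPV_32.nbytes, aPV_64.nbytes )    # 400 800 ... the halved [SPACE]-domain footprint

# mixed-precision maths silently up-converts into float64,
# so the [SPACE]-domain saving may evaporate outside your domain of control:
print( ( aPV_32 + aPV_64 ).dtype )       # float64
```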
Q6: Is there any checklist from which I can see when a loop can be vectorized?
Auto-vectorising transformers are nothing less than a Nobel-prize target.
Some tweaking can be done by smart and handy designers. Do not expect any low-hanging fruit in this domain for operations any more complex than trivial broadcasting or some easy-to-design `numpy` striding tricks ( one such trick is sketched below ).
Professional code-transformation packages are expensive ( someone has to pay for the expertise collected over many [man*years] ) and typically may adjust their ROI only if deployed at scale.
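One such easy-to-design `numpy` striding trick, sketched for a contiguous `float64` vector ( a hand-vectorisation of my own, no auto-vectorising transformer involved ):

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

aVEC = np.arange( 10, dtype = np.float64 )

# a plain-python loop: a moving sum over a window of 3
aSUM = np.array( [ aVEC[i:i+3].sum() for i in range( aVEC.size - 2 ) ] )

# the same via a striding trick: a zero-copy 2D view + one vectorised .sum()
aVIEW = as_strided( aVEC,
                    shape   = ( aVEC.size - 2, 3 ),
                    strides = ( aVEC.strides[0], aVEC.strides[0] ) )

assert np.allclose( aSUM, aVIEW.sum( axis = 1 ) )
```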
Q1: How can I identify which part of my code can be run in parallel?
You will be happy not to have to design your code to run in a true-`[PARALLEL]`, but in a "just"-`[CONCURRENT]` fashion. If someone says parallel, check whether the system indeed requires all conditions of a true-`[PARALLEL]` process-scheduling to be met; in the majority of cases a "just"-`[CONCURRENT]` scheduling is just what the speaker asked for ( details go beyond the scope of this post ). The python GIL-stepped locking prevents any but a sub-process-based split of the workflow, but at a cost, so harvest the independence of your processing, as this will reward your intentions without paying the penalty of additional overhead add-on costs, which the overhead-strict Amdahl's Law would otherwise punish.
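A minimal sub-process-based sketch, assuming the per-chunk work is indeed independent ( `process_one_chunk` and the chunking are illustrative, not from the original code ):

```python
import multiprocessing as mp

def process_one_chunk( aCHUNK ):                 # no shared state, no ordering dependence
    return sum( x * x for x in aCHUNK )          # any independent, CPU-bound work

if __name__ == '__main__':
    aListOfCHUNKs = [ range(    0, 1000 ),       # an independent split of the workflow
                      range( 1000, 2000 ),
                      range( 2000, 3000 ) ]
    with mp.Pool( processes = 3 ) as aPOOL:      # sub-process-based: escapes the GIL,
        results = aPOOL.map( process_one_chunk,  # but pays spawn + SER/DES add-on costs
                             aListOfCHUNKs )
    print( results )
```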
Benchmark, benchmark, benchmark:
The actual cost/effect ratio will have to be validated on the respective versions of python, numba and the CPU / cache architecture, so benchmarking is the only way to confirm any improvement ( at some cost ).
The following example shows the savings for a trivial Exponential Moving Average function, implemented in several more or less smart manners.
```python
def plain_EMA_fromPrice( N_period, aPriceVECTOR ):
    ...

@numba.jit( "float32[:]( int32, float32[:] )" )
def numba_EMA_fromPrice_float32( N_period, aPriceVECTOR ):
    ...

@numba.jit( "float32[:]( int32, float32[:] )", nopython = True, nogil = True )
def numba_EMA_fromPrice_float32_nopython( N_period, aPriceVECTOR ):
    ...

@numba.jit( "float64[:]( int64, float64[:] )" )
def numba_EMA_fromPrice_float64( N_period, aPriceVECTOR ):
    ...

@numba.jit( "float64[:]( int64, float64[:] )", nopython = True, nogil = True )
def numba_EMA_fromPrice_float64_nopython( N_period, aPriceVECTOR ):
    ...

def plain_EMA_fromPrice2( N_period, aPriceVECTOR ):
    ...

@numba.jit( "float32[:]( int32, float32[:] )" )
def numba_EMA_fromPrice2_float32( N_period, aPriceVECTOR ):
    ...

@numba.jit( "float32[:]( int32, float32[:] )", nopython = True, nogil = True )
def numba_EMA_fromPrice2_float32_nopython( N_period, aPriceVECTOR ):
    ...

@numba.jit( "float64[:]( int64, float64[:] )" )
def numba_EMA_fromPrice2_float64( N_period, aPriceVECTOR ):
    ...

@numba.jit( "float64[:]( int64, float64[:] )", nopython = True, nogil = True )
def numba_EMA_fromPrice2_float64_nopython( N_period, aPriceVECTOR ):
    ...
```
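The `...`-elided bodies were not published above; a classical EMA kernel of this kind may look like the following minimal sketch ( my own illustrative reconstruction, not the original code; the re-factoring actually used in `plain_EMA_fromPrice2()` was likewise not published ):

```python
import numpy as np

def plain_EMA_fromPrice( N_period, aPriceVECTOR ):
    alfa      = 2. / ( N_period + 1 )            # a classical EMA smoothing factor
    result    = np.empty_like( aPriceVECTOR )
    result[0] = aPriceVECTOR[0]
    for i in range( 1, aPriceVECTOR.shape[0] ):  # a recurrent loop, not trivially vectorisable
        result[i] = alfa * aPriceVECTOR[i] + ( 1. - alfa ) * result[i-1]
    return result
```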
Improving performance from `710 [us]` down to `160 [us]` by code-re-factoring and memory-alignment, and next down to `~ 12 ~ 17 [us]` by `numba.jit()`:
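The stopwatch `aClk` is not defined above; a minimal sketch with a compatible `.start()` / `.stop()` interface, reporting elapsed `[us]`, could read:

```python
import time

class Clk:                                       # a trivial [us]-resolution stopwatch
    def start( self ):
        self._t0 = time.perf_counter()
    def stop( self ):
        return int( ( time.perf_counter() - self._t0 ) * 1E6 )

aClk = Clk()
```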
```python
>>> aPV_32 = np.arange( 100, dtype = np.float32 )
>>> aPV_64 = np.arange( 100, dtype = np.float64 )
>>> aClk.start();_ = plain_EMA_fromPrice( 18, aPV_32 );aClk.stop()
715L
723L
712L
722L
975L
>>> aClk.start();_ = plain_EMA_fromPrice( 18, aPV_64 );aClk.stop()
220L
219L
216L
193L
212L
217L
218L
215L
217L
217L
>>> aClk.start();_ = numba_EMA_fromPrice_float32( 18, aPV_32 );aClk.stop()
199L
15L
16L
16L
17L
13L
16L
12L
>>> aClk.start();_ = numba_EMA_fromPrice_float64( 18, aPV_64 );aClk.stop()
170L
16L
16L
16L
18L
14L
16L
14L
17L
>>> aClk.start();_ = numba_EMA_fromPrice_float64_nopython( 18, aPV_64 );aClk.stop()
16L
17L
17L
16L
12L
16L
14L
16L
15L
>>> aClk.start();_ = plain_EMA_fromPrice2( 18, aPV_32 );aClk.stop()
648L
654L
662L
648L
647L
>>> aClk.start();_ = plain_EMA_fromPrice2( 18, aPV_64 );aClk.stop()
165L
166L
162L
162L
162L
163L
162L
162L
>>> aClk.start();_ = numba_EMA_fromPrice2_float32( 18, aPV_32 );aClk.stop()
43L
45L
43L
41L
41L
42L
>>> aClk.start();_ = numba_EMA_fromPrice2_float64( 18, aPV_64 );aClk.stop()
17L
16L
15L
17L
17L
17L
12L
>>> aClk.start();_ = numba_EMA_fromPrice2_float64_nopython( 18, aPV_64 );aClk.stop()
16L
15L
15L
14L
17L
15L
```