Non-trivial questions can yield but simplified answers:
Performance improvements need a lot of attention, ~ [man*decades] of it.
...
Simply put, do not expect to read a few examples and become an expert in this.
No. 1: A poor algorithm will never improve just by some (semi-)automated transformation. Smart re-factoring may gain +100% in performance inside native, plain-python code ( an example below ), yet well-crafted code, matching the close-to-silicon properties of the code-execution device, will show such efforts in another light: the very code-improvement that brought the said ~ +100% performance boost ends up with almost the same performance once transformed into a jit-compiled unit. This means that premature optimisations may turn out useless when moving into carefully engineered high-performance code. At least you have been warned.
`python` is a great tool; I love it for its almost-infinitely precise maths. Yet, carving out both ultimate precision and ultimate performance at once seems closer to Heisenberg's uncertainty principle than Computer Scientists, and all the more their fans, are keen to admit. I have just spent a long time doing both, so as to be able to compress it into a few-words paragraph.
Q4: Does it make sense to `numba.jit()` all functions?
`numba` is a great tool for a stable code-base, so let's start with it: the low-hanging fruit of automated transformation is easily cherry-picked with the `numba.jit()` tools. Benchmarking will help you shave off almost all the overheads your code need not carry.
If your code depends on code-elements that the still-evolving `numba.jit()` code-transformers cannot transcode, you are done. Having worked with `numba` since its very first releases, `{ list | dict | class | ... }` are killers for any further dreams of having the code (auto-)converted a bit nearer to the silicon. Also, all the functions referred to from there have to be `numba.jit()`-able, so almost forget about having some high-level `import`-ed code-base easily translated by `numba`, if its original code was not systematically designed with `numba` in mind. A minimal sketch of that boundary follows.
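The boundary can be sketched like this ( a minimal sketch of my own, assuming a reasonably recent `numba` release; the function names are illustrative, not from the original code-base ):

```python
import numpy as np
import numba

@numba.jit( nopython = True )            # transcodes fine: plain loops over numpy arrays
def numba_friendly_sum( aVEC ):
    result = 0.
    for i in range( aVEC.shape[0] ):
        result += aVEC[i]
    return result

def numba_hostile_sum( aDICT ):          # fed by a python dict: forget the nopython-mode
    result = 0.
    for aKEY in aDICT:
        result += aDICT[aKEY]
    return result

print( numba_friendly_sum( np.arange( 10, dtype = np.float64 ) ) )   # 45.0
```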
Q5: Should I change all formats of my data to `float64`? What disadvantages could happen? ( at the moment they are "standard" numerics )
`float32` has, besides a halved static sizing of the memory-footprint in the `[SPACE]`-domain, a few principal disadvantages:
Some modules ( typically those that inherit from FORTRAN numerical solvers and similar heritage ) auto-convert any externally passed data into their local `float64` replicas ( so both the `[SPACE]` and `[TIME]` penalties grow, outside of your domain of control ).
Better expect an increased code-execution penalty in the `[TIME]`-domain, as non-aligned cell-boundaries are expensive ( this goes deep into the assembly-level of the code and into the CPU-instruction sets and cache-hierarchies, should one wish to master all details on this level ). You may see an almost ~ 3x slower execution on `float32` in the benchmarks below.
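Both the footprint and the silent re-conversion are easy to demonstrate ( a minimal sketch of my own, not part of the benchmarks below ):

```python
import numpy as np

aPV_32 = np.arange( 100, dtype = np.float32 )
aPV_64 = np.arange( 100, dtype = np.float64 )

print( aPV_32.nbytes, aPV_64.nbytes )    # 400 800 ... the halved [SPACE]-domain footprint

# mixed-precision maths silently up-converts into float64,
# so the [SPACE]-domain saving may evaporate outside your domain of control:
print( ( aPV_32 + aPV_64 ).dtype )       # float64
```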
Q6: Is there any checklist from which I can see when a loop can be vectorized?
Auto-vectorising transformers are nothing less than a Nobel-prize target.
Some tweaking can be done by smart and handy designers. Do not expect any low-hanging fruit in this domain for operations any more complex than trivial broadcasting or some easy-to-design `numpy` striding tricks ( one such trick is sketched below ).
Professional code-transformation packages are expensive ( someone has to pay for the expertise collected over many [man*years] ) and typically may adjust their ROI only if deployed at scale.
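One such easy-to-design `numpy` striding trick, sketched for a contiguous `float64` vector ( a hand-vectorisation of my own, no auto-vectorising transformer involved ):

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

aVEC = np.arange( 10, dtype = np.float64 )

# a plain-python loop: a moving sum over a window of 3
aSUM = np.array( [ aVEC[i:i+3].sum() for i in range( aVEC.size - 2 ) ] )

# the same via a striding trick: a zero-copy 2D view + one vectorised .sum()
aVIEW = as_strided( aVEC,
                    shape   = ( aVEC.size - 2, 3 ),
                    strides = ( aVEC.strides[0], aVEC.strides[0] ) )

assert np.allclose( aSUM, aVIEW.sum( axis = 1 ) )
```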
Q1: How can I identify which part of my code can be run in parallel?
You will be happy not to have to design your code to run in a true-`[PARALLEL]`, but in a "just"-`[CONCURRENT]` fashion. If someone says parallel, check whether the system indeed requires all conditions of a true-`[PARALLEL]` process-scheduling to be met; in the majority of cases a "just"-`[CONCURRENT]` scheduling is just what the speaker asked for ( details go beyond the scope of this post ). The python GIL-stepped locking prevents any but a sub-process-based split of the workflow, but at a cost, so harvest the independence of your processing, as this will reward your intentions without paying the penalty of additional overhead add-on costs, which the overhead-strict Amdahl's Law would otherwise punish.
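A minimal sub-process-based sketch, assuming the per-chunk work is indeed independent ( `process_one_chunk` and the chunking are illustrative, not from the original code ):

```python
import multiprocessing as mp

def process_one_chunk( aCHUNK ):                 # no shared state, no ordering dependence
    return sum( x * x for x in aCHUNK )          # any independent, CPU-bound work

if __name__ == '__main__':
    aListOfCHUNKs = [ range(    0, 1000 ),       # an independent split of the workflow
                      range( 1000, 2000 ),
                      range( 2000, 3000 ) ]
    with mp.Pool( processes = 3 ) as aPOOL:      # sub-process-based: escapes the GIL,
        results = aPOOL.map( process_one_chunk,  # but pays spawn + SER/DES add-on costs
                             aListOfCHUNKs )
    print( results )
```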
Benchmark, benchmark, benchmark:
The actual cost/effect ratio will have to be validated on the respective versions of python, numba and the CPU / cache architecture, so benchmarking is the only way to confirm any improvement ( at some cost ).
The following example shows the savings for a trivial Exponential Moving Average function, implemented in several more or less smart manners.
```python
def plain_EMA_fromPrice( N_period, aPriceVECTOR ):
    ...

@numba.jit( "float32[:]( int32, float32[:] )" )
def numba_EMA_fromPrice_float32( N_period, aPriceVECTOR ):
    ...

@numba.jit( "float32[:]( int32, float32[:] )", nopython = True, nogil = True )
def numba_EMA_fromPrice_float32_nopython( N_period, aPriceVECTOR ):
    ...

@numba.jit( "float64[:]( int64, float64[:] )" )
def numba_EMA_fromPrice_float64( N_period, aPriceVECTOR ):
    ...

@numba.jit( "float64[:]( int64, float64[:] )", nopython = True, nogil = True )
def numba_EMA_fromPrice_float64_nopython( N_period, aPriceVECTOR ):
    ...

def plain_EMA_fromPrice2( N_period, aPriceVECTOR ):
    ...

@numba.jit( "float32[:]( int32, float32[:] )" )
def numba_EMA_fromPrice2_float32( N_period, aPriceVECTOR ):
    ...

@numba.jit( "float32[:]( int32, float32[:] )", nopython = True, nogil = True )
def numba_EMA_fromPrice2_float32_nopython( N_period, aPriceVECTOR ):
    ...

@numba.jit( "float64[:]( int64, float64[:] )" )
def numba_EMA_fromPrice2_float64( N_period, aPriceVECTOR ):
    ...

@numba.jit( "float64[:]( int64, float64[:] )", nopython = True, nogil = True )
def numba_EMA_fromPrice2_float64_nopython( N_period, aPriceVECTOR ):
    ...
```
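The `...`-elided bodies were not published above; a classical EMA kernel of this kind may look like the following minimal sketch ( my own illustrative reconstruction, not the original code; the re-factoring actually used in `plain_EMA_fromPrice2()` was likewise not published ):

```python
import numpy as np

def plain_EMA_fromPrice( N_period, aPriceVECTOR ):
    alfa      = 2. / ( N_period + 1 )            # a classical EMA smoothing factor
    result    = np.empty_like( aPriceVECTOR )
    result[0] = aPriceVECTOR[0]
    for i in range( 1, aPriceVECTOR.shape[0] ):  # a recurrent loop, not trivially vectorisable
        result[i] = alfa * aPriceVECTOR[i] + ( 1. - alfa ) * result[i-1]
    return result
```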
Improving performance from `710 [us]` down to `160 [us]` by code-re-factoring and memory-alignment, and next down to `~ 12 ~ 17 [us]` by `numba.jit()`:
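The stopwatch `aClk` is not defined above; a minimal sketch with a compatible `.start()` / `.stop()` interface, reporting elapsed `[us]`, could read:

```python
import time

class Clk:                                       # a trivial [us]-resolution stopwatch
    def start( self ):
        self._t0 = time.perf_counter()
    def stop( self ):
        return int( ( time.perf_counter() - self._t0 ) * 1E6 )

aClk = Clk()
```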
```python
>>> aPV_32 = np.arange( 100, dtype = np.float32 )
>>> aPV_64 = np.arange( 100, dtype = np.float64 )
>>> aClk.start();_ = plain_EMA_fromPrice( 18, aPV_32 );aClk.stop()
715L
723L
712L
722L
975L
>>> aClk.start();_ = plain_EMA_fromPrice( 18, aPV_64 );aClk.stop()
220L
219L
216L
193L
212L
217L
218L
215L
217L
217L
>>> aClk.start();_ = numba_EMA_fromPrice_float32( 18, aPV_32 );aClk.stop()
199L
15L
16L
16L
17L
13L
16L
12L
>>> aClk.start();_ = numba_EMA_fromPrice_float64( 18, aPV_64 );aClk.stop()
170L
16L
16L
16L
18L
14L
16L
14L
17L
>>> aClk.start();_ = numba_EMA_fromPrice_float64_nopython( 18, aPV_64 );aClk.stop()
16L
17L
17L
16L
12L
16L
14L
16L
15L
>>> aClk.start();_ = plain_EMA_fromPrice2( 18, aPV_32 );aClk.stop()
648L
654L
662L
648L
647L
>>> aClk.start();_ = plain_EMA_fromPrice2( 18, aPV_64 );aClk.stop()
165L
166L
162L
162L
162L
163L
162L
162L
>>> aClk.start();_ = numba_EMA_fromPrice2_float32( 18, aPV_32 );aClk.stop()
43L
45L
43L
41L
41L
42L
>>> aClk.start();_ = numba_EMA_fromPrice2_float64( 18, aPV_64 );aClk.stop()
17L
16L
15L
17L
17L
17L
12L
>>> aClk.start();_ = numba_EMA_fromPrice2_float64_nopython( 18, aPV_64 );aClk.stop()
16L
15L
15L
14L
17L
15L
```