
The code below is a contrived example that simulates my actual problem, which uses multiprocessing to speed up the code. It is run on Windows 10 64-bit, Python 3.7.5, and IPython 7.9.0.


The transformation functions (these will be used to transform arrays in `main()`):

from itertools import product
from functools import partial

from numba import njit, prange
import multiprocessing as mp
import numpy as np

@njit(parallel=True)
def transform_array_c(data, n):
    # contrived block-wise running-maximum transform (stands in for a heavier computation)
    ar_len = len(data)

    sec_max1 = np.empty(ar_len, dtype=data.dtype)
    sec_max2 = np.empty(ar_len, dtype=data.dtype)

    for i in prange(n - 1):
        sec_max1[i] = np.nan

    for sec in prange(ar_len // n):
        s2_max = data[n*sec + n - 1]
        # guard the read for the very last section, where n*sec + n == ar_len;
        # the value is only used inside the bounds-checked loop below
        s1_max = data[n*sec + n] if n*sec + n < ar_len else data[ar_len - 1]

        for i in range(n - 1, -1, -1):
            if data[n*sec + i] > s2_max:
                s2_max = data[n*sec + i]
            sec_max2[n*sec + i] = s2_max

        sec_max1[n*sec + n - 1] = sec_max2[n*sec]

        for i in range(n - 1):
            if n*sec + n + i < ar_len:
                if data[n*sec + n + i] > s1_max:
                    s1_max = data[n*sec + n + i]
                sec_max1[n*sec + n + i] = max(s1_max, sec_max2[n*sec + i + 1])
            else:
                break

    return sec_max1

@njit(error_model='numpy', cache=True)
def rt_mean_sq_dev(array1, array2, n):
    # rolling root-mean-square deviation of the last n values of array1 from array2[i],
    # maintained incrementally with running sums (K is a fixed reference point)
    msd_temp = np.empty(array1.shape[0])

    K = array2[n - 1]

    rs_x = array1[0] - K
    rs_xsq = rs_x * rs_x

    msd_temp[0] = np.nan

    for i in range(1, n):
        rs_x += array1[i] - K
        rs_xsq += np.square(array1[i] - K)
        msd_temp[i] = np.nan

    y_i = array2[n - 1] - K
    msd_temp[n - 1] = np.sqrt(max(y_i*y_i + (rs_xsq - 2*y_i*rs_x)/n, 0))

    for i in range(n, array1.shape[0]):
        rs_x = array1[i] - array1[i - n] + rs_x
        rs_xsq = np.square(array1[i] - K) - np.square(array1[i - n] - K) + rs_xsq
        y_i = array2[i] - K

        msd_temp[i] = np.sqrt(max(y_i*y_i + (rs_xsq - 2*y_i*rs_x)/n, 0))

    return msd_temp

@njit(cache=True)
def transform_array_a(data, n):
    # exponentially weighted moving average (e / w), with the first few entries masked as NaN
    result = np.empty(data.shape[0], dtype=data.dtype)
    alpharev = 1. - 2 / (n + 1)
    alpharev_exp = alpharev

    e = data[0]
    w = 1.

    if n == 2:
        result[0] = e
    else:
        result[0] = np.nan

    for i in range(1, data.shape[0]):
        w += alpharev_exp
        e = e*alpharev + data[i]

        if i > n - 3:
            result[i] = e / w
        else:
            result[i] = np.nan

        if alpharev_exp > 3e-307:
            alpharev_exp *= alpharev
        else:
            alpharev_exp = 0.   # stop decaying the weight before it underflows

    return result

The multiprocessing part

def func(tup, data):    # <-- the function to be run across all worker processes
    a_temp = a[tup[2][0]]

    idx1 = a_temp > a[tup[2][1]]
    idx2 = a_temp < b[(tup[2][1], tup[1][1])]

    c_final = c[tup[0][1]][idx1 | idx2]
    data_final = data[idx1 | idx2]

    return (tup[0][0], tup[1][0], *tup[2]), c_final[-1] - data_final[-1]

def setup(a_dict, b_dict, c_dict):    # initializer: expose the dictionaries to every worker
    global a, b, c
    a, b, c = a_dict, b_dict, c_dict

def main(a_arr, b_arr, c_arr, common_len):
    np.random.seed(0)
    data_array = np.random.normal(loc=24004, scale=500, size=common_len)

    a_size = a_arr[-1] + 1
    b_size = len(b_arr)
    c_size = len(c_arr)

    loop_combo = product(enumerate(c_arr),
                         enumerate(b_arr),
                         (n_tup for n_tup in product(np.arange(1, a_arr[-1]), a_arr) if n_tup[1] > n_tup[0])
                         )
    result = np.zeros((c_size, b_size, a_size - 1, a_size), dtype=np.float32)

    ###################################################
    # This part simulates the heavy computation in the actual problem

    a = {}
    b = {}
    c = {}

    for i in range(1, a_arr[-1] + 1):
        a[i] = transform_array_a(data_array, i)
        if i in a_arr:
            for j in b_arr:
                b[(i, j)] = rt_mean_sq_dev(data_array, a[i], i) / data_array * j

    for i in c_arr:
        c[i] = transform_array_c(data_array, i)

    ###################################################
    with mp.Pool(processes=mp.cpu_count() - 1,
                 initializer=setup,
                 initargs=[a, b, c]
                 ) as pool:
        mp_res = pool.imap_unordered(partial(func, data=data_array),
                                     loop_combo
                                     )

        for item in mp_res:
            result[item[0]] = item[1]

    return result


if __name__ == '__main__':
    mp.freeze_support()

    a_arr = np.arange(2, 44, 2)
    b_arr = np.arange(0.4, 0.8, 0.20)
    c_arr = np.arange(2, 42, 10)
    common_len = 440000

    final_res = main(a_arr, b_arr, c_arr, common_len)

For performance reasons, multiple shared "read only" dictionaries are used among all processes to avoid redundant computation (in the actual problem, total computation time drops by 40% after sharing the dictionaries among the processes). However, the RAM usage becomes absurdly high once the shared dictionaries are used in my actual problem: memory usage on my 6C/12T Windows machine goes from (8.2 GB peak, 5.0 GB idle) to (23.9 GB peak, 5.0 GB idle), which is too high a cost to pay for a 40% speed-up.

Is the high RAM usage unavoidable when sharing multiple pieces of data among processes is a must? What can be done to my code to make it as fast as possible while using as little memory as possible?

Thank you in advance


Note: I tried using `imap_unordered()` instead of `map()` because I heard it is supposed to reduce memory usage when the input iterable is large, but I honestly can't see an improvement in RAM usage. Maybe I have done something wrong here?
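For reference, this is the same pool call with an explicit `chunksize` (the value 64 is an arbitrary guess that would need tuning). As far as I understand, `imap_unordered()` mainly avoids buffering the full result list the way `map()` does; it does not shrink the per-worker copies of `a`, `b`, `c` created by the initializer, which is where most of the RAM goes, so on its own it cannot fix the peak usage:

    with mp.Pool(processes=mp.cpu_count() - 1,
                 initializer=setup,
                 initargs=[a, b, c]) as pool:
        mp_res = pool.imap_unordered(partial(func, data=data_array),
                                     loop_combo,
                                     chunksize=64)   # batch tasks to cut per-task IPC overhead
        for key, value in mp_res:                    # results arrive lazily, in completion order
            result[key] = value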


EDIT: Based on the feedback in the answers, I have changed the heavy-computation part of the code so that it looks less like a dummy and better resembles the computation in the actual problem.

  • You might be able to reduce memory consumption by using [`numpy.memmap`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.memmap.html) arrays. – martineau Nov 04 '19 at 22:48
  • @martineau in what part of my script do you think it is the best place to use `numpy.memmap` arrays? – mathguy Nov 04 '19 at 22:53
  • Replace the 'multiple shared "read only" dictionaries' with them. – martineau Nov 04 '19 at 22:55
  • @martineau I will give it a shot, hopefully the memory usage can be reduced without losing too much of the performance gain in the memory-speed trade-off. – mathguy Nov 04 '19 at 23:00
  • Memory-mapping can be quite fast because at the lowest levels the OS's support for them is utilized. – martineau Nov 04 '19 at 23:02
  • @martineau That's awesome. From what I saw in [the example](https://docs.scipy.org/doc/numpy/reference/generated/numpy.memmap.html) it seems it uses some sort of files to handle the stored arrays. Hopefully they can handle 3-4 50MB shared arrays. – mathguy Nov 04 '19 at 23:09
  • That's actually not all that much, esp if you have a lot of ram. Even if you don't have tons of it, memory-mapped files are handled by the OS's own page-fault handling mechanisms, which are usually highly-optimized. Here's some [additional information](https://pymotw.com/3/mmap/index.html) about them. – martineau Nov 04 '19 at 23:15
  • Just noticed that a new [`multiprocessing.shared_memory`](https://docs.python.org/3/library/multiprocessing.shared_memory.html) class was added in Python 3.8 — so that's another possibility. There's even an example in the documentation that "demonstrates a practical use of the `SharedMemory` class with NumPy arrays". – martineau Nov 05 '19 at 16:26
  • @martineau good to know there is a well documented solution to this kind of problem in python 3.8. Can't wait to upgrade to 3.8 next month because for the moment I can't upgrade it just yet. – mathguy Nov 05 '19 at 18:24
  • @martineau With all due respect, the O/P does not actively use any step, that would justify the costs of ( NUMA-expensive and any-to-any cross-core coordinated ) sharing of any published data-structure above. The solution of the HPC / computing problems is not in finding an easy way to pick the lowest-hanging-fruit ( by typing a cosy syntax-constructor that seems simple, is legal, yet is also **very inefficient** during run-time ), but rather in designing code tailor-made to achieving the very opposite - **to achieve the ultimate performance in run-time**. This is the problem solving step. – user3666197 Nov 05 '19 at 19:12
  • @user3666197: Likewise, with all due respect, I'm just continuing to address the titular question — which is a valid one, and is likely why most folks will even look at this question. – martineau Nov 05 '19 at 19:24
  • @mathguy I completely forgot to come back to this, are you still open to/looking for answers? – AMC Nov 12 '19 at 03:58
  • @AlexanderCécile yes, I am still looking for answer(s) with code examples. – mathguy Nov 12 '19 at 04:00
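
Following up on the `numpy.memmap` and `multiprocessing.shared_memory` suggestions in the comments above, here is a minimal sketch of the memmap variant; the file-naming scheme, the helper names `dump_to_memmap`/`setup_memmap` and the one-file-per-array layout are assumptions, not part of the original code. The idea is that only small metadata dicts are pickled to the workers, while the array data lives in one set of disk-backed files that all processes map read-only through the OS page cache:

    import numpy as np

    def dump_to_memmap(arr_dict, prefix):
        """Write every array of a dict to its own disk-backed memmap file and
        return only lightweight metadata (path, dtype, shape) per key."""
        meta = {}
        for idx, (key, arr) in enumerate(arr_dict.items()):
            path = f"{prefix}_{idx}.dat"
            mm = np.memmap(path, dtype=arr.dtype, mode='w+', shape=arr.shape)
            mm[:] = arr
            mm.flush()                      # make sure the data is on disk
            meta[key] = (path, arr.dtype, arr.shape)
        return meta

    def setup_memmap(a_meta, b_meta, c_meta):
        """Pool initializer: re-open the memmaps read-only in every worker, so
        the physical pages are shared via the OS page cache instead of copied."""
        global a, b, c
        a = {k: np.memmap(p, dtype=dt, mode='r', shape=sh) for k, (p, dt, sh) in a_meta.items()}
        b = {k: np.memmap(p, dtype=dt, mode='r', shape=sh) for k, (p, dt, sh) in b_meta.items()}
        c = {k: np.memmap(p, dtype=dt, mode='r', shape=sh) for k, (p, dt, sh) in c_meta.items()}

    # in main(), instead of initargs=[a, b, c]:
    #   initargs=[dump_to_memmap(a, 'a'), dump_to_memmap(b, 'b'), dump_to_memmap(c, 'c')]

On Python 3.8+, the `multiprocessing.shared_memory.SharedMemory` class mentioned in the comments plays the same role without touching disk: the parent creates a named block, copies each array into it once, and every worker attaches by name and wraps the buffer with `np.ndarray(shape, dtype, buffer=shm.buf)`.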

1 Answer


High Memory Usage when manipulating shared dictionaries in python multiprocessing run in Windows

It is fair to demystify the problem a bit before we move into details: there are no shared dictionaries in the original code, much less manipulated ones. Yes, each of `a`, `b`, `c` did get "assigned" a reference to `a_dict`, `b_dict`, `c_dict`, yet none of them is shared; they simply get replicated, as `multiprocessing` does on Windows-class O/S-es. There are no writes "into" the dict-s, just non-destructive reads from each of their replicas.

Similarly, `np.memmap()`-s make it possible to move part of the originally proposed data onto disk space (at the cost of doing so, plus bearing some (latency-masked) random-read latency of ~ 10 [ms] instead of ~ 0.5 [ns] if smart-aligned, vectorised memory-access patterns were designed into the performance hot-spot), yet no dramatic change of paradigm ought to be expected here, as the "external iterator" almost prevents any smart-aligned cache re-use.

Q: What can be done to my code to make it as fast as possible while using as little memory as possible?

The first sin was in using an 8 B `int64` to store one plain bit (no Qbits here yet ~ all salutes to the Burnaby Quantum R&D teams):

for i in c_arr:                                   # ~~ np.arange( 2, 42, 10 )
    np.random.seed( i )                           # ~  yields a deterministic state
    c[i] = np.random.poisson( size = common_len ) # ~  440,000 int64-s

This took 6 (processes) x 440000 x 8 B ~ 0.021 GB "smuggled" into all copies of dictionary `c`, whereas each and every such value is deterministically known and could be generated ALAP (as late as possible) inside the respective target process, just by knowing the value of `i` (indeed there is no need to pre-generate and replicate ~ 0.021 GB of data many times).
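A minimal sketch of that point, written against the original (pre-EDIT) definition of `c` quoted above; the helper name `get_c` and the per-process cache are assumptions:

    _c_cache = {}                                   # one copy per worker process, built lazily

    def get_c(i, common_len):
        """Re-create c[i] on demand inside the worker instead of shipping it
        from the parent; the result is deterministic because the seed is i."""
        if i not in _c_cache:
            np.random.seed(i)
            _c_cache[i] = np.random.poisson(size=common_len)
        return _c_cache[i]

    # inside func(): replace  c[tup[0][1]]  with  get_c(tup[0][1], len(data))

Whether this pays off depends on how expensive the per-`i` computation really is; as the question's EDIT and the comment below note, the real transforms are much heavier than `np.random.poisson`, in which case regenerating them in every worker trades RAM for repeated CPU time.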

So far, Windows-class O/S-es lack an `os.fork()` and thus do a full Python copy (yes, RAM..., yes, TIME) of as many replicated Python-interpreter sessions (plus importing the main module) as were requested by `multiprocessing` for process-based separation (done to avoid GIL-lock-ordered, pure-[SERIAL] code execution).


The best next step: re-factor the code for both efficiency and performance

Refactor the code so as to minimise the "shallow" (and expensive) use of the 6 processes that are "externally" commanded by a central iterator (the `loop_combo` "dictator" with ~ 18522 items, which repeats a call to the "remotely dispatched" `func(tup, data)` only to fetch a simple tuple of `((x, y, z), value)` and store one value into the central process's `float32` result array).

Try to increase the computing "density": re-factor the code in a divide-and-conquer manner, i.e. so that each of the `mp.Pool` processes computes, in one smooth block, a sizeable dedicated sub-space of the covered parameter-space (which is here iterated "from outside") and returns its block of results in one go. Performance will only improve by doing this (best without any form of expensive sharing).
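A minimal sketch of what such a divide-and-conquer split could look like for this contrived example. Splitting on `c_arr`, the helper names `setup_block`/`process_c_block`, and passing the parameter grids through the initializer are all assumptions about one possible layout, not the only one:

    def setup_block(a_dict, b_dict, c_dict, data, a_vals, b_vals):
        # extended initializer: workers also receive data_array and the parameter
        # grids once, instead of getting them re-pickled with every task
        global a, b, c, data_array, a_arr, b_arr
        a, b, c, data_array, a_arr, b_arr = a_dict, b_dict, c_dict, data, a_vals, b_vals

    def process_c_block(c_item):
        """One task = the whole (b, n1, n2) sub-space for a single c value,
        returned as one float32 block instead of thousands of single values."""
        c_idx, c_val = c_item
        a_size = a_arr[-1] + 1
        block = np.zeros((len(b_arr), a_size - 1, a_size), dtype=np.float32)
        c_full = c[c_val]

        for b_idx, b_val in enumerate(b_arr):
            for n1 in range(1, a_arr[-1]):
                a_temp = a[n1]
                for n2 in a_arr:
                    if n2 <= n1:
                        continue
                    idx = (a_temp > a[n2]) | (a_temp < b[(n2, b_val)])
                    block[b_idx, n1, n2] = c_full[idx][-1] - data_array[idx][-1]
        return c_idx, block

    # in main():
    # with mp.Pool(processes=min(len(c_arr), mp.cpu_count() - 1),
    #              initializer=setup_block,
    #              initargs=[a, b, c, data_array, a_arr, b_arr]) as pool:
    #     for c_idx, block in pool.imap_unordered(process_c_block, enumerate(c_arr)):
    #         result[c_idx] = block

With only `len(c_arr)` blocks the pool is under-subscribed here (4 tasks for up to 11 workers), so in the real problem one would split on a finer axis, e.g. `(c, b)` pairs or ranges of `n2`, so that every worker stays busy while still returning results in large blocks.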

This re-factoring will also avoid the parameter pickle/unpickle costs (add-on overheads, both one-time (in passing the unique parameter-set values) and repetitive (the ~ 18522-times repeated memory allocation, build-up and pickle/unpickle of a 440000-element array, due to a poor call-signature design)).
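Even without the full block refactor, the repetitive part of that cost can be removed by shipping `data_array` to each worker once through the initializer, rather than binding it into `functools.partial`, which ends up being pickled again and again as tasks are dispatched. A minimal sketch, reusing the question's names (the block sketch above already incorporates the same change):

    def setup(a_dict, b_dict, c_dict, data):
        global a, b, c, data_array
        a, b, c, data_array = a_dict, b_dict, c_dict, data

    def func(tup):                       # no per-task data argument any more
        a_temp = a[tup[2][0]]
        idx = (a_temp > a[tup[2][1]]) | (a_temp < b[(tup[2][1], tup[1][1])])
        return (tup[0][0], tup[1][0], *tup[2]), c[tup[0][1]][idx][-1] - data_array[idx][-1]

    # in main():
    #   with mp.Pool(..., initializer=setup, initargs=[a, b, c, data_array]) as pool:
    #       mp_res = pool.imap_unordered(func, loop_combo, chunksize=64)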

All these steps will improve your processing efficiency and reduce the unnecessary RAM allocations.

  • The first sin was due to the fact that I was trying to generate a bunch of arrays as computationally expensive as the actual array_transformation function I use in the actual problem, where I use different weighted moving averages in a, b, c. The i value in this contrived problem corresponds to the `moving_window_size`, because a slight difference in `i` (`moving_window_size`) will lead to a completely different array. Maybe I should rewrite this part in my post to truly reflect what the array_transformation function does. – mathguy Nov 05 '19 at 06:39
  • @mathguy Please do update your code if you can, I would love to take a look at this later :) – AMC Nov 05 '19 at 06:40
  • @AlexanderCécile greatly appreciate the help from you guys – mathguy Nov 05 '19 at 06:42
  • Pretty good answer, although a bit hard to digest. A brief summary at the end would be a great addition. – martineau Nov 05 '19 at 08:16
  • I'm still confused about "( i.e., that each of the `mp.pool`-processes computes in one smooth block some remarkably sized, dedicated sub-space of the parameter-space covered......)" From what I understand, it seems I need to make each worker in `mp.pool` compute exactly one `smooth` block of the parameter subspace (so 6 `smooth` blocks in total). Then how can I make `mp.pool` do that? Every multiprocessing guideline I see just tells me to throw the function and iterable in and call it a day. I am struggling to find clear examples of building a smooth block manually to throw into `mp.pool`. – mathguy Nov 05 '19 at 12:42
  • @martineau Thanks for both your compliment and advice. Small world again here. The FORTRAN IV manual was my 2nd Bible (the 1st was the HP-67 one - a very educational book, not just about programming; it was full of stories about design-thinking - from the Moon Lander simulator (sure, not as complex as the AGC-ferrite-ROM magic flown to the Mare Tranquillitatis inside the Eagle)). OMG, the time is so fast. Anyway, thank you for reminding me of those lovely days, Martin. **Let me wish you all the best, Man :o)** – user3666197 Nov 05 '19 at 13:51