optimise zarr array processing

Question

I have a list (mylist) of 80 zarr files, each holding a 5-D array with dimensions (T, F, B, Az, El) and shape [24x4096x2016x24x8].

I want to extract a slice of the data and compute a probability along some axes using the following function:

def GetPolarData(mylist, freq, FreqLo, FreqHi):
    '''
    Take the list of zarr files (T, F, B, Az, El), open them, and use the selected
    frequency range to return a list of arrays of Azimuth and Elevation probabilities.
    '''

    ChanIndx = FreqCut(FreqLo, FreqHi,freq)
    
    if len(ChanIndx) != 0:
        MyData = []
        for i in range(len(mylist)):
            print('Adding file {} : {}'.format(i,mylist[i][32:]))
            try:
                zarrf = xr.open_zarr(mylist[i], group = 'arr')
                m = zarrf.master.sum(dim = ['time','baseline'])
                m = m[ChanIndx].sum(dim = ['frequency'])

                c = zarrf.counter.sum(dim = ['time','baseline'])
                c = c[ChanIndx].sum(dim = ['frequency'])

                p = m.astype(float)/c.astype(float)

                MyData.append(p)

            except Exception as e:
                print(e)
                continue

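    else:
        print("Something went wrong in Frequency selection")
        # MyData is never created on this path, so return early
        # instead of reaching the summary prints below.
        return []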
                
    print("##########################################")
    print("This will be contribution to selected band")
    print("##########################################")

    print(f"Min {np.nanmin(MyData)*100:.3f}%  ")
    print(f"Max {np.nanmax(MyData)*100:.3f}%  ")
    print(f"Average {np.nanmean(MyData)*100:.3f}%  ")
    return(MyData)

I call the function as follows:

FreqLo = 470.
FreqHi = 854.
MyTVData =np.array(GetPolarData(AllZarrList,Freq, FreqLo, FreqHi))

It takes over 3 hours to run on a machine with 40 cores and 256 GB of RAM.

Is there a way to make this run faster?

Thank you

score 0 · Answer 1 · answered Mar 29 '22 at 16:25

It seems like you could take advantage of parallelization here: each array is read only once, and the files are all processed independently of each other.

xarray and similar libraries may already run parts of the computation in parallel, but for your application the multiprocessing library could help spread the work more evenly across cores, for example by handling one file per worker process.
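As a rough sketch of the one-file-per-process idea, something like the following could work. It uses `concurrent.futures` (a standard-library wrapper around process pools) rather than `multiprocessing` directly. The names `polar_probability`, `_call_safely` and `map_files` are hypothetical helpers, not part of the question's code. The sketch assumes `ChanIndx` holds integer positions along the `frequency` dimension, and it selects those channels before summing, which is a change from the question's code.

```python
import concurrent.futures as cf
from functools import partial


def polar_probability(path, chan_idx):
    """Per-file work from the question: sum master/counter, then divide."""
    import xarray as xr  # imported inside the worker, only when a file is opened

    ds = xr.open_zarr(path, group="arr")
    # Assumption: chan_idx holds integer positions along 'frequency'.
    # Selecting first avoids reducing channels that are thrown away later.
    sub = ds.isel(frequency=chan_idx)
    dims = ["time", "baseline", "frequency"]
    m = sub.master.sum(dim=dims)
    c = sub.counter.sum(dim=dims)
    # .values forces the computation here and returns a plain NumPy array,
    # which is small (Az x El) and easy to send back to the parent process.
    return (m.astype(float) / c.astype(float)).values


def _call_safely(worker, path):
    """Mirror the question's try/except: report the failure and skip the file."""
    try:
        return worker(path)
    except Exception as exc:
        print(f"Skipping {path}: {exc}")
        return None


def map_files(worker, paths, max_workers=8, executor_cls=cf.ProcessPoolExecutor):
    """Run worker(path) for every path in a pool; keep input order, drop failures."""
    with executor_cls(max_workers=max_workers) as pool:
        results = pool.map(partial(_call_safely, worker), paths)
        return [r for r in results if r is not None]
```

Called from under an `if __name__ == "__main__":` guard, `map_files(partial(polar_probability, chan_idx=ChanIndx), AllZarrList)` would return a list comparable to `MyData`. The default of 8 workers is arbitrary. The workload is probably dominated by reading from disk, so the best worker count has to be found by experiment and may be well below the number of cores.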

To find out where the time actually goes, use a profiler such as Python's profile library, which shows the most time-consuming parts of your code. I suggest running it on a single-process version of your code, since that is easier to profile.
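A minimal profiling sketch, assuming the standard-library `cProfile` module (the lower-overhead counterpart of `profile`, with the same interface) and a hypothetical helper named `profile_call`:

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=10, **kwargs):
    """Run func under cProfile; return (result, text report)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by cumulative time so the most expensive call chains come first,
    # and print only the `top` entries.
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()
```

For example, `data, report = profile_call(GetPolarData, AllZarrList[:2], Freq, FreqLo, FreqHi)` followed by `print(report)` would profile a run over just two files. Because `xr.open_zarr` returns lazily evaluated arrays by default, much of the time may be attributed to the lines that force the computation (the `np.nanmin`, `np.nanmax` and `np.nanmean` calls and the final `np.array`) rather than to the `sum` calls themselves.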
