
I am using the code below to calculate the standard deviation azimuthally for 2D arrays converted from images. However, the code takes a couple of minutes to run. Can anyone suggest how it can be made faster?

import numpy as np

def radial_profile(data, center):
    # Integer radial distance of every pixel from the given center
    y, x = np.indices(data.shape)
    r = np.sqrt((x - center[0])**2 + (y - center[1])**2)
    r = r.astype(np.int64)

    # One boolean mask and one fancy-indexing pass over the whole image per radius
    radialprofile = np.zeros(np.max(r) + 1)
    for i in range(np.max(r) + 1):
        radialprofile[i] = np.std(data[r == i])

    return radialprofile

data = np.random.randint(10, size=(4096, 4096))
std_azi = radial_profile(data, [2048,2048])

CPU times: total: 1min 1s
Wall time: 1min 1s

As you can see from the CPU and wall time, it takes at least a minute to run. I have 10,000 such astronomical images that need to be processed, so any suggestion on how this code can be made faster would be highly appreciated.

Bahauddin Omar
1 Answer


The main issue is that data[r == i] is evaluated more than 2000 times, which is very inefficient, especially since data is quite big.

One could use a group-by strategy so as to compute all the groups at once. That being said, Numpy does not provide a general group-by aggregation. AFAIK, there are packages for that, but we can design a more efficient implementation based on the specific properties of this code (e.g. the group indices are relatively small integers). We can implement this in Numba, using two passes for the sake of accuracy.
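Just to illustrate the group-by idea, a vectorized NumPy version could be built on np.bincount (this is only a sketch, not the solution proposed here; it removes the Python loop but still materializes several full-size temporary arrays, so it stays memory-bound):

import numpy as np

def radial_profile_bincount(data, center):
    # Integer radius of each pixel, used as the group index
    y, x = np.indices(data.shape)
    r = np.sqrt((x - center[0])**2 + (y - center[1])**2).astype(np.int64).ravel()

    counts = np.bincount(r)                               # pixels per radius
    mean = np.bincount(r, weights=data.ravel()) / counts  # mean per radius
    sqDev = (data.ravel() - mean[r])**2                   # second pass for accuracy
    variance = np.bincount(r, weights=sqDev) / counts
    return np.sqrt(variance)

This should already be much faster than the original loop, but the Numba version below avoids the large temporaries entirely.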

Here is the resulting code:

import numpy as np
import numba as nb

# The signature assumes a C-contiguous int32 image; adapt the dtype to your data
@nb.njit('(int32[:,::1], UniTuple(int32,2))')
def radial_profile(data, center):
    dh, dw = data.shape
    cx, cy = center
    rMax = np.int64(np.sqrt(max(cx**2 + cy**2, cx**2+(dh-cy)**2, (dw-cx)**2+cy**2, (dw-cx)**2+(dh-cy)**2)))

    # Step 1: computes the mean by group
    sumByGroup = np.zeros(rMax + 1)
    groupSize = np.zeros(rMax + 1, dtype=np.int64)
    for y in range(dh):
        for x in range(dw):
            r = np.int64(np.sqrt((x - cx)**2 + (y - cy)**2))
            sumByGroup[r] += data[y, x]
            groupSize[r] += 1
    meanByGroup = sumByGroup / groupSize

    # Step 2: computes the variance by group based on the mean by group
    sqSumByGroup = np.zeros(rMax + 1)
    for y in range(dh):
        for x in range(dw):
            r = np.int64(np.sqrt((x - cx)**2 + (y - cy)**2))
            sqSumByGroup[r] += (data[y, x] - meanByGroup[r])**2
    varianceByGroup = sqSumByGroup / groupSize

    return np.sqrt(varianceByGroup)

std_azi = radial_profile(data, (2048,2048))

Performance results

Here are the results on my machine with an i5-9600KF processor:

Initial code:  33154 ms
New code:         63 ms

The new code is 526 times faster than the initial one.

Note that this code is sequential, but each of your images can be processed in parallel. The resulting parallel code will scale very well since this code is clearly compute-bound; a sketch is given below.
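For instance, spreading the 10,000 images over all cores with a process pool could look roughly like this (the file names and the loading function are placeholders, adapt them to your data):

from concurrent.futures import ProcessPoolExecutor
import numpy as np

def process_one(path):
    # Placeholder loader: make sure the array matches the int32 C-contiguous signature
    data = np.load(path).astype(np.int32)
    return radial_profile(data, (2048, 2048))

if __name__ == '__main__':
    paths = [f'image_{i:05d}.npy' for i in range(10_000)]  # hypothetical file list
    with ProcessPoolExecutor() as pool:
        profiles = list(pool.map(process_one, paths, chunksize=16))

Each worker process compiles the Numba function once and then reuses it, so the compilation overhead is negligible over 10,000 images.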

Jérôme Richard
  • I was curious and checked why you do not cache `r = np.int64(np.sqrt((x - cx)**2 + (y - cy)**2))`. Found it strange that calculating sqrt (and other things) is faster than caching. – dankal444 Jun 23 '23 at 20:10
  • Good question! This is because RAM is slow and `sqrt` can be computed directly by dedicated hardware units (I think Numba uses scalar instructions rather than SIMD ones here, which are not much slower). Recent Intel CPUs can compute 1 `sqrt` every 6 cycles, or 4 `sqrt` every 12 cycles! Note that the int/float conversions are quite expensive. Meanwhile, the cached array needs to be computed, stored, and then read twice from memory. The stores will be pretty expensive because of page faults (and low-level CPU cache allocation effects). – Jérôme Richard Jun 23 '23 at 21:17
  • When run in parallel, the sqrt computation will scale nearly perfectly while the RAM will quickly be saturated (only a few cores can saturate it). In the end, caching can be faster on some machines in sequential code, but almost never in parallel. On top of that, the flop/byte ratio of computing units tends to increase over time -- see [memory wall](https://en.wikipedia.org/wiki/Random-access_memory#Memory_wall). Thus, compute-bound codes will certainly run even faster on future machines compared to memory-bound ones. Recomputing data is not so rare in new HPC codes because of that, especially on GPUs. – Jérôme Richard Jun 23 '23 at 21:23