pandas rolling window parallelize problem while using numba engine

Question

I have a huge dataframe and I need to calculate a slope over rolling windows in pandas. The code below works fine, but it looks like numba is not able to parallelize it. Is there another way to parallelize it or make it more efficient?

import numpy as np
import pandas as pd

def slope(x):
    # x is the raw NumPy array of values in the current window
    length = len(x)
    if length < 2:
        return np.nan
    # slope between the first and last value of the window
    return (x[-1] - x[0]) / (length - 1)

df = pd.DataFrame({"id": [1,1,1,1,1,2,2,2,2,2,2], "a": [1,3,2,4,5,6,3,5,8,12,30], "b": range(10,21)})
df.groupby("id", as_index=False).rolling(min_periods=2, window=5).apply(slope, raw=True, engine="numba", engine_kwargs={"parallel": True})

I get the following warning message:

The keyword argument 'parallel=True' was specified but no transformation for parallel execution was possible.

To find out why, try turning .....

score 0 · Answer 1 · answered Apr 15 '23 at 11:49

parallel=True is meant to do two things:

it enables code containing loops manually parallelized with prange to actually run in parallel;
it automatically parallelizes operations that are known to have parallel semantics. To quote the documentation:

Some operations inside a user defined function, e.g. adding a scalar value to an array, are known to have parallel semantics. A user program may contain many such operations and while each operation could be parallelized individually, such an approach often has lackluster performance due to poor cache behavior. Instead, with auto-parallelization, Numba attempts to identify such operations in a user program, and fuse adjacent ones together, to form one or more kernels that are automatically run in parallel.
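To illustrate the second point, here is a minimal sketch of a function made only of whole-array operations, which is the kind of code the quoted passage describes as a candidate for automatic parallelization. The function name scale_and_shift and its parameters are made up for this example, and the try/except fallback is only there so the snippet still runs without Numba installed:

```python
import numpy as np

try:
    from numba import njit
except ImportError:
    # Fallback (assumption for this sketch): run as plain Python without Numba.
    def njit(*args, **kwargs):
        def wrap(func):
            return func
        return wrap

@njit(parallel=True)
def scale_and_shift(x, factor, offset):
    # Two element-wise array operations with parallel semantics.
    # With parallel=True, Numba may fuse them into a single parallel kernel.
    return x * factor + offset

values = np.arange(8, dtype=np.float64)
shifted = scale_and_shift(values, 2.0, 1.0)
```

Whether the fusion actually happens depends on the Numba version and on the code, which is why the explicit approach is often preferred.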

That being said, Numba emits the warning here because the function has no prange loop and contains nothing it can parallelize automatically. Explicit parallelization is often more efficient, especially in non-trivial cases. It can help reduce the number of temporary arrays that prevent an application from scaling (because of allocations and the use of slow shared caches/DRAM). It also ensures the function is actually parallelized, since the automatic process may simply fail in non-trivial cases and produce sequential code.

For example, here, you can remove this warning by using a manual for loop with prange. Using a loop reduces the number of temporary arrays by one, so the code should probably be faster on machines where memory bandwidth is limited (i.e. most machines). You can also optimize the code by precomputing 1.0/(length-1) and multiplying the array by this precomputed value (this is not done by default unless fast-math is enabled, which is also known to be quite unsafe since it breaks the IEEE-754 standard).
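As a rough sketch of the manual-loop idea, the rolling slope can be written as one prange loop over the positions of a single 1D array, bypassing the pandas numba engine entirely. The function name rolling_slope and its signature are assumptions made for this example, it handles one group at a time, and the fallback to plain Python exists only so the snippet runs without Numba installed:

```python
import numpy as np

try:
    from numba import njit, prange
except ImportError:
    # Fallback (assumption for this sketch): run sequentially without Numba.
    prange = range
    def njit(*args, **kwargs):
        def wrap(func):
            return func
        return wrap

@njit(parallel=True)
def rolling_slope(values, window, min_periods):
    n = values.shape[0]
    out = np.empty(n, dtype=np.float64)
    # Each output position is independent, so the loop can run in parallel.
    for i in prange(n):
        start = max(i - window + 1, 0)   # first index of the window ending at i
        length = i - start + 1           # number of values in that window
        if length < min_periods or length < 2:
            out[i] = np.nan
        else:
            out[i] = (values[i] - values[start]) / (length - 1)
    return out

# Column 'a' for id == 1 from the question.
group_a = np.array([1, 3, 2, 4, 5], dtype=np.float64)
slopes = rolling_slope(group_a, 5, 2)
# slopes is [nan, 2.0, 0.5, 1.0, 1.0]
```

One possible way to apply it per group is to call it on each group's values converted to a float NumPy array, for instance through groupby(...).transform. Parallelism only pays off when each array is large enough to amortize the thread overhead.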
