
The execution time of this code is too long:

df.rolling(window=255).apply(myFunc)

My dataframe's shape is (500, 10000):

                   0         1 ... 9999
2021-11-01  0.011111  0.054242 
2021-11-04  0.025244  0.003653 
2021-11-05  0.524521  0.099521 
2021-11-06  0.054241  0.138321 
...

For each date, the calculation uses the last 255 date values. myFunc looks like this:

import numpy as np

def myFunc(x):
    coefs = ...
    return np.sqrt(np.sum(x ** 2 * coefs))

I tried to use swifter, but performance is the same:

import swifter
df.swifter.rolling(window=255).apply(myFunc)

I also tried Dask, but I think I didn't understand it well, because performance is not much better:

import dask.dataframe as dd
ddf = dd.from_pandas(df, npartitions=8)
ddf = ddf.rolling(window=255).apply(myFunc, raw=False)
result = ddf.compute()

I didn't manage to parallelize the execution with partitions. How can I use Dask to improve performance? I'm on Windows.

NCall
  • We need a piece of the dataframe, the function `myFunc`, and the result you want to achieve. As it stands, the question is too general. – Сергей Кох Nov 18 '22 at 09:33
  • The question is a little too broad, but if you use Linux you can try pandarallel and its parallel_apply function. It uses all the cores you have available, but it's a little buggy on Windows. – maxxel_ Nov 18 '22 at 09:37
  • I updated my question. I'm on Windows. – NCall Nov 18 '22 at 10:59
  • dask.dataframe partitions row-wise. You're doing a rolling operation (so cross-row communication is needed), broadcast across a huge number of columns. This is just the opposite of how dask is structured. To use dask in this context, I would manually segment your dataframe into many smaller dataframes in groups of columns, then map the operation across them; you could use `dask.distributed`'s `client.map` with a threaded scheduler to share the dataframe in memory (a sketch of this follows these comments). If you put together a real [mre] then we could set this up. – Michael Delgado Nov 18 '22 at 18:45
  • You could also just do this with numpy or `numba`; it seems to be a very straightforward operation along a single axis of a large homogeneous array. – Michael Delgado Nov 18 '22 at 18:47
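
A minimal sketch of the column-group approach described in the comment above, assuming the `df` and `myFunc` from the question; the group count of 20 and `raw=True` are illustrative choices, not part of the original suggestion:

import numpy as np
import pandas as pd
from dask.distributed import Client

# Threaded, in-process scheduler so the dataframe is shared in memory
client = Client(processes=False)

# Split the 10,000 columns into groups and roll each group separately
col_groups = np.array_split(df.columns, 20)

def roll_group(cols):
    # plain pandas rolling apply on one slice of the columns
    return df[cols].rolling(window=255).apply(myFunc, raw=True)

futures = client.map(roll_group, col_groups)
result = pd.concat(client.gather(futures), axis=1)

Because `myFunc` is a Python-level function, the GIL limits how much a threaded scheduler can help, which is part of why the numba answer below is so much faster.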

2 Answers


This can be done using numpy+numba pretty efficiently.

Quick MRE:

import numpy as np, pandas as pd, numba

df = pd.DataFrame(
    np.random.random(size=(500, 10000)),
    index=pd.date_range("2021-11-01", freq="D", periods=500)
)

coefs = np.random.random(size=255)

Write the function using pure numpy operations and simple loops, making use of numba.njit(parallel=True) and numba.prange:

@numba.njit(parallel=True)
def numba_func(values, coefficients):
    # define result array: size of original, minus length of
    # coefficients, + 1
    result_tmp = np.zeros(
        shape=(values.shape[0] - len(coefficients) + 1, values.shape[1]),
        dtype=values.dtype,
    )

    result_final = np.empty_like(result_tmp)

    # nested for loops are your friend with numba!
    # (you must unlearn what you have learned)
    for j in numba.prange(values.shape[1]):
        for i in range(values.shape[0] - len(coefficients) + 1):
            for k in range(len(coefficients)):
                result_tmp[i, j] += values[i + k, j] ** 2 * coefficients[k]

        result_final[:, j] = np.sqrt(result_tmp[:, j])

    return result_final

This runs very quickly:

In [5]: %%time
   ...: result = pd.DataFrame(
   ...:     numba_func(df.values, coefs),
   ...:     index=df.index[len(coefs) - 1:],
   ...: )
   ...:
   ...:
CPU times: user 1.69 s, sys: 40.9 ms, total: 1.73 s
Wall time: 844 ms

Note: I'm a huge fan of dask. But the first rule of dask performance is don't use dask. If it's small enough to fit comfortably into memory, you'll usually get the best performance from tuning your pandas or numpy operations and leveraging speedups from cython, numba, etc. And once a problem is big enough to move to dask, these same tuning rules apply to the operations you perform on dask chunks/partitions, too!
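
For comparison, the same rolling weighted sum can also be written in pure numpy (no numba) with a sliding-window view; a minimal sketch that reuses the `df` and `coefs` from the MRE above and assumes numpy >= 1.20:

from numpy.lib.stride_tricks import sliding_window_view

# Strided view of every length-255 window along the date axis:
# shape (246, 10000, 255); the windowed data is not copied.
windows = sliding_window_view(df.values ** 2, len(coefs), axis=0)

# Weighted sum over each window, then the square root, as in myFunc
result = pd.DataFrame(
    np.sqrt(np.einsum("ijk,k->ij", windows, coefs)),
    index=df.index[len(coefs) - 1:],
)

The einsum iterates over the strided view in buffered chunks, so the full (246, 10000, 255) windowed array never has to be materialized.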

Michael Delgado

First, since you are using numpy functions, specify the parameter raw=True. Toy example:

import pandas as pd
import numpy as np

def foo(x):
    coefs = 2
    return np.sqrt(np.sum(x ** 2 * coefs))    

df = pd.DataFrame(np.random.random((500, 10000)))

%%time
res = df.rolling(250).apply(foo)

Wall time: 359.3 s

# with raw=True
%%time
res = df.rolling(250).apply(foo, raw=True)

Wall time: 15.2 s

You can also easily parallelize your calculations using the parallel-pandas library. Only two additional lines of code!

# pip install parallel-pandas
import pandas as pd
import numpy as np
from parallel_pandas import ParallelPandas

#initialize parallel-pandas
ParallelPandas.initialize(n_cpu=8, disable_pr_bar=True)

def foo(x):
    coefs = 2
    return np.sqrt(np.sum(x ** 2 * coefs))    

df = pd.DataFrame(np.random.random((500, 1000)))

# p_apply - is parallel analogue of apply method
%%time
res = df.rolling(250).p_apply(foo, raw=True, executor='processes')

Wall time: 2.2 s

With engine='numba':

%%time
res = df.rolling(250).p_apply(foo, raw=True, executor='processes', engine='numba')

Wall time: 1.2 s

Total speedup is 359/1.2 ~ 300!

padu