Performing integer-based rolling window grouby using Python Polars

Question

I have a outer/inner loop-based function I'm trying to vectorise using Python Polars DataFrames. The function is a type of moving average and will be used to filter time-series financial data. Here's the function:

def ma_j(df_src: pl.DataFrame, depth: float):

    jrc04 = 0.0
    jrc05 = 0.0
    jrc06 = 0.0
    jrc08 = 0.0

    series = df_src['close']

    for x in range(0, len(series)):
        if x >= x - depth*2:
            for k in np.arange(start=math.ceil(depth), stop=0, step=-1):
                jrc04 = jrc04 + abs(series[x-k] - series[x-(k+1)])
                jrc05 = jrc05 + (depth + k) * abs(series[x-k] - series[x-(k+1)])
                jrc06 = jrc06 + series[x-(k+1)]
        else:
            jrc03 = abs(series - (series[1]))
            jrc13 = abs(series[x-depth] - series[x - (depth+1)])
            jrc04 = jrc04 - jrc13 + jrc03
            jrc05 = jrc05 - jrc04 + jrc03 * depth
            jrc06 = jrc06 - series[x - (depth+1)] + series[x-1]
        jrc08 = abs(depth * series[x] - jrc06)

    if jrc05 == 0.0:
        ma = 0.0
    else:
        ma = jrc08/jrc05

    return ma

The tricky bit for me are multiple the inner loop look-backs (for k in...). I've looked through multiple examples that use groupby_dynamic on the timeseries data. For example, here. I've also seen an example for groupby_rolling, but this still seems to use a period.

However, I'd like to strip away the timeseries and just use source Series. Does this mean I need to group on an integer range?

Using this data example:

import polars as pl
import numpy as np

i, t, v = np.arange(0, 50, 1), np.arange(0, 100, 2), np.random.randint(1,101,50)
df = pl.DataFrame({"i": i, "t": t, "rand": v})
df = df.with_column((pl.datetime(2022,10,30) + pl.duration(seconds=df["t"])).alias("datetime")).drop("t")
cols = ["i", "datetime", "rand"]
df = df.select(cols)

DataFrame looks like this:

shape: (50, 3)
┌─────┬─────────────────────┬──────┐
│ i   ┆ datetime            ┆ rand │
│ --- ┆ ---                 ┆ ---  │
│ i32 ┆ datetime[μs]        ┆ i32  │
╞═════╪═════════════════════╪══════╡
│ 0   ┆ 2022-10-30 00:00:00 ┆ 27   │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 1   ┆ 2022-10-30 00:00:02 ┆ 82   │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2   ┆ 2022-10-30 00:00:04 ┆ 22   │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 3   ┆ 2022-10-30 00:00:06 ┆ 58   │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ ... ┆ ...                 ┆ ...  │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 46  ┆ 2022-10-30 00:01:32 ┆ 39   │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 47  ┆ 2022-10-30 00:01:34 ┆ 48   │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 48  ┆ 2022-10-30 00:01:36 ┆ 26   │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 49  ┆ 2022-10-30 00:01:38 ┆ 53   │
└─────┴─────────────────────┴──────┘

...I can do a grouping by datetime like this":

df.groupby_dynamic("datetime", every="10s").agg([
    pl.col("v0").mean().alias('rolling mean')
])

which gives this:

But there's 3 issues with this:

I don't want to group of datetime...I want to group on every row (maybe i?) in bins of [x] size.
I need values against every row
I would like to define the aggregation function, as per the various cases in the function above

Any tips on how I could attack this using Polars? Thanks.

---------- Edit 1

Following @ritchie46 's awesome advice (thanks mate!), here's the groupby:

result_grp = (
    df
    .groupby_rolling(index_column="i", period="10i")
    .agg(
        [
            pl.count().alias("rolling_slots"),
            pl.col("rand").mean().alias("roll_mean")
        ]
    )
)

df2 = df.select(
    [
        pl.all(),
        result_grp.get_column("rolling_slots"),
        result_grp.get_column("roll_mean"),
    ]
)

This now gives:

shape: (50, 5)
┌─────┬─────────────────────┬──────┬───────────────┬───────────┐
│ i   ┆ datetime            ┆ rand ┆ rolling_slots ┆ roll_mean │
│ --- ┆ ---                 ┆ ---  ┆ ---           ┆ ---       │
│ i32 ┆ datetime[μs]        ┆ i32  ┆ u32           ┆ f64       │
╞═════╪═════════════════════╪══════╪═══════════════╪═══════════╡
│ 0   ┆ 2022-10-30 00:00:00 ┆ 55   ┆ 1             ┆ 55.0      │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 1   ┆ 2022-10-30 00:00:02 ┆ 52   ┆ 2             ┆ 53.5      │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 2   ┆ 2022-10-30 00:00:04 ┆ 30   ┆ 3             ┆ 45.666667 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 3   ┆ 2022-10-30 00:00:06 ┆ 63   ┆ 4             ┆ 50.0      │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ ... ┆ ...                 ┆ ...  ┆ ...           ┆ ...       │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 46  ┆ 2022-10-30 00:01:32 ┆ 51   ┆ 10            ┆ 68.3      │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 47  ┆ 2022-10-30 00:01:34 ┆ 94   ┆ 10            ┆ 69.6      │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 48  ┆ 2022-10-30 00:01:36 ┆ 26   ┆ 10            ┆ 68.6      │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 49  ┆ 2022-10-30 00:01:38 ┆ 56   ┆ 10            ┆ 64.8      │
└─────┴─────────────────────┴──────┴───────────────┴───────────┘

This is great; now instead of mean(), how do I apply a custom function on the grouped values, such as:

f_jparams(depth_array, jrc04, jrc05, jrc06, jrc08):
    
    _depth = len(depth_array)
    
    if len(depth_array) > 3:
        for x in np.arange(start=1, stop=len(depth_array), step=1):
            jrc04 = jrc04 + abs(depth_array[x] - depth_array[x-1])
            jrc05 = jrc05 + (_depth+x) * abs(depth_array[x] - depth_array[x-1])
            jrc06 = jrc06 + depth_array[x-1]
    else:
        jrc03 = abs(depth_array[_depth-1] - depth_array[_depth-2])
        jrc13 = abs(depth_array[0] - depth_array[1])
        jrc04 = jrc04 - jrc13 + jrc03
        jrc05 = jrc05 - jrc04 + jrc03*_depth
        jrc06 = jrc06 - depth_array[1] + depth_array[_depth-2]
        
    jrc08 = abs(_depth * depth_array[0] - jrc06)
    
    if jrc05 == 0.0:
        ma = 0.0
    else:
        ma = jrc08/jrc05
        
    return ma, jrc04, jrc05, jrc06, jrc08

Thanks!

---- Edit 2:

Thanks to this post, I can collect up the items in the rand rolling group into a list for each row:

depth = 10

result_grp = (
    df
    .groupby_rolling(
        index_column="i", 
        period=str(depth) + "i",
        # offset="0i",
        # closed="left"
    )
    .agg(
        [
            pl.count().alias("rolling_slots"),
            pl.col("rand").mean().alias("roll_mean"),
            pl.col("rand").suffix('_val_list'),
        ]
    )
)

df2 = df.select(
    [
        pl.all(),
        result_grp.get_column("rolling_slots"),
        result_grp.get_column("roll_mean"),
        result_grp.get_column("rand_val_list"),
    ]
)

Also from this post, I saw a way to make the rolling window period a variable; nice!

Is there a way to use get_columns and exclude together so I don't have to list every col I want?

The dataframe now looks like:

shape: (50, 6)
┌─────┬─────────────────────┬──────┬───────────────┬───────────┬──────────────────┐
│ i   ┆ datetime            ┆ rand ┆ rolling_slots ┆ roll_mean ┆ rand_val_list    │
│ --- ┆ ---                 ┆ ---  ┆ ---           ┆ ---       ┆ ---              │
│ i32 ┆ datetime[μs]        ┆ i32  ┆ u32           ┆ f64       ┆ list[i32]        │
╞═════╪═════════════════════╪══════╪═══════════════╪═══════════╪══════════════════╡
│ 0   ┆ 2022-10-30 00:00:00 ┆ 64   ┆ 1             ┆ 64.0      ┆ [64]             │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1   ┆ 2022-10-30 00:00:02 ┆ 80   ┆ 2             ┆ 72.0      ┆ [64, 80]         │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2   ┆ 2022-10-30 00:00:04 ┆ 23   ┆ 3             ┆ 55.666667 ┆ [64, 80, 23]     │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3   ┆ 2022-10-30 00:00:06 ┆ 30   ┆ 4             ┆ 49.25     ┆ [64, 80, ... 30] │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ... ┆ ...                 ┆ ...  ┆ ...           ┆ ...       ┆ ...              │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 46  ┆ 2022-10-30 00:01:32 ┆ 25   ┆ 8             ┆ 22.625    ┆ [38, 32, ... 25] │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 47  ┆ 2022-10-30 00:01:34 ┆ 69   ┆ 8             ┆ 26.5      ┆ [32, 12, ... 69] │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 48  ┆ 2022-10-30 00:01:36 ┆ 72   ┆ 8             ┆ 31.5      ┆ [12, 3, ... 72]  │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 49  ┆ 2022-10-30 00:01:38 ┆ 100  ┆ 8             ┆ 42.5      ┆ [3, 26, ... 100] │
└─────┴─────────────────────┴──────┴───────────────┴───────────┴──────────────────┘

Should I just now resort back to looping through the rand_val_list column and send each grouped values list to my function? Or is there a better polars way?

Thanks again!

Sorry for the images...I can't get my df to print out nicely inline with text. — Paul, Jan 02 '23 at 11:45
To combine the result - it looks like you want to `.join()` on `id`? e.g. `df.join(df.groupby_rolling(..).agg(...), on="id")` -- For `rand_val_list` you can use `.arr.eval()` process list columns. Perhaps [the answers here](https://stackoverflow.com/questions/72689281/index-operation-on-list-column-data-in-polars) are helpful. — jqurious, Jan 03 '23 at 13:59

score 1 · Accepted Answer · answered Jan 02 '23 at 15:51

Are you searching for periods="10i"? Polars groupby_rolling accepts a period argument with the following query language:

        - 1ns   (1 nanosecond)
        - 1us   (1 microsecond)
        - 1ms   (1 millisecond)
        - 1s    (1 second)
        - 1m    (1 minute)
        - 1h    (1 hour)
        - 1d    (1 day)
        - 1w    (1 week)
        - 1mo   (1 calendar month)
        - 1y    (1 calendar year)
        - 1i    (1 index count)

Where i is simply the number of indices/rows.

So on your data a rolling groupby where we count the number of slots would give:

(df.groupby_rolling(index_column="i", period="10i")
   .agg([
       pl.count().alias("rolling_slots")
   ])
)

run GroupbyRollingExec
shape: (50, 2)
┌─────┬───────────────┐
│ i   ┆ rolling_slots │
│ --- ┆ ---           │
│ i64 ┆ u32           │
╞═════╪═══════════════╡
│ 0   ┆ 1             │
│ 1   ┆ 2             │
│ 2   ┆ 3             │
│ 3   ┆ 4             │
│ ... ┆ ...           │
│ 46  ┆ 10            │
│ 47  ┆ 10            │
│ 48  ┆ 10            │
│ 49  ┆ 10            │
└─────┴───────────────┘

Yes, thanks Ritchie. This gives me the grouping I'm after, where I can use `depth` as the window size. Do you have any thoughts on how I can execute a custom function on the grouped values? — Paul, Jan 02 '23 at 20:27

score 0 · Answer 2 · edited Jan 02 '23 at 17:10

def ma_j(df_src: pl.DataFrame, depth: float):
    jrc04 = 0.0
    jrc05 = 0.0
    jrc06 = 0.0
    jrc08 = 0.0

    series = df_src['close']

    for x in range(0, len(series)):
        if x >= x - depth*2:
            for k in np.arange(start=math.ceil(depth), stop=0, step=-1):
                jrc04 = jrc04 + abs(series[x-k] - series[x-(k+1)])
                jrc05 = jrc05 + (depth + k) * abs(series[x-k] - series[x-(k+1)])
                jrc06 = jrc06 + series[x-(k+1)]
        else:
            jrc03 = abs(series - (series[1]))
            jrc13 = abs(series[x-depth] - series[x - (depth+1)])
            jrc04 = jrc04 - jrc13 + jrc03
            jrc05 = jrc05 - jrc04 + jrc03 * depth
            jrc06 = jrc06 - series[x - (depth+1)] + series[x-1]
        jrc08 = abs(depth * series[x] - jrc06)

    if jrc05 == 0.0:
        ma = 0.0
    else:
        ma = jrc08/jrc05

    return ma

Your answer could be improved by adding more information on what the code does and how it helps the OP. — Tyler2P, Jan 02 '23 at 17:09

Performing integer-based rolling window grouby using Python Polars

---------- Edit 1

---- Edit 2:

2 Answers2