
I have a large dataframe that I am running a groupby_dynamic operation on, followed by an aggregation that combines the columns in each group down to a single boolean value.

Ultimately, I have to do this operation many times, as it is part of a non-gradient solver loop. On each pass, some parameters to the aggregation change, but the time-based groups are always the same.

This is a short example of what my dataframe looks like:

    ┌────────┬─────────────────────────┬───────┬───────┬───────┐
    │ person ┆ Recorded                ┆ read1 ┆ read2 ┆ read3 │
    │ ---    ┆ ---                     ┆ ---   ┆ ---   ┆ ---   │
    │ str    ┆ datetime[μs]            ┆ f64   ┆ f64   ┆ f64   │
    ╞════════╪═════════════════════════╪═══════╪═══════╪═══════╡
    │ A      ┆ 2020-08-21 04:43:17.910 ┆ 0.7   ┆ 2.3   ┆ 3.1   │
    │ A      ┆ 2020-08-21 06:10:44.990 ┆ 3.7   ┆ 3.7   ┆ 4.1   │
    │ A      ┆ 2020-08-21 08:41:48.380 ┆ -0.9  ┆ 0.3   ┆ 0.8   │
    │ A      ┆ 2020-08-21 10:41:26.870 ┆ -0.9  ┆ 0.3   ┆ 0.7   │
    │ A      ┆ 2020-08-26 02:20:05.480 ┆ 0.8   ┆ 3.2   ┆ 5.4   │
    │ B      ┆ 2020-08-26 03:56:14.780 ┆ -0.7  ┆ 0.8   ┆ 1.9   │
    │ B      ┆ 2020-08-26 05:49:38.260 ┆ -1.1  ┆ 0.2   ┆ 0.9   │
    │ B      ┆ 2020-08-26 07:38:26.990 ┆ -1.0  ┆ 0.5   ┆ 1.4   │
    │ B      ┆ 2020-08-26 09:31:52.780 ┆ -0.5  ┆ 2.1   ┆ 4.0   │
    │ B      ┆ 2020-08-27 03:08:07.550 ┆ -0.8  ┆ 0.6   ┆ 1.4   │
    └────────┴─────────────────────────┴───────┴───────┴───────┘
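For reproducibility, here is how the toy frame above can be constructed (values copied straight from the table):

    import polars as pl

    # Toy dataframe matching the table above
    df_tbl = pl.DataFrame(
        {
            "person": ["A"] * 5 + ["B"] * 5,
            "Recorded": [
                "2020-08-21 04:43:17.910", "2020-08-21 06:10:44.990",
                "2020-08-21 08:41:48.380", "2020-08-21 10:41:26.870",
                "2020-08-26 02:20:05.480", "2020-08-26 03:56:14.780",
                "2020-08-26 05:49:38.260", "2020-08-26 07:38:26.990",
                "2020-08-26 09:31:52.780", "2020-08-27 03:08:07.550",
            ],
            "read1": [0.7, 3.7, -0.9, -0.9, 0.8, -0.7, -1.1, -1.0, -0.5, -0.8],
            "read2": [2.3, 3.7, 0.3, 0.3, 3.2, 0.8, 0.2, 0.5, 2.1, 0.6],
            "read3": [3.1, 4.1, 0.8, 0.7, 5.4, 1.9, 0.9, 1.4, 4.0, 1.4],
        }
    ).with_columns(
        pl.col("Recorded").str.strptime(pl.Datetime("us"), "%Y-%m-%d %H:%M:%S%.3f")
    )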

And the operation is something like:

    res = (
        df_tbl.lazy()
        .groupby_dynamic(index_column='Recorded', by='person',
                         period=window, offset=-window, every='1h', closed='both')
        .agg(
            expr_func(pl.col('read1'), param1, param2)  # kernel applied here
        )
        .select(
            [
                pl.col('person'),
                # realign group time with RIGHT edge of query window
                pl.col('Recorded').dt.offset_by(f"{int(window.total_seconds())}s").alias('t_query'),
                pl.col('read1').alias('ind'),
            ]
        )
        .collect()  # end of lazy eval
    )

where my expr_func uses columnar operations for the thresholding logic and the sum aggregation, like:

    def expr_func(col1, N, Z):
        # True (as 1) when at least N values in the window are >= Z
        return ((col1 >= Z).sum() >= N).cast(pl.Int32)
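
The remaining names in the snippets are solver inputs; some hypothetical values, just so the examples are self-contained:

    from datetime import timedelta

    # Hypothetical values only, so the snippets above are runnable
    window = timedelta(hours=12)    # width of the sliding query window
    param1, param2 = 2, 1.0         # N (count threshold) and Z (value threshold)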

Profiling my actual operation, I see that most of the time is spent in the groupby_dynamic itself:

    ┌───────────────────────────────────┬────────┬────────┐
    │ node                              ┆ start  ┆ end    │
    │ ---                               ┆ ---    ┆ ---    │
    │ str                               ┆ u64    ┆ u64    │
    ╞═══════════════════════════════════╪════════╪════════╡
    │ optimization                      ┆ 0      ┆ 978    │
    │ groupby_dynamic(person)           ┆ 978    ┆ 128170 │
    │ projection(person, t_query, …     ┆ 128337 ┆ 136391 │
    └───────────────────────────────────┴────────┴────────┘
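
For reference, that timing table comes from ending the query with profile() instead of collect(), e.g. (projection step omitted here for brevity):

    # Same query, ended with .profile() instead of .collect(); prof holds the node timings
    res, prof = (
        df_tbl.lazy()
        .groupby_dynamic(index_column='Recorded', by='person',
                         period=window, offset=-window, every='1h', closed='both')
        .agg(expr_func(pl.col('read1'), param1, param2))
        .profile()
    )
    print(prof)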

Is there a way around this, where I can do the groupby once, then do my aggregation after-the-fact?

I can, indeed, just do the aggregation into lists:

    (
        df_tbl.lazy()
        .groupby_dynamic(index_column='Recorded', by='person',
                         period=window, offset=-window, every='1h', closed='both')
        .agg(
            [pl.col('read1'), pl.col('read2'), pl.col('read3')]
        )
        .collect()
    )

Which gives me the output in the form I expect:

    ┌────────┬─────────────────────┬──────────────────┬─────────────────┬─────────────────┐
    │ person ┆ Recorded            ┆ read1            ┆ read2           ┆ read3           │
    │ ---    ┆ ---                 ┆ ---              ┆ ---             ┆ ---             │
    │ str    ┆ datetime[μs]        ┆ list[f64]        ┆ list[f64]       ┆ list[f64]       │
    ╞════════╪═════════════════════╪══════════════════╪═════════════════╪═════════════════╡
    │ A      ┆ 2020-08-20 11:00:00 ┆ [0.7]            ┆ [2.3]           ┆ [3.1]           │
    │ A      ┆ 2020-08-20 12:00:00 ┆ [0.7]            ┆ [2.3]           ┆ [3.1]           │
    │ A      ┆ 2020-08-20 13:00:00 ┆ [0.7, 3.7]       ┆ [2.3, 3.7]      ┆ [3.1, 4.1]      │
    │ A      ┆ 2020-08-20 14:00:00 ┆ [0.7, 3.7]       ┆ [2.3, 3.7]      ┆ [3.1, 4.1]      │
    │ A      ┆ 2020-08-20 15:00:00 ┆ [0.7, 3.7, -0.9] ┆ [2.3, 3.7, 0.3] ┆ [3.1, 4.1, 0.8] │
    │ A      ┆ 2020-08-20 16:00:00 ┆ [0.7, 3.7, -0.9] ┆ [2.3, 3.7, 0.3] ┆ [3.1, 4.1, 0.8] │
    │ …      ┆ …                   ┆ …                ┆ …               ┆ …               │
    │ B      ┆ 2020-08-26 22:00:00 ┆ [-0.8]           ┆ [0.6]           ┆ [1.4]           │
    │ B      ┆ 2020-08-26 23:00:00 ┆ [-0.8]           ┆ [0.6]           ┆ [1.4]           │
    │ B      ┆ 2020-08-27 00:00:00 ┆ [-0.8]           ┆ [0.6]           ┆ [1.4]           │
    │ B      ┆ 2020-08-27 01:00:00 ┆ [-0.8]           ┆ [0.6]           ┆ [1.4]           │
    │ B      ┆ 2020-08-27 02:00:00 ┆ [-0.8]           ┆ [0.6]           ┆ [1.4]           │
    │ B      ┆ 2020-08-27 03:00:00 ┆ [-0.8]           ┆ [0.6]           ┆ [1.4]           │
    └────────┴─────────────────────┴──────────────────┴─────────────────┴─────────────────┘

But now I don't see how to do operations on those per-row lists using the pl.Expr syntax, or something similar to what I used in expr_func above.

That is, how do I apply the operations "all at once", as if columnar, per group, on the list values?
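
Something along these lines is the flavor of what I am after, if it exists. This is only a rough sketch, assuming a Polars version with the .list.eval / .list.sum namespace, where df_lists stands in for the list-aggregated frame from the previous snippet and param1/param2 are the hypothetical N/Z from above:

    # Rough sketch: apply the expr_func logic to each row's list of readings
    res = df_lists.with_columns(
        (
            pl.col('read1')
            .list.eval((pl.element() >= param2).cast(pl.Int32))  # 1 where reading >= Z
            .list.sum()                                          # count of such readings
            >= param1                                            # at least N of them?
        )
        .cast(pl.Int32)
        .alias('ind')
    )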

I tried something like adding a groupnum column enumerating the groups, then exploding, then re-grouping by groupnum and doing the agg as in the original implementation ... but I could not get the explode to work on the int32 value columns :/
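
Roughly, the shape of what I was attempting (groupnum is just a hypothetical name for the enumerating column, and it is the explode step that fell over for me):

    # Enumerate the dynamic groups, explode back to one row per reading,
    # then re-group and apply the original kernel
    res = (
        df_lists
        .with_row_count('groupnum')                     # hypothetical group index
        .explode(['read1', 'read2', 'read3'])           # one row per reading again
        .groupby('groupnum', maintain_order=True)
        .agg(expr_func(pl.col('read1'), param1, param2))
    )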

I did not even go down the route of using apply, since it is always disastrously slow.

Any help on this would be much appreciated. This is just a toy version of the problem; ultimately, a high-dimensional (neural-network) parameter space will be searched with the non-gradient solver, so major speedups here will mean a lot.

  • Maybe you could use `polars.DataFrame.upsample` to generate your DataFrame first. It probably makes sense anyway because groupby_dynamic is taking the most time. – skedaddle_waznook Jul 23 '23 at 15:54
  • @skedaddle_waznook That is not quite the operation that I am trying to do here. Instead, I am just searching within a sliding window, and expect to get a varying number of points within each window. That actually comes up in the "kernel" function that I am applying, where the count of points implicitly matters, via the `>= N` part (with the threshold condition, too). – Andrew P. Jul 23 '23 at 16:05
