I have a large dataframe that I am doing a groupby_dynamic operation on, followed by an aggregation that combines the columns in each group down to a single boolean value.
Ultimately, I have to do this operation many times, as it is part of a non-gradient solver loop. On each pass, some parameters to the aggregation are different, but the time-based groups are always the same.
This is a short example of what my dataframe looks like:
┌────────┬─────────────────────────┬───────┬───────┬───────┐
│ person ┆ Recorded ┆ read1 ┆ read2 ┆ read3 │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ datetime[μs] ┆ f64 ┆ f64 ┆ f64 │
╞════════╪═════════════════════════╪═══════╪═══════╪═══════╡
│ A ┆ 2020-08-21 04:43:17.910 ┆ 0.7 ┆ 2.3 ┆ 3.1 │
│ A ┆ 2020-08-21 06:10:44.990 ┆ 3.7 ┆ 3.7 ┆ 4.1 │
│ A ┆ 2020-08-21 08:41:48.380 ┆ -0.9 ┆ 0.3 ┆ 0.8 │
│ A ┆ 2020-08-21 10:41:26.870 ┆ -0.9 ┆ 0.3 ┆ 0.7 │
│ A ┆ 2020-08-26 02:20:05.480 ┆ 0.8 ┆ 3.2 ┆ 5.4 │
│ B ┆ 2020-08-26 03:56:14.780 ┆ -0.7 ┆ 0.8 ┆ 1.9 │
│ B ┆ 2020-08-26 05:49:38.260 ┆ -1.1 ┆ 0.2 ┆ 0.9 │
│ B ┆ 2020-08-26 07:38:26.990 ┆ -1.0 ┆ 0.5 ┆ 1.4 │
│ B ┆ 2020-08-26 09:31:52.780 ┆ -0.5 ┆ 2.1 ┆ 4.0 │
│ B ┆ 2020-08-27 03:08:07.550 ┆ -0.8 ┆ 0.6 ┆ 1.4 │
└────────┴─────────────────────────┴───────┴───────┴───────┘
And the operation is something like:
res = (
    df_tbl.lazy()
    .groupby_dynamic(index_column='Recorded', by='person',
                     period=window, offset=-window, every='1h', closed='both')
    .agg(
        expr_func(pl.col('read1'), param1, param2)  # kernel applied here
    )
    .select(
        [
            pl.col('person'),
            # realign group time with RIGHT edge of query window
            pl.col('Recorded').dt.offset_by(f"{int(window.total_seconds())}s").alias('t_query'),
            pl.col('read1').alias('ind'),
        ]
    )
    .collect()  ## end of lazy eval
)
Where my expr_func uses columnar operations for logic and sum aggregation, like:
def expr_func(col, N, Z):
    return ((col >= Z).sum() >= N).cast(pl.Int32)
Profiling my actual operation, I see that a large amount of time is spent actually doing the groupby_dynamic:
┌───────────────────────────────────┬────────┬────────┐
│ node ┆ start ┆ end │
│ --- ┆ --- ┆ --- │
│ str ┆ u64 ┆ u64 │
╞═══════════════════════════════════╪════════╪════════╡
│ optimization ┆ 0 ┆ 978 │
│ groupby_dynamic(person) ┆ 978 ┆ 128170 │
│ projection(person, t_query, … ┆ 128337 ┆ 136391 │
└───────────────────────────────────┴────────┴────────┘
Is there a way around this, where I can do the groupby once, then do my aggregation after the fact on each solver pass?
I can, indeed, just do the aggregation into lists:
(
    df_tbl.lazy()
    .groupby_dynamic(index_column='Recorded', by='person',
                     period=window, offset=-window, every='1h', closed='both')
    .agg(
        [pl.col('read1'), pl.col('read2'), pl.col('read3')]
    )
    .collect()
)
Which gives me the output in the form I expect:
┌────────┬─────────────────────┬──────────────────┬─────────────────┬─────────────────┐
│ person ┆ Recorded ┆ read1 ┆ read2 ┆ read3 │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ datetime[μs] ┆ list[f64] ┆ list[f64] ┆ list[f64] │
╞════════╪═════════════════════╪══════════════════╪═════════════════╪═════════════════╡
│ A ┆ 2020-08-20 11:00:00 ┆ [0.7] ┆ [2.3] ┆ [3.1] │
│ A ┆ 2020-08-20 12:00:00 ┆ [0.7] ┆ [2.3] ┆ [3.1] │
│ A ┆ 2020-08-20 13:00:00 ┆ [0.7, 3.7] ┆ [2.3, 3.7] ┆ [3.1, 4.1] │
│ A ┆ 2020-08-20 14:00:00 ┆ [0.7, 3.7] ┆ [2.3, 3.7] ┆ [3.1, 4.1] │
│ A ┆ 2020-08-20 15:00:00 ┆ [0.7, 3.7, -0.9] ┆ [2.3, 3.7, 0.3] ┆ [3.1, 4.1, 0.8] │
│ A ┆ 2020-08-20 16:00:00 ┆ [0.7, 3.7, -0.9] ┆ [2.3, 3.7, 0.3] ┆ [3.1, 4.1, 0.8] │
│ … ┆ … ┆ … ┆ … ┆ … │
│ B ┆ 2020-08-26 22:00:00 ┆ [-0.8] ┆ [0.6] ┆ [1.4] │
│ B ┆ 2020-08-26 23:00:00 ┆ [-0.8] ┆ [0.6] ┆ [1.4] │
│ B ┆ 2020-08-27 00:00:00 ┆ [-0.8] ┆ [0.6] ┆ [1.4] │
│ B ┆ 2020-08-27 01:00:00 ┆ [-0.8] ┆ [0.6] ┆ [1.4] │
│ B ┆ 2020-08-27 02:00:00 ┆ [-0.8] ┆ [0.6] ┆ [1.4] │
│ B ┆ 2020-08-27 03:00:00 ┆ [-0.8] ┆ [0.6] ┆ [1.4] │
└────────┴─────────────────────┴──────────────────┴─────────────────┴─────────────────┘
But now I don't have a concept of how to do operations on those lists in each row, using the pl.Expr syntax (or something similar) that I used in expr_func above. That is, how do I do the "all at once" operations, as if columnar, per group, on the list values?
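In pseudo-form, what I am after is something like the sketch below, where grouped is the list-valued frame from the snippet above and param1/param2 are the per-pass parameters as before (I am assuming the .list namespace here; older Polars versions call it .arr):
def expr_func_list(col, N, Z):
    # per row: count the list elements >= Z, then test that count against N
    return (
        col.list.eval((pl.element() >= Z).cast(pl.Int32)).list.sum() >= N
    ).cast(pl.Int32)

res = grouped.with_columns(
    expr_func_list(pl.col('read1'), param1, param2).alias('ind')
)
That way I would pay for the groupby_dynamic once, and each solver iteration would only run the cheap elementwise pass.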
I tried adding a groupnum column enumerating the groups, then exploding on it, re-grouping by it, and doing the agg as in the original implementation ... but I could not get the explode to work on the int32 value columns :/
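For reference, the attempt looked roughly like this (again with grouped as the list-valued frame; the groupnum name is illustrative):
res = (
    grouped.lazy()
    .with_row_count('groupnum')             # one id per (person, window) row
    .explode(['read1', 'read2', 'read3'])   # back to one row per reading
    .groupby('groupnum', maintain_order=True)
    .agg(expr_func(pl.col('read1'), param1, param2).alias('ind'))
    .collect()
)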
I did not even go down the route of using apply, since it is always disastrously slow.
Any help on this would be appreciated and really beneficial. This is just a toy version of the problem; ultimately, a high-dimensional (neural network) parameter space will be searched with the non-gradient solver, so major speedups will mean a lot.