I am trying to compute various statistics on groups of time-series data using the duration of each point (the time until the next point). I would like the duration of the last point in a group to be the time until the upper boundary of that group.
Crucially, I want this to happen in the lazy context, without materializing the entire DataFrame.
Consider code like the following:
use polars::prelude::*;
let dates = Series::new(
    "date_str",
    [
        "2020-01-01 00:01:00",
        "2020-01-01 00:03:50",
        "2020-01-01 00:04:10",
        "2020-01-01 00:06:50",
        "2020-01-01 00:07:00",
        "2020-01-01 00:09:50",
    ],
);
let df = dates.into_frame().lazy().with_column(
    col("date_str")
        .str()
        .strptime(StrpTimeOptions {
            date_dtype: DataType::Datetime(TimeUnit::Milliseconds, None),
            fmt: Some("%Y-%m-%d %H:%M:%S".into()),
            strict: true,
            exact: true,
            cache: true,
            tz_aware: false,
            utc: false,
        })
        .alias("date"),
);
dbg!(df.clone().collect().unwrap());
let grp = df.clone().groupby_dynamic(
    [],
    DynamicGroupOptions {
        index_column: "date".into(),
        every: Duration::parse("3m"),
        period: Duration::parse("3m"),
        offset: Duration::parse("0s"),
        truncate: false,
        include_boundaries: true,
        closed_window: ClosedWindow::Both,
        start_by: StartBy::DataPoint,
    },
);
let out_df = grp
    .clone()
    .agg([col("date")
        .diff(1, NullBehavior::Ignore)
        .min()
        .alias("min_duration")])
    .collect()
    .unwrap();
dbg!(out_df);
which gives the final output:
┌─────────────────────┬─────────────────────┬─────────────────────┬──────────────┐
│ _lower_boundary     ┆ _upper_boundary     ┆ date                ┆ min_duration │
│ ---                 ┆ ---                 ┆ ---                 ┆ ---          │
│ datetime[ms]        ┆ datetime[ms]        ┆ datetime[ms]        ┆ duration[ms] │
╞═════════════════════╪═════════════════════╪═════════════════════╪══════════════╡
│ 2020-01-01 00:01:00 ┆ 2020-01-01 00:04:00 ┆ 2020-01-01 00:01:00 ┆ 2m 50s       │
│ 2020-01-01 00:04:00 ┆ 2020-01-01 00:07:00 ┆ 2020-01-01 00:04:10 ┆ 10s          │
│ 2020-01-01 00:07:00 ┆ 2020-01-01 00:10:00 ┆ 2020-01-01 00:07:00 ┆ 2m 50s       │
└─────────────────────┴─────────────────────┴─────────────────────┴──────────────┘
I would instead want every group to give a min_duration of 10s, since every group has a point 10s before its upper boundary.
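To make the desired result concrete, here is a minimal sketch in plain Rust, without Polars, of the computation I am after. It is only an illustration under stated assumptions: min_durations, start, and every are hypothetical names, timestamps are plain seconds since midnight, and the windows are half-open [lower, upper) rather than ClosedWindow::Both, since a point lying exactly on the upper boundary would otherwise get a duration of zero.

```rust
/// Minimum duration per window, where a point's duration is the time until the
/// next point in the same window, or until the window's upper boundary for the
/// last point. Timestamps are plain seconds; `points` must be sorted ascending.
/// Returns (lower_boundary, upper_boundary, min_duration) per non-empty window.
fn min_durations(points: &[i64], start: i64, every: i64) -> Vec<(i64, i64, i64)> {
    assert!(every > 0, "window length must be positive");
    let mut out = Vec::new();
    let last = match points.last() {
        Some(&t) => t,
        None => return out,
    };
    let mut lower = start;
    while lower <= last {
        let upper = lower + every;
        // Assumption: half-open window [lower, upper), not ClosedWindow::Both.
        let mut times: Vec<i64> = points
            .iter()
            .copied()
            .filter(|&t| lower <= t && t < upper)
            .collect();
        // Treat the upper boundary as a sentinel "next point" for the last point.
        times.push(upper);
        // For an empty window only the sentinel is present, so windows(2) yields nothing.
        if let Some(min) = times.windows(2).map(|w| w[1] - w[0]).min() {
            out.push((lower, upper, min));
        }
        lower = upper;
    }
    out
}
```

With the six timestamps above expressed as seconds since midnight, min_durations(&[60, 230, 250, 410, 420, 590], 60, 180) returns three windows, each with a minimum duration of 10 seconds.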
As far as I know, you cannot access the boundaries that Polars computes from inside the aggregation. Something like:
let out_df = grp
    .agg([col("date")
        .append(col("_upper_boundary"), false)
        .diff(1, NullBehavior::Ignore)
        .min()
        .alias("min_duration")])
    .collect()
    .unwrap();
dbg!(out_df);
will error, as the _upper_boundary column is not available before you collect.