
I am trying to compute various statistics on groups of time series data, using the duration of each point (the time until the next point). I would like the duration of the last point in a group to be the time until the group's upper boundary.

Crucially, I want this to happen in the lazy context, without materializing the entire dataframe.

Consider code like the following:

    use polars::prelude::*;

    let dates = Series::new(
        "date_str",
        [
            "2020-01-01 00:01:00",
            "2020-01-01 00:03:50",
            "2020-01-01 00:04:10",
            "2020-01-01 00:06:50",
            "2020-01-01 00:07:00",
            "2020-01-01 00:09:50",
        ],
    );
    let df = dates.into_frame().lazy().with_column(
        col("date_str")
            .str()
            .strptime(StrpTimeOptions {
                date_dtype: DataType::Datetime(TimeUnit::Milliseconds, None),
                fmt: Some("%Y-%m-%d %H:%M:%S".into()),
                strict: true,
                exact: true,
                cache: true,
                tz_aware: false,
                utc: false,
            })
            .alias("date"),
    );
    dbg!(df.clone().collect().unwrap());

    let grp = df.clone().groupby_dynamic(
        [],
        DynamicGroupOptions {
            index_column: "date".into(),
            every: Duration::parse("3m"),
            period: Duration::parse("3m"),
            offset: Duration::parse("0s"),
            truncate: false,
            include_boundaries: true,
            closed_window: ClosedWindow::Both,
            start_by: StartBy::DataPoint,
        },
    );

    let out_df = grp
        .clone()
        .agg([col("date")
            .diff(1, NullBehavior::Ignore)
            .min()
            .alias("min_duration")])
        .collect()
        .unwrap();
    dbg!(out_df);

which gives the final output:

┌─────────────────────┬─────────────────────┬─────────────────────┬──────────────┐
│ _lower_boundary     ┆ _upper_boundary     ┆ date                ┆ min_duration │
│ ---                 ┆ ---                 ┆ ---                 ┆ ---          │
│ datetime[ms]        ┆ datetime[ms]        ┆ datetime[ms]        ┆ duration[ms] │
╞═════════════════════╪═════════════════════╪═════════════════════╪══════════════╡
│ 2020-01-01 00:01:00 ┆ 2020-01-01 00:04:00 ┆ 2020-01-01 00:01:00 ┆ 2m 50s       │
│ 2020-01-01 00:04:00 ┆ 2020-01-01 00:07:00 ┆ 2020-01-01 00:04:10 ┆ 10s          │
│ 2020-01-01 00:07:00 ┆ 2020-01-01 00:10:00 ┆ 2020-01-01 00:07:00 ┆ 2m 50s       │
└─────────────────────┴─────────────────────┴─────────────────────┴──────────────┘

However, I want every group to report a min_duration of 10s, since each group has a point 10s before its upper boundary.
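To pin down the numbers I expect, here is a minimal plain-Rust sketch of the semantics. It does not use Polars and is eager, so it is not a solution to the lazy problem, only a reference for the intended result. The helper name `min_durations` is hypothetical, timestamps are written as seconds since midnight, the input is assumed to be sorted, and windows are assumed to be closed on both sides and to start at the first data point, as in the options above.

```rust
/// Hypothetical reference helper (not a Polars API).
/// For each window [lower, lower + every], closed on both sides, returns
/// (lower, upper, min duration), where the duration of the last point in
/// the window runs until the window's upper boundary.
/// Assumes `points` is sorted ascending and `every` is positive.
fn min_durations(points: &[i64], every: i64) -> Vec<(i64, i64, Option<i64>)> {
    let mut out = Vec::new();
    let (first, last) = match (points.first(), points.last()) {
        (Some(&f), Some(&l)) => (f, l),
        _ => return out, // no data, no windows
    };
    let mut lower = first; // windows start at the first data point
    while lower <= last {
        let upper = lower + every;
        // Points that fall inside the closed window.
        let mut seq: Vec<i64> = points
            .iter()
            .copied()
            .filter(|&t| t >= lower && t <= upper)
            .collect();
        if !seq.is_empty() {
            // Extend the last point's duration to the boundary, unless the
            // last point already sits exactly on it.
            if seq.last() != Some(&upper) {
                seq.push(upper);
            }
            // Smallest gap between consecutive entries.
            let min = seq.windows(2).map(|w| w[1] - w[0]).min();
            out.push((lower, upper, min));
        }
        lower = upper;
    }
    out
}
```

Calling `min_durations(&[60, 230, 250, 410, 420, 590], 180)` on the six timestamps above yields a minimum of 10 seconds for each of the three windows, which is the result I would like to get from the lazy query.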

As far as I know, you cannot access the boundary that Polars computes from within the aggregation; something like:

    let out_df = grp
        .agg([col("date")
            .append(col("_upper_boundary"), false)
            .diff(1, NullBehavior::Ignore)
            .min()
            .alias("min_duration")])
        .collect()
        .unwrap();

    dbg!(out_df);

will error, because the _upper_boundary column is not available until you collect.
