I have some time-series data in a polars LazyFrame, in which I detect events: extended periods during which some criterion is true. The signal I base this on isn't reliable, so if the gap between two events is very small I merge them into one, and I extend the time window associated with each event slightly to avoid cutting off data.
Through a series of group-bys and aggregations, I end up with a derived dataframe containing a start_time and end_time for each event.
The end goal is to be able to get a lazily filtered version of the LazyFrame lf, based on the contents of lf_events_groupby in the example below.
(Apologies for the large setup; I wanted to make sure the example was complete!)
Example setup
Generate dummy data using Perlin noise
import polars as pl
import pendulum
import numpy as np
import noise
import matplotlib.pyplot as plt
# Generate time series
t_start = pendulum.datetime(2023, 1, 1, tz=None)
t_end = pendulum.datetime(2023, 2, 1, tz=None)
t = pl.datetime_range(t_start, t_end, interval='30s', eager=True)
# Generate a continuous random signal from Perlin noise
y = np.array([noise.pnoise1(i, octaves=4) for i in np.linspace(0, 10, len(t))])
lf = pl.DataFrame({'t': t, 'data': y}).lazy()  # This would usually be generated by a pl.scan_parquet() call.
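(In real use, lf would come from a scan rather than an in-memory frame, i.e. something like the line below; the path is hypothetical.)
# Hypothetical path, shown commented out so the dummy-data example stays runnable
# lf = pl.scan_parquet('data/timeseries/*.parquet')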
View raw data
df = lf.collect()
plt.plot(df['t'], df['data'])
ax = plt.gca()
Event detection
As an example, we'll say a data point is part of an event if it has a positive value. We detect events by:
lf = lf.with_columns(pl.col('data').gt(0).alias('is_event'))
And view the corresponding events:
df = lf.collect()
plt.plot(df['t'], df['data'])
ax = plt.gca()
ax.fill_between(df['t'], 0, 1,
                where=df['is_event'],
                color='green', alpha=0.5,
                transform=ax.get_xaxis_transform())
Event splitting and grouping
# Only look at event data
lf_events = lf.filter(pl.col("is_event"))
# Calculate time since the last data point
lf_events = lf_events.with_columns(
    pl.col("t").diff().fill_null(pl.duration(seconds=0)).alias("time_since_last_event")
)
# Label a point as the start of a new event if the time since the last
# data point exceeds a threshold (10 minutes)
lf_events = lf_events.with_columns(
    pl.col("time_since_last_event").gt(pl.duration(minutes=10)).alias("new_event")
)
# Rising-edge counter: number each event
lf_events = lf_events.with_columns(
    pl.col("new_event").cast(pl.Int8).diff().eq(1).cum_sum().alias("event_number")
)
# Group by event; extend the search window by 30s on each side
lf_events_groupby = lf_events.group_by('event_number', maintain_order=True).agg([
    pl.col('t').min().alias('start_time'),
    pl.col('t').max().alias('end_time'),
    (pl.col('t').min() - pl.duration(seconds=30)).alias('start_extended'),
    (pl.col('t').max() + pl.duration(seconds=30)).alias('end_extended'),
])
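(As a quick sanity check, the intermediate columns can be inspected with a throwaway collect; this is just for debugging, not part of the pipeline.)
# Peek at the derived event columns on the first few rows
print(
    lf_events
    .select(['t', 'time_since_last_event', 'new_event', 'event_number'])
    .head()
    .collect()
)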
The contents of lf_events_groupby (after a .collect()) are as follows:
┌──────────────┬─────────────────────┬─────────────────────┬─────────────────────┬─────────────────────┐
│ event_number ┆ start_time          ┆ end_time            ┆ start_extended      ┆ end_extended        │
│ ---          ┆ ---                 ┆ ---                 ┆ ---                 ┆ ---                 │
│ u32          ┆ datetime[μs]        ┆ datetime[μs]        ┆ datetime[μs]        ┆ datetime[μs]        │
╞══════════════╪═════════════════════╪═════════════════════╪═════════════════════╪═════════════════════╡
│ 0            ┆ 2023-01-01 00:00:30 ┆ 2023-01-04 02:23:30 ┆ 2023-01-01 00:00:00 ┆ 2023-01-04 02:24:00 │
│ 1            ┆ 2023-01-04 08:19:00 ┆ 2023-01-07 04:47:30 ┆ 2023-01-04 08:18:30 ┆ 2023-01-07 04:48:00 │
│ 2            ┆ 2023-01-08 00:47:30 ┆ 2023-01-08 05:42:30 ┆ 2023-01-08 00:47:00 ┆ 2023-01-08 05:43:00 │
│ 3            ┆ 2023-01-08 18:00:30 ┆ 2023-01-08 23:40:30 ┆ 2023-01-08 18:00:00 ┆ 2023-01-08 23:41:00 │
│ ...          ┆ ...                 ┆ ...                 ┆ ...                 ┆ ...                 │
│ 14           ┆ 2023-01-24 06:00:30 ┆ 2023-01-25 10:49:30 ┆ 2023-01-24 06:00:00 ┆ 2023-01-25 10:50:00 │
│ 15           ┆ 2023-01-25 19:12:30 ┆ 2023-01-26 06:25:30 ┆ 2023-01-25 19:12:00 ┆ 2023-01-26 06:26:00 │
│ 16           ┆ 2023-01-27 08:24:30 ┆ 2023-01-28 00:50:00 ┆ 2023-01-27 08:24:00 ┆ 2023-01-28 00:50:30 │
│ 17           ┆ 2023-01-28 21:36:30 ┆ 2023-01-29 20:48:00 ┆ 2023-01-28 21:36:00 ┆ 2023-01-29 20:48:30 │
└──────────────┴─────────────────────┴─────────────────────┴─────────────────────┴─────────────────────┘
The goal
I'm after some operation where I pass lf_events_groupby into a filter on lf, so that I get the raw data split by which event it belongs to. For an individual event I can do this with lf.filter() and .is_between(), but I can't work out how to do it for all events at once without resorting to loops or other approaches which break the parallelism of polars.
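For concreteness, here is the per-event version I can already write (an illustrative sketch only; it pulls a single row out eagerly, and repeating it per event means a Python loop):
# Filter the raw data for a single event (event 0 here).
# Doing this for every event requires collecting and looping in Python,
# which is exactly what I want to avoid.
event = lf_events_groupby.collect().row(0, named=True)
lf_event_0 = lf.filter(
    pl.col('t').is_between(event['start_extended'], event['end_extended'])
)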
I've read through the documentation and can't see anything which addresses this specific case.