
I'm migrating a pipeline from pandas to polars. The data records arrivals and departures of trucks docked at a warehouse, and at one step of the pipeline I need to calculate the number of trucks that are docked at any given time. That is, for every row (a docked truck) I need to count the number of unique trucks ("ID") present within a time window spanning from 1 minute before that truck's arrival time to 1 minute after its departure time. I still haven't found a method that is efficient and doesn't rely on applying a function row by row (pandas style).

import polars as pl

data = pl.from_repr("""
┌─────────────────────┬─────────────────────┬─────┐
│ arrival_time        ┆ departure_time      ┆ ID  │
│ ---                 ┆ ---                 ┆ --- │
│ datetime[μs]        ┆ datetime[μs]        ┆ str │
╞═════════════════════╪═════════════════════╪═════╡
│ 2023-01-01 06:23:47 ┆ 2023-01-01 06:25:08 ┆ A1  │
│ 2023-01-01 06:26:42 ┆ 2023-01-01 06:28:02 ┆ A1  │
│ 2023-01-01 06:30:20 ┆ 2023-01-01 06:35:01 ┆ A5  │
│ 2023-01-01 06:32:06 ┆ 2023-01-01 06:33:48 ┆ A6  │
│ 2023-01-01 06:33:09 ┆ 2023-01-01 06:36:01 ┆ B3  │
│ 2023-01-01 06:34:08 ┆ 2023-01-01 06:39:49 ┆ C3  │
│ 2023-01-01 06:36:40 ┆ 2023-01-01 06:38:34 ┆ A6  │
│ 2023-01-01 06:37:43 ┆ 2023-01-01 06:40:48 ┆ A5  │
│ 2023-01-01 06:39:48 ┆ 2023-01-01 06:46:10 ┆ A6  │
└─────────────────────┴─────────────────────┴─────┘
""")

The code I have so far; the first part is in polars and the last part still uses pandas:

processed_data = (
    data.sort(by=pl.col("arrival_time"))
    .with_columns(
        arrival_time_expanded=pl.col("arrival_time").dt.offset_by("-1m"),
        departure_time_expanded=pl.col("departure_time").dt.offset_by("1m"),
    )
    .to_pandas()
)
processed_data = processed_data.assign(
    docked_trucks=processed_data.apply(
        lambda row: processed_data[
            (processed_data.arrival_time_expanded <= row.departure_time)
            & (processed_data.departure_time_expanded >= row.arrival_time)
        ]["ID"].nunique(),
        axis=1,
    )
)

Result:

┌─────────────────────┬─────────────────────┬─────┬───────────────────────┬─────────────────────────┬───────────────┐
│ arrival_time        ┆ departure_time      ┆ ID  ┆ arrival_time_expanded ┆ departure_time_expanded ┆ docked_trucks │
│ ---                 ┆ ---                 ┆ --- ┆ ---                   ┆ ---                     ┆ ---           │
│ datetime[ns]        ┆ datetime[ns]        ┆ str ┆ datetime[ns]          ┆ datetime[ns]            ┆ i64           │
╞═════════════════════╪═════════════════════╪═════╪═══════════════════════╪═════════════════════════╪═══════════════╡
│ 2023-01-01 06:23:47 ┆ 2023-01-01 06:25:08 ┆ A1  ┆ 2023-01-01 06:22:47   ┆ 2023-01-01 06:26:08     ┆ 1             │
│ 2023-01-01 06:26:42 ┆ 2023-01-01 06:28:02 ┆ A1  ┆ 2023-01-01 06:25:42   ┆ 2023-01-01 06:29:02     ┆ 1             │
│ 2023-01-01 06:30:20 ┆ 2023-01-01 06:35:01 ┆ A5  ┆ 2023-01-01 06:29:20   ┆ 2023-01-01 06:36:01     ┆ 4             │
│ 2023-01-01 06:32:06 ┆ 2023-01-01 06:33:48 ┆ A6  ┆ 2023-01-01 06:31:06   ┆ 2023-01-01 06:34:48     ┆ 4             │
│ 2023-01-01 06:33:09 ┆ 2023-01-01 06:36:01 ┆ B3  ┆ 2023-01-01 06:32:09   ┆ 2023-01-01 06:37:01     ┆ 4             │
│ 2023-01-01 06:34:08 ┆ 2023-01-01 06:39:49 ┆ C3  ┆ 2023-01-01 06:33:08   ┆ 2023-01-01 06:40:49     ┆ 4             │
│ 2023-01-01 06:36:40 ┆ 2023-01-01 06:38:34 ┆ A6  ┆ 2023-01-01 06:35:40   ┆ 2023-01-01 06:39:34     ┆ 4             │
│ 2023-01-01 06:37:43 ┆ 2023-01-01 06:40:48 ┆ A5  ┆ 2023-01-01 06:36:43   ┆ 2023-01-01 06:41:48     ┆ 3             │
│ 2023-01-01 06:39:48 ┆ 2023-01-01 06:46:10 ┆ A6  ┆ 2023-01-01 06:38:48   ┆ 2023-01-01 06:47:10     ┆ 3             │
└─────────────────────┴─────────────────────┴─────┴───────────────────────┴─────────────────────────┴───────────────┘
JuanPy

2 Answers


We can use .join_asof() to find the start/end "row numbers" for each "window".

df = (
   df.with_columns(   
      window_open  = pl.col("arrival_time").dt.offset_by("-1m"),
      window_close = pl.col("departure_time").dt.offset_by("1m")
   )
   .sort("arrival_time")
   .with_row_count("event_id")
)

start = (
   df.sort("window_open") 
     .join_asof(
        df.sort("departure_time").with_row_count("start"), 
        left_on  = "window_open",  
        right_on = "departure_time"
     )
     .select(pl.col("start").backward_fill())
)

end = (
   df.sort("window_close")
     .join_asof(
        df.sort("arrival_time").with_row_count("end"),   
        left_on  = "window_close", 
        right_on = "arrival_time", 
        strategy = "forward"
     )
     .select(pl.col("end").forward_fill() + 1)
)
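
As a quick check, the two helper frames can be stacked side by side to see each row's window boundaries; a minimal sketch, assuming the start and end frames defined above:

# hypothetical inspection step: per-row start/end indices next to each other
pl.concat([start, end], how="horizontal")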

We can then:

  • expand the range
  • convert to a struct
  • unnest the struct to create columns
shape: (9, 2)            shape: (9, 8)
┌───────┬─────┐          ┌─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┐
│ start ┆ end │          │ field_0 ┆ field_1 ┆ field_2 ┆ field_3 ┆ field_4 ┆ field_5 ┆ field_6 ┆ field_7 │
│ ---   ┆ --- │          │ ---     ┆ ---     ┆ ---     ┆ ---     ┆ ---     ┆ ---     ┆ ---     ┆ ---     │
│ u32   ┆ u32 │          │ i64     ┆ i64     ┆ i64     ┆ i64     ┆ i64     ┆ i64     ┆ i64     ┆ i64     │
╞═══════╪═════╡          ╞═════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╡
│ 0     ┆ 2   │          │ 0       ┆ 1       ┆ null    ┆ null    ┆ null    ┆ null    ┆ null    ┆ null    │
│ 0     ┆ 3   │          │ 0       ┆ 1       ┆ 2       ┆ null    ┆ null    ┆ null    ┆ null    ┆ null    │
│ 1     ┆ 7   │    ->    │ 1       ┆ 2       ┆ 3       ┆ 4       ┆ 5       ┆ 6       ┆ null    ┆ null    │
│ 1     ┆ 7   │          │ 1       ┆ 2       ┆ 3       ┆ 4       ┆ 5       ┆ 6       ┆ null    ┆ null    │
│ 1     ┆ 8   │          │ 1       ┆ 2       ┆ 3       ┆ 4       ┆ 5       ┆ 6       ┆ 7       ┆ null    │
│ 1     ┆ 9   │          │ 1       ┆ 2       ┆ 3       ┆ 4       ┆ 5       ┆ 6       ┆ 7       ┆ 8       │
│ 3     ┆ 9   │          │ 3       ┆ 4       ┆ 5       ┆ 6       ┆ 7       ┆ 8       ┆ null    ┆ null    │
│ 4     ┆ 9   │          │ 4       ┆ 5       ┆ 6       ┆ 7       ┆ 8       ┆ null    ┆ null    ┆ null    │
│ 5     ┆ 9   │          │ 5       ┆ 6       ┆ 7       ┆ 8       ┆ null    ┆ null    ┆ null    ┆ null    │
└───────┴─────┘          └─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┘

We can then pass the field_\d+ values to .take() to get the corresponding column values (the full code for this is shown below under "Sideways approach").


Update: I realized that it's simpler and faster to expand the range lengthways instead of sideways.

This gives you the rows for each group and you can then .take(idx) to fill in the corresponding values.

You can then .join the groups with the original rows to compare.

df = df.lazy()

start = start.lazy()
end = end.lazy()

groups = (
   df.select('event_id').with_context([start, end])
     .select('event_id', pl.arange('start', 'end'))
     .explode('arange')
     .with_context(df)
     .select(pl.all().take(pl.col('arange')), group = 'event_id')
)

(df.join(groups, left_on='event_id', right_on='group', how='left')
   .filter(pl.all(
      pl.col("window_open")  <= pl.col("departure_time_right"),
      pl.col("window_close") >= pl.col("arrival_time_right")
   ))
   .groupby('window_open', 'window_close', maintain_order=True)
   .agg(
      pl.col('arrival_time', 'departure_time', 'ID').first(), 
      docked_trucks = pl.n_unique('ID_right')
   )
   .collect()
)

Sideways approach:

(pl.concat([df, start, end], how="horizontal")
   .with_columns(
      pl.arange("start", "end").list.to_struct(n_field_strategy="max_width")
   )
   .unnest("arange")
   .with_columns(
      arrival   = pl.concat_list(pl.col("arrival_time").take("^field_\d+$")),
      departure = pl.concat_list(pl.col("departure_time").take("^field_\d+$")),
      truck_id  = pl.concat_list(pl.col("ID").take("^field_\d+$"))
   )
   .explode("arrival", "departure", "truck_id")
   .filter(
      pl.all(
         pl.col("window_open")  <= pl.col("departure"), 
         pl.col("window_close") >= pl.col("arrival")
      )
   )
   .groupby("window_open", "window_close")
   .agg(
      pl.col("event_id", "arrival_time", "departure_time", "ID").first(), 
      docked_trucks = pl.col("truck_id").n_unique()
   )
   .sort("event_id") # sort and re-order columns for output
   .select( 
       "event_id",
       "arrival_time",
       "departure_time",
       "ID",
       "window_open",
       "window_close",
       "docked_trucks",
   )
)     

Output:

shape: (9, 7)
┌──────────┬─────────────────────┬─────────────────────┬─────┬─────────────────────┬─────────────────────┬───────────────┐
│ event_id ┆ arrival_time        ┆ departure_time      ┆ ID  ┆ window_open         ┆ window_close        ┆ docked_trucks │
│ ---      ┆ ---                 ┆ ---                 ┆ --- ┆ ---                 ┆ ---                 ┆ ---           │
│ u32      ┆ datetime[μs]        ┆ datetime[μs]        ┆ str ┆ datetime[μs]        ┆ datetime[μs]        ┆ u32           │
╞══════════╪═════════════════════╪═════════════════════╪═════╪═════════════════════╪═════════════════════╪═══════════════╡
│ 0        ┆ 2023-01-01 06:23:47 ┆ 2023-01-01 06:25:08 ┆ A1  ┆ 2023-01-01 06:22:47 ┆ 2023-01-01 06:26:08 ┆ 1             │
│ 1        ┆ 2023-01-01 06:26:42 ┆ 2023-01-01 06:28:02 ┆ A1  ┆ 2023-01-01 06:25:42 ┆ 2023-01-01 06:29:02 ┆ 1             │
│ 2        ┆ 2023-01-01 06:30:20 ┆ 2023-01-01 06:35:01 ┆ A5  ┆ 2023-01-01 06:29:20 ┆ 2023-01-01 06:36:01 ┆ 4             │
│ 3        ┆ 2023-01-01 06:32:06 ┆ 2023-01-01 06:33:48 ┆ A6  ┆ 2023-01-01 06:31:06 ┆ 2023-01-01 06:34:48 ┆ 4             │
│ 4        ┆ 2023-01-01 06:33:09 ┆ 2023-01-01 06:36:01 ┆ B3  ┆ 2023-01-01 06:32:09 ┆ 2023-01-01 06:37:01 ┆ 4             │
│ 5        ┆ 2023-01-01 06:34:08 ┆ 2023-01-01 06:39:49 ┆ C3  ┆ 2023-01-01 06:33:08 ┆ 2023-01-01 06:40:49 ┆ 4             │
│ 6        ┆ 2023-01-01 06:36:40 ┆ 2023-01-01 06:38:34 ┆ A6  ┆ 2023-01-01 06:35:40 ┆ 2023-01-01 06:39:34 ┆ 4             │
│ 7        ┆ 2023-01-01 06:37:43 ┆ 2023-01-01 06:40:48 ┆ A5  ┆ 2023-01-01 06:36:43 ┆ 2023-01-01 06:41:48 ┆ 3             │
│ 8        ┆ 2023-01-01 06:39:48 ┆ 2023-01-01 06:46:10 ┆ A6  ┆ 2023-01-01 06:38:48 ┆ 2023-01-01 06:47:10 ┆ 3             │
└──────────┴─────────────────────┴─────────────────────┴─────┴─────────────────────┴─────────────────────┴───────────────┘
jqurious
  • This is orders of magnitude faster than my original code, but I don't get the same results when applying it to my dataset: it works with the example dataset I provided in my question but not with the dataset I'm working with, so your answer is correct. – JuanPy Jun 19 '23 at 09:40
  • Interesting. If you can figure out an example dataset where it differs, please post it so I can try to modify the answer. – jqurious Jun 19 '23 at 09:56
  • I've edited my question to include an example DataFrame that is closer to the actual data; with this new example, the code you provided returns a different output from the original code. – JuanPy Jun 19 '23 at 10:42
  • Thanks for the update. It's actually much more complex than I initially realized. The problem is that there are multiple overlapping windows, so assigning a single window id like I did for the first example is wrong. – jqurious Jun 20 '23 at 08:53
  • The updated code works and it's around 50x faster than the original code. Thanks. – JuanPy Jun 20 '23 at 11:16
  • I realized that it's simpler to go lengthways instead of going sideways and adding new columns. It's also much faster. I've added an updated approach. – jqurious Jun 23 '23 at 16:58

So I came up with two options; sadly, both can explode with a huge amount of data:

First solution:

You first do a cross join, then filter out the wrong matches, and then do a groupby. Sadly, due to the cross join, this query can explode with a lot of data.

# assumes processed_data is the polars frame with the *_expanded columns
# (i.e. before the .to_pandas() call in the question)
data = processed_data.join(data, how="cross").filter(
    (pl.col("arrival_time_expanded") < pl.col("departure_time_right"))
    & (pl.col("departure_time_expanded") > pl.col("arrival_time_right"))
)
data.groupby(
    pl.col(
        "arrival_time",
        "departure_time",
        "ID",
        "arrival_time_expanded",
        "departure_time_expanded",
    )
).agg(pl.count())
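
Note that the question asks for the number of unique trucks, so if the same ID can re-dock within a window, pl.count() counts it more than once. A minimal sketch of the same cross join using n_unique instead (assuming polars' default "_right" suffix on the right-hand columns; the docked name is just illustrative):

import polars as pl

# hypothetical variant: count distinct truck IDs per window instead of matching rows
docked = (
    processed_data.join(data, how="cross")
    .filter(
        (pl.col("arrival_time_expanded") <= pl.col("departure_time_right"))
        & (pl.col("departure_time_expanded") >= pl.col("arrival_time_right"))
    )
    .groupby(
        "arrival_time",
        "departure_time",
        "ID",
        "arrival_time_expanded",
        "departure_time_expanded",
    )
    .agg(pl.col("ID_right").n_unique().alias("docked_trucks"))
)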

Second solution:

This might perform better with big data, but I haven't tested it, and you also lose a little bit of precision. To reduce cardinality, we round the arrival and departure times to the nearest minute and then explode the table so that, for each minute, we know which trucks were at the warehouse.

time_precision = "1m"

processed_data = data.sort(by=pl.col("arrival_time")).with_columns(
    arrival_time_expanded=pl.col("arrival_time").dt.round(time_precision).dt.offset_by("-1m"),
    departure_time_expanded=pl.col("departure_time").dt.round(time_precision).dt.offset_by("1m"),
)
processed_data = processed_data.with_columns(
    pl.date_range(
        pl.col("arrival_time_expanded"),
        pl.col("departure_time_expanded"),
        interval=time_precision,
        eager=False,
    ).alias("interval")
).explode("interval").unique(["ID", "interval"])

processed_data.with_columns(pl.count().over(pl.col("interval"))).groupby(
    pl.col(
        "arrival_time",
        "departure_time",
        "ID",
        "arrival_time_expanded",
        "departure_time_expanded",
    )
).agg(pl.col("count").max()).sort("arrival_time")


shape: (9, 6)
┌─────────────────────┬─────────────────────┬─────┬───────────────────────┬─────────────────────────┬───────┐
│ arrival_time        ┆ departure_time      ┆ ID  ┆ arrival_time_expanded ┆ departure_time_expanded ┆ count │
│ ---                 ┆ ---                 ┆ --- ┆ ---                   ┆ ---                     ┆ ---   │
│ datetime[μs]        ┆ datetime[μs]        ┆ str ┆ datetime[μs]          ┆ datetime[μs]            ┆ u32   │
╞═════════════════════╪═════════════════════╪═════╪═══════════════════════╪═════════════════════════╪═══════╡
│ 2023-01-01 06:23:47 ┆ 2023-01-01 06:25:08 ┆ A1  ┆ 2023-01-01 06:23:00   ┆ 2023-01-01 06:26:00     ┆ 1     │
│ 2023-01-01 06:26:42 ┆ 2023-01-01 06:28:02 ┆ A1  ┆ 2023-01-01 06:26:00   ┆ 2023-01-01 06:29:00     ┆ 2     │
│ 2023-01-01 06:30:20 ┆ 2023-01-01 06:35:01 ┆ A5  ┆ 2023-01-01 06:29:00   ┆ 2023-01-01 06:36:00     ┆ 4     │
│ 2023-01-01 06:32:06 ┆ 2023-01-01 06:33:48 ┆ A6  ┆ 2023-01-01 06:31:00   ┆ 2023-01-01 06:35:00     ┆ 4     │
│ 2023-01-01 06:33:09 ┆ 2023-01-01 06:36:01 ┆ B3  ┆ 2023-01-01 06:32:00   ┆ 2023-01-01 06:37:00     ┆ 4     │
│ 2023-01-01 06:34:08 ┆ 2023-01-01 06:39:49 ┆ C3  ┆ 2023-01-01 06:33:00   ┆ 2023-01-01 06:41:00     ┆ 4     │
│ 2023-01-01 06:36:40 ┆ 2023-01-01 06:38:34 ┆ A6  ┆ 2023-01-01 06:36:00   ┆ 2023-01-01 06:40:00     ┆ 4     │
│ 2023-01-01 06:37:43 ┆ 2023-01-01 06:40:48 ┆ A5  ┆ 2023-01-01 06:37:00   ┆ 2023-01-01 06:42:00     ┆ 4     │
│ 2023-01-01 06:39:48 ┆ 2023-01-01 06:46:10 ┆ A6  ┆ 2023-01-01 06:39:00   ┆ 2023-01-01 06:47:00     ┆ 3     │
└─────────────────────┴─────────────────────┴─────┴───────────────────────┴─────────────────────────┴───────┘
alexp
  • The cross join option worked but it was significantly slower than the original code. I couldn't make the second solution work with my dataset. I will explore it further to see if I can find a workaround. – JuanPy Jun 19 '23 at 09:36
  • So I updated the second solution with your updated data. I realized a problem with my second solution: I wasn't comparing the overlaps between the expanded and normal times, but the overlaps of the expanded times. – alexp Jun 19 '23 at 11:29