I'm migrating a pipeline from pandas
to polars
, the data is for arrivals and departures of trucks docked in a warehouse, in a certain step of the pipeline I need to calculate the number of trucks that are docked at any given time, that is, for every row (a truck docked) I need to calculate the number of unique trucks ("ID") that are there in a time window (- 1 minutes of arrival time and + 1 minutes of departure time). I sill haven't found a method that is efficient and doesn't rely on applying a function row by row ( pandas
's style).
data = pl.from_repr("""
┌─────────────────────┬─────────────────────┬─────┐
│ arrival_time ┆ departure_time ┆ ID │
│ --- ┆ --- ┆ --- │
│ datetime[μs] ┆ datetime[μs] ┆ str │
╞═════════════════════╪═════════════════════╪═════╡
│ 2023-01-01 06:23:47 ┆ 2023-01-01 06:25:08 ┆ A1 │
│ 2023-01-01 06:26:42 ┆ 2023-01-01 06:28:02 ┆ A1 │
│ 2023-01-01 06:30:20 ┆ 2023-01-01 06:35:01 ┆ A5 │
│ 2023-01-01 06:32:06 ┆ 2023-01-01 06:33:48 ┆ A6 │
│ 2023-01-01 06:33:09 ┆ 2023-01-01 06:36:01 ┆ B3 │
│ 2023-01-01 06:34:08 ┆ 2023-01-01 06:39:49 ┆ C3 │
│ 2023-01-01 06:36:40 ┆ 2023-01-01 06:38:34 ┆ A6 │
│ 2023-01-01 06:37:43 ┆ 2023-01-01 06:40:48 ┆ A5 │
│ 2023-01-01 06:39:48 ┆ 2023-01-01 06:46:10 ┆ A6 │
└─────────────────────┴─────────────────────┴─────┘
""")
The code I have so far, the first part is in polars
and the last one still uses pandas
:
processed_data = (
data.sort(by=pl.col("arrival_time"))
.with_columns(
arrival_time_expanded=pl.col("arrival_time").dt.offset_by("-1m"),
departure_time_expanded=pl.col("departure_time").dt.offset_by("1m"),
)
.to_pandas()
)
processed_data = processed_data.assign(
docked_trucks=processed_data.apply(
lambda row: processed_data[
(processed_data.arrival_time_expanded <= row.departure_time)
& (processed_data.departure_time_expanded >= row.arrival_time)
]["ID"].nunique(),
axis=1,
)
)
Result:
┌──────────────┬────────────────┬─────┬──────────────────────┬─────────────────────┬───────────────┐
│ arrival_time ┆ departure_time ┆ ID ┆ arrival_time_expande ┆ departure_time_expa ┆ docked_trucks │
│ --- ┆ --- ┆ --- ┆ d ┆ nded ┆ --- │
│ datetime[ns] ┆ datetime[ns] ┆ str ┆ --- ┆ --- ┆ i64 │
│ ┆ ┆ ┆ datetime[ns] ┆ datetime[ns] ┆ │
╞══════════════╪════════════════╪═════╪══════════════════════╪═════════════════════╪═══════════════╡
│ 2023-01-01 ┆ 2023-01-01 ┆ A1 ┆ 2023-01-01 06:22:47 ┆ 2023-01-01 06:26:08 ┆ 1 │
│ 06:23:47 ┆ 06:25:08 ┆ ┆ ┆ ┆ │
│ 2023-01-01 ┆ 2023-01-01 ┆ A1 ┆ 2023-01-01 06:25:42 ┆ 2023-01-01 06:29:02 ┆ 1 │
│ 06:26:42 ┆ 06:28:02 ┆ ┆ ┆ ┆ │
│ 2023-01-01 ┆ 2023-01-01 ┆ A5 ┆ 2023-01-01 06:29:20 ┆ 2023-01-01 06:36:01 ┆ 4 │
│ 06:30:20 ┆ 06:35:01 ┆ ┆ ┆ ┆ │
│ 2023-01-01 ┆ 2023-01-01 ┆ A6 ┆ 2023-01-01 06:31:06 ┆ 2023-01-01 06:34:48 ┆ 4 │
│ 06:32:06 ┆ 06:33:48 ┆ ┆ ┆ ┆ │
│ 2023-01-01 ┆ 2023-01-01 ┆ B3 ┆ 2023-01-01 06:32:09 ┆ 2023-01-01 06:37:01 ┆ 4 │
│ 06:33:09 ┆ 06:36:01 ┆ ┆ ┆ ┆ │
│ 2023-01-01 ┆ 2023-01-01 ┆ C3 ┆ 2023-01-01 06:33:08 ┆ 2023-01-01 06:40:49 ┆ 4 │
│ 06:34:08 ┆ 06:39:49 ┆ ┆ ┆ ┆ │
│ 2023-01-01 ┆ 2023-01-01 ┆ A6 ┆ 2023-01-01 06:35:40 ┆ 2023-01-01 06:39:34 ┆ 4 │
│ 06:36:40 ┆ 06:38:34 ┆ ┆ ┆ ┆ │
│ 2023-01-01 ┆ 2023-01-01 ┆ A5 ┆ 2023-01-01 06:36:43 ┆ 2023-01-01 06:41:48 ┆ 3 │
│ 06:37:43 ┆ 06:40:48 ┆ ┆ ┆ ┆ │
│ 2023-01-01 ┆ 2023-01-01 ┆ A6 ┆ 2023-01-01 06:38:48 ┆ 2023-01-01 06:47:10 ┆ 3 │
│ 06:39:48 ┆ 06:46:10 ┆ ┆ ┆ ┆ │
└──────────────┴────────────────┴─────┴──────────────────────┴─────────────────────┴───────────────┘