I have 2 dataframes:
df1
┌─────────────────────┬─────────────────────┬─────┐
│ start ┆ end ┆ id │
│ --- ┆ --- ┆ --- │
│ datetime[μs] ┆ datetime[μs] ┆ i64 │
╞═════════════════════╪═════════════════════╪═════╡
│ 2022-01-10 01:02:03 ┆ 2022-01-10 03:02:03 ┆ 1 │
│ 2022-01-10 05:02:03 ┆ 2022-01-10 07:02:03 ┆ 1 │
│ 2022-01-10 10:02:03 ┆ 2022-01-10 12:02:03 ┆ 2 │
│ 2022-01-10 03:02:03 ┆ 2022-01-10 05:02:03 ┆ 2 │
│ 2022-01-10 07:02:03 ┆ 2022-01-10 09:02:03 ┆ 3 │
│ 2022-01-10 02:02:03 ┆ 2022-01-10 04:02:03 ┆ 4 │
│ 2022-01-10 01:02:03 ┆ 2022-01-10 03:02:03 ┆ 3 │
│ 2022-01-10 02:02:03 ┆ 2022-01-10 04:02:03 ┆ 2 │
└─────────────────────┴─────────────────────┴─────┘
df2
┌─────┬─────────────────────┬──────┐
│ id ┆ time ┆ data │
│ --- ┆ --- ┆ --- │
│ i64 ┆ datetime[μs] ┆ i64 │
╞═════╪═════════════════════╪══════╡
│ 4 ┆ 2022-01-10 08:53:03 ┆ 21 │
│ 4 ┆ 2022-01-10 06:22:03 ┆ 21 │
│ 1 ┆ 2022-01-10 04:08:03 ┆ 20 │
│ 2 ┆ 2022-01-10 09:20:03 ┆ 29 │
│ ... ┆ ... ┆ ... │
│ 3 ┆ 2022-01-10 03:59:03 ┆ 30 │
│ 1 ┆ 2022-01-10 07:44:03 ┆ 24 │
│ 1 ┆ 2022-01-10 06:40:03 ┆ 28 │
│ 2 ┆ 2022-01-10 04:13:03 ┆ 24 │
└─────┴─────────────────────┴──────┘
I want to split df2
into multiple dataframes, based on the start
, end
, and id
columns from df1
.
I am currently doing this like this, but I want to avoid using a for
loop. Are there any more efficient ways of doing this? Perhaps with df.groupby
?
My current solution:
import polars as pl
from datetime import datetime
### Just some example data ####
records1 = []
dates = [1, 5, 10, 3, 7, 2, 1, 2]
idd = [1, 1, 2, 2, 3, 4, 3, 2, 1]
for i in range(len(dates)):
start = datetime(2022, 1, 10, dates[i], 2, 3)
end = datetime(2022, 1, 10, dates[i]+2, 2, 3)
records1.append({'start':start, 'end':end, 'id':idd[i]})
df1 = pl.from_records(records1)
records2 = []
hours = [8, 6, 4, 9, 4, 0, 6, 2, 7, 0, 0, 3, 7, 6, 4]
minutes = [53, 22, 8, 20, 53, 30, 58, 15, 49, 36, 30, 59, 44, 40, 13]
idd = [4, 4, 1, 2, 3, 4, 1, 3, 2, 2, 1, 3, 1, 1, 2]
data = [21, 21, 20, 29, 22, 25, 20, 26, 23, 21, 30, 30, 24, 28, 24]
for i in range(len(data)):
time = datetime(2022, 1, 10, hours[i], minutes[i], 3)
records2.append({'id': idd[i], 'time':time, 'data':data[i]})
df2 = pl.from_records(records2)
################### Start of code ##################
for i in range(len(df1)):
df3 = df2.filter(
(pl.col('time').is_between(df1['start'][i], df1['end'][i])) &
(pl.col('id') == df1['id'][i]))
print(df3)