
I have 2 dataframes:

df1

┌─────────────────────┬─────────────────────┬─────┐
│ start               ┆ end                 ┆ id  │
│ ---                 ┆ ---                 ┆ --- │
│ datetime[μs]        ┆ datetime[μs]        ┆ i64 │
╞═════════════════════╪═════════════════════╪═════╡
│ 2022-01-10 01:02:03 ┆ 2022-01-10 03:02:03 ┆ 1   │
│ 2022-01-10 05:02:03 ┆ 2022-01-10 07:02:03 ┆ 1   │
│ 2022-01-10 10:02:03 ┆ 2022-01-10 12:02:03 ┆ 2   │
│ 2022-01-10 03:02:03 ┆ 2022-01-10 05:02:03 ┆ 2   │
│ 2022-01-10 07:02:03 ┆ 2022-01-10 09:02:03 ┆ 3   │
│ 2022-01-10 02:02:03 ┆ 2022-01-10 04:02:03 ┆ 4   │
│ 2022-01-10 01:02:03 ┆ 2022-01-10 03:02:03 ┆ 3   │
│ 2022-01-10 02:02:03 ┆ 2022-01-10 04:02:03 ┆ 2   │
└─────────────────────┴─────────────────────┴─────┘

df2

┌─────┬─────────────────────┬──────┐
│ id  ┆ time                ┆ data │
│ --- ┆ ---                 ┆ ---  │
│ i64 ┆ datetime[μs]        ┆ i64  │
╞═════╪═════════════════════╪══════╡
│ 4   ┆ 2022-01-10 08:53:03 ┆ 21   │
│ 4   ┆ 2022-01-10 06:22:03 ┆ 21   │
│ 1   ┆ 2022-01-10 04:08:03 ┆ 20   │
│ 2   ┆ 2022-01-10 09:20:03 ┆ 29   │
│ ... ┆ ...                 ┆ ...  │
│ 3   ┆ 2022-01-10 03:59:03 ┆ 30   │
│ 1   ┆ 2022-01-10 07:44:03 ┆ 24   │
│ 1   ┆ 2022-01-10 06:40:03 ┆ 28   │
│ 2   ┆ 2022-01-10 04:13:03 ┆ 24   │
└─────┴─────────────────────┴──────┘

I want to split df2 into multiple dataframes, based on the start, end, and id columns from df1.

I am currently doing this as shown below, but I want to avoid the for loop. Is there a more efficient way of doing this, perhaps with df.groupby?

My current solution:

import polars as pl
from datetime import datetime

### Just some example data ####

records1 = []
dates = [1, 5, 10, 3, 7, 2, 1, 2]
idd   = [1, 1, 2, 2, 3, 4, 3, 2]

for i in range(len(dates)):
    start = datetime(2022, 1, 10, dates[i], 2, 3)
    end = datetime(2022, 1, 10, dates[i]+2, 2, 3)
    records1.append({'start':start, 'end':end, 'id':idd[i]})

df1 = pl.from_records(records1)

records2 = []
hours = [8, 6, 4, 9, 4, 0, 6, 2, 7, 0, 0, 3, 7, 6, 4]
minutes  = [53, 22, 8, 20, 53, 30, 58, 15, 49, 36, 30, 59, 44, 40, 13]
idd   = [4, 4, 1, 2, 3, 4, 1, 3, 2, 2, 1, 3, 1, 1, 2]
data = [21, 21, 20, 29, 22, 25, 20, 26, 23, 21, 30, 30, 24, 28, 24]

for i in range(len(data)):
    time = datetime(2022, 1, 10, hours[i], minutes[i], 3)
    records2.append({'id': idd[i], 'time':time, 'data':data[i]})

df2 = pl.from_records(records2)

################### Start of code ##################

for i in range(len(df1)):
    df3 = df2.filter(
        (pl.col('time').is_between(df1['start'][i],  df1['end'][i])) & 
        (pl.col('id') == df1['id'][i]))
   
    print(df3)

2 Answers


It looks like you're doing the equivalent of a .join().filter():

(df2.join(df1, on="id")
    .filter(pl.col("time").is_between(pl.col("start"), pl.col("end"))))
shape: (4, 5)
┌─────┬─────────────────────┬──────┬─────────────────────┬─────────────────────┐
│ id  | time                | data | start               | end                 │
│ --- | ---                 | ---  | ---                 | ---                 │
│ i64 | datetime[μs]        | i64  | datetime[μs]        | datetime[μs]        │
╞═════╪═════════════════════╪══════╪═════════════════════╪═════════════════════╡
│ 1   | 2022-01-10 06:58:03 | 20   | 2022-01-10 05:02:03 | 2022-01-10 07:02:03 │
├─────┼─────────────────────┼──────┼─────────────────────┼─────────────────────┤
│ 3   | 2022-01-10 02:15:03 | 26   | 2022-01-10 01:02:03 | 2022-01-10 03:02:03 │
├─────┼─────────────────────┼──────┼─────────────────────┼─────────────────────┤
│ 1   | 2022-01-10 06:40:03 | 28   | 2022-01-10 05:02:03 | 2022-01-10 07:02:03 │
├─────┼─────────────────────┼──────┼─────────────────────┼─────────────────────┤
│ 2   | 2022-01-10 04:13:03 | 24   | 2022-01-10 03:02:03 | 2022-01-10 05:02:03 │
└─────┴─────────────────────┴──────┴─────────────────────┴─────────────────────┘

You can then .group_by(["id", "start", "end"]) (named .groupby in older polars versions) to process each interval/range.

If the intervals for a given id don't overlap, you could also use .join_asof() as explained in Python Polars join on column with greater or equal, which is more efficient.

(df2.sort("time")
    .join_asof(
        df1.sort("start").with_columns(pl.col("start").alias("time")), 
        on="time", 
        by="id")
    .filter(pl.col("time").is_between(pl.col("start"), pl.col("end"))))
shape: (4, 5)
┌─────┬─────────────────────┬──────┬─────────────────────┬─────────────────────┐
│ id  | time                | data | start               | end                 │
│ --- | ---                 | ---  | ---                 | ---                 │
│ i64 | datetime[μs]        | i64  | datetime[μs]        | datetime[μs]        │
╞═════╪═════════════════════╪══════╪═════════════════════╪═════════════════════╡
│ 3   | 2022-01-10 02:15:03 | 26   | 2022-01-10 01:02:03 | 2022-01-10 03:02:03 │
├─────┼─────────────────────┼──────┼─────────────────────┼─────────────────────┤
│ 2   | 2022-01-10 04:13:03 | 24   | 2022-01-10 03:02:03 | 2022-01-10 05:02:03 │
├─────┼─────────────────────┼──────┼─────────────────────┼─────────────────────┤
│ 1   | 2022-01-10 06:40:03 | 28   | 2022-01-10 05:02:03 | 2022-01-10 07:02:03 │
├─────┼─────────────────────┼──────┼─────────────────────┼─────────────────────┤
│ 1   | 2022-01-10 06:58:03 | 20   | 2022-01-10 05:02:03 | 2022-01-10 07:02:03 │
└─────┴─────────────────────┴──────┴─────────────────────┴─────────────────────┘
jqurious

You can do an asof join to find, for each time, the closest start that doesn't go over it, and then filter out any rows where time is greater than end. An asof join requires both dataframes to be sorted on the join columns first, so it looks like this:

(df2.sort('time')
 .join_asof(df1.sort('start'), 
            left_on='time', right_on='start', by='id', strategy='backward')
 .filter(pl.col('time')<=pl.col('end')))

The left_on and right_on parameters name the inexact-match columns, where a match is the closest value in one direction only, while the by parameter names the exact-match columns, in this case id. The strategy tells the join which direction to search from each left value (here backward) until it finds a match.

You could also do it in the opposite direction if you want:

(df2.sort('time')
 .join_asof(df1.sort('end'), 
            left_on='time', right_on='end', by='id', strategy='forward')
 .filter(pl.col('time')>=pl.col('start')))

Either way you get:

shape: (4, 5)
┌─────┬─────────────────────┬──────┬─────────────────────┬─────────────────────┐
│ id  ┆ time                ┆ data ┆ start               ┆ end                 │
│ --- ┆ ---                 ┆ ---  ┆ ---                 ┆ ---                 │
│ i64 ┆ datetime[μs]        ┆ i64  ┆ datetime[μs]        ┆ datetime[μs]        │
╞═════╪═════════════════════╪══════╪═════════════════════╪═════════════════════╡
│ 3   ┆ 2022-01-10 02:15:03 ┆ 26   ┆ 2022-01-10 01:02:03 ┆ 2022-01-10 03:02:03 │
│ 2   ┆ 2022-01-10 04:13:03 ┆ 24   ┆ 2022-01-10 03:02:03 ┆ 2022-01-10 05:02:03 │
│ 1   ┆ 2022-01-10 06:40:03 ┆ 28   ┆ 2022-01-10 05:02:03 ┆ 2022-01-10 07:02:03 │
│ 1   ┆ 2022-01-10 06:58:03 ┆ 20   ┆ 2022-01-10 05:02:03 ┆ 2022-01-10 07:02:03 │
└─────┴─────────────────────┴──────┴─────────────────────┴─────────────────────┘

Another thing to remember is that join_asof is a left join, so call it on whichever dataframe you want to keep rows from. In other words, you can't do a version of df1.join_asof(df2, ...) to get your desired result in this case.

Dean MacGregor