Compare 2 tables in Polars and select a value based on that comparison

Question

I have a table like this in polars:

arrival_time	Train
08:40:10	112
19:31:26	134

An I have another table that defines the period of the day based on the hours:

Time	Period
08:00:00	Early
16:00:00	Afternoon

What I am trying to achieve is a Polars way to combine both tables and obtain in the first table the Period of the day as a new column:

arrival_time	Train	Period
08:40:10	112	Early
19:31:26	134	Afternoon

For now what I am doing is working entirely with dictionaries, zipping the 2 columns of my comparison table and computing the key of the minimium distance beetween the 2 time columns:

min(dict(zip( df.Period,df.Time)).items(), key=lambda x: abs(pl.col('arrival_time') - x[1]))[0])

But i am certainly sure that there's a better way to process in Polars.

Can you add the code to create your data next time in the question? That helps a lot when trying to answer the questions. — ritchie46, Jun 21 '22 at 12:44

score 3 · Answer 1 · answered Jun 21 '22 at 12:49

Polars has join_asof which joins to the closest key forward or backward in time.


from datetime import time

df_a = pl.DataFrame({
    "arrival_time": [time(8, 40, 10), time(19, 31, 26)],
    "train": [112, 134]
})

df_b = pl.DataFrame({
    "arrival_time": [time(8), time(16)],
    "period": ["early", "afternoon"]
    
})

print(df_a.join_asof(df_b, on="arrival_time"))

shape: (2, 3)
┌──────────────┬───────┬───────────┐
│ arrival_time ┆ train ┆ period    │
│ ---          ┆ ---   ┆ ---       │
│ time         ┆ i64   ┆ str       │
╞══════════════╪═══════╪═══════════╡
│ 08:40:10     ┆ 112   ┆ early     │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 19:31:26     ┆ 134   ┆ afternoon │
└──────────────┴───────┴───────────┘

score 2 · Answer 2 · answered Jun 22 '22 at 00:03

Buildling on @ritchie46 answer, if you need to find the closest value, you can apply join_asof in both the forward and backward direction, and choose the result that is closest.

For example, if we start with this data:

df_a = pl.DataFrame(
    {"arrival_time": [time(8, 40, 10), time(8, 59, 0), time(
        19, 31, 26)], "train": [112, 113, 134]}
)
df_b = pl.DataFrame(
    {
        "arrival_time": [
            time(8),
            time(8, 30, 0),
            time(9),
            time(16),
            time(16, 30, 0),
            time(17),
        ],
        "period": [
            "early morning",
            "mid-morning",
            "late morning",
            "early afternoon",
            "mid afternoon",
            "late afternoon",
        ],
    }
)
df_a
df_b

shape: (3, 2)
┌──────────────┬───────┐
│ arrival_time ┆ train │
│ ---          ┆ ---   │
│ time         ┆ i64   │
╞══════════════╪═══════╡
│ 08:40:10     ┆ 112   │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 08:59:00     ┆ 113   │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 19:31:26     ┆ 134   │
└──────────────┴───────┘
>>> df_b
shape: (6, 2)
┌──────────────┬─────────────────┐
│ arrival_time ┆ period          │
│ ---          ┆ ---             │
│ time         ┆ str             │
╞══════════════╪═════════════════╡
│ 08:00:00     ┆ early morning   │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 08:30:00     ┆ mid-morning     │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 09:00:00     ┆ late morning    │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 16:00:00     ┆ early afternoon │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 16:30:00     ┆ mid afternoon   │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 17:00:00     ┆ late afternoon  │
└──────────────┴─────────────────┘

We see that train 113 arrives closer to late morning than to mid-morning. If you need to capture that, you can join with both backward and forward, and then choose which is closest:

(
    df_a
    .join_asof(df_b.with_column(pl.col('arrival_time').alias('early_time')), on="arrival_time", strategy="backward", suffix="_early")
    .join_asof(df_b.with_column(pl.col('arrival_time').alias('late_time')), on="arrival_time", strategy="forward", suffix="_late")
    .with_column(
        pl.when(pl.col('early_time').is_null())
        .then(pl.col('period_late'))
        .when(pl.col('late_time').is_null())
        .then(pl.col('period'))
        .when((pl.col('arrival_time').cast(pl.Datetime) - pl.col('early_time').cast(pl.Datetime)) <
              (pl.col('late_time').cast(pl.Datetime) - pl.col('arrival_time').cast(pl.Datetime)))
        .then(pl.col('period'))
        .otherwise(pl.col('period_late'))
        .alias('closest')
    )
)

shape: (3, 7)
┌──────────────┬───────┬────────────────┬────────────┬──────────────┬───────────┬────────────────┐
│ arrival_time ┆ train ┆ period         ┆ early_time ┆ period_late  ┆ late_time ┆ closest        │
│ ---          ┆ ---   ┆ ---            ┆ ---        ┆ ---          ┆ ---       ┆ ---            │
│ time         ┆ i64   ┆ str            ┆ time       ┆ str          ┆ time      ┆ str            │
╞══════════════╪═══════╪════════════════╪════════════╪══════════════╪═══════════╪════════════════╡
│ 08:40:10     ┆ 112   ┆ mid-morning    ┆ 08:30:00   ┆ late morning ┆ 09:00:00  ┆ mid-morning    │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 08:59:00     ┆ 113   ┆ mid-morning    ┆ 08:30:00   ┆ late morning ┆ 09:00:00  ┆ late morning   │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 19:31:26     ┆ 134   ┆ late afternoon ┆ 17:00:00   ┆ null         ┆ null      ┆ late afternoon │
└──────────────┴───────┴────────────────┴────────────┴──────────────┴───────────┴────────────────┘

Note that in the when/then/otherwise I've chosen to explicitly handle the case where a train arrives before the first period or after the last period.

That's interesting, thanks, just a thought , when it is btw the first and the last i get this: arrival_time period early_time p_late late_time closest 07:18:22 "mid night" 23:30:00 null null "mid night" 07:19:55 "mid night" 23:30:00 null null "mid night" 07:25:51 "mid night" 23:30:00 null null "mid night" is there a way to approach this? thank youy anyway for the answer, love polars — luis.martinez.pro, Jun 22 '22 at 09:27
I cannot seem to replicate your results.. Are your DataFrames sorted by arrival_time? From the documentation: "Both DataFrames must be sorted by the asof_join key." If they are not sorted, you can get strange results. — , Jun 22 '22 at 16:11

Compare 2 tables in Polars and select a value based on that comparison

2 Answers2