Is there a way to reference another Polars Dataframe in Polars expressions without using lambdas?

Just to use a simple example - suppose I have two dataframes:

import polars as pl
from datetime import date

df_1 = pl.DataFrame(
    {
        "time": pl.date_range(
            low=date(2021, 1, 1),
            high=date(2022, 1, 1),
            interval="1d",
        ),
        "x": pl.arange(0, 366, eager=True),
    }
)

df_2 = pl.DataFrame(
    {
        "time": pl.date_range(
            low=date(2021, 1, 1),
            high=date(2021, 2, 1),
            interval="1mo",
        ),
        "y": [50, 100],
    }
)

For each y value in df_2, I would like to find the maximum date in df_1, conditional on the x value being lower than the y value.

I am able to perform this using apply/lambda (see below), but is there a more idiomatic way of performing this operation?

df_2.groupby("y").agg(
    pl.col("y").apply(lambda s: df_1.filter(pl.col("x") < s).select(pl.col("time")).max()[0,0]).alias('latest')
)

Edit:

Is it possible to pre-filter df_1 prior to using join_asof? Switching the question to look for the min instead of the max, this is what I would do for an individual case:

(
    df_2
    .filter(pl.col('y') == 50)
    .join_asof(
        df_1
        .sort("x")
        .filter(pl.col('time') > date(2021, 11, 1))
        .select([
            pl.col("time").cummin().alias("time_min"),
            pl.col("x").alias("original_x"),
            (pl.col("x") + 1).alias("x"),
        ]),
        left_on="y",
        right_on="x",
        strategy="forward",
    )
)

Is there a way to generalise this merge without using a loop / apply function?

1 Answer

Edit: Generalizing a join

One somewhat-dangerous approach to generalizing a join (so that you can run any sub-queries and filters that you like) is to use a "cross" join.

I say "somewhat-dangerous" because the number of row combinations considered in a cross join is M x N, where M and N are the number of rows in your two DataFrames. So if your two DataFrames are 1 million rows each, you have (1 million x 1 million) row combinations that are being considered. This process can exhaust your RAM or simply take a long time.

If you'd like to try it, here's how it would work (along with some arbitrary filters that I constructed, just to show the ultimate flexibility that a cross-join creates).

(
    df_2.lazy()
    .join(
        df_1.lazy(),
        how="cross"
    )
    .filter(pl.col('time_right') >= pl.col('time'))
    .groupby('y')
    .agg([
        pl.col('time').first(),

        pl.col('x')
        .filter(pl.col('y') > pl.col('x'))
        .max()
        .alias('max(x) for(y>x)'),

        pl.col('time_right')
        .filter(pl.col('y') > pl.col('x'))
        .max()
        .alias('max(time_right) for(y>x)'),

        pl.col('time_right')
        .filter(pl.col('y') <= pl.col('x'))
        .filter(pl.col('time_right') > pl.col('time'))
        .min()
        .alias('min(time_right) for(two filters)'),
    ])
    .collect()
)
shape: (2, 5)
┌─────┬────────────┬─────────────────┬──────────────────────────┬──────────────────────────────────┐
│ y   ┆ time       ┆ max(x) for(y>x) ┆ max(time_right) for(y>x) ┆ min(time_right) for(two filters) │
│ --- ┆ ---        ┆ ---             ┆ ---                      ┆ ---                              │
│ i64 ┆ date       ┆ i64             ┆ date                     ┆ date                             │
╞═════╪════════════╪═════════════════╪══════════════════════════╪══════════════════════════════════╡
│ 100 ┆ 2021-02-01 ┆ 99              ┆ 2021-04-10               ┆ 2021-04-11                       │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 50  ┆ 2021-01-01 ┆ 49              ┆ 2021-02-19               ┆ 2021-02-20                       │
└─────┴────────────┴─────────────────┴──────────────────────────┴──────────────────────────────────┘

A couple of suggestions:

  • I strongly recommend running the cross-join in Lazy mode.
  • Try to filter directly after the cross-join, to eliminate row combinations that you will never need. This reduces the burden on the later groupby step.

Given the explosive potential of row combinations for cross-joins, I tried to steer you toward a join_asof (which did solve the original sample question). But if you need the flexibility beyond what a join_asof can provide, the cross-join will provide ultimate flexibility -- at a cost.

join_asof

We can use a join_asof to accomplish this, with two wrinkles.

The Algorithm

(
    df_2
    .sort("y")
    .join_asof(
        (
            df_1
            .sort("x")
            .select([
                pl.col("time").cummax().alias("time_max"),
                (pl.col("x") + 1),
            ])
        ),
        left_on="y",
        right_on="x",
        strategy="backward",
    )
    .drop(['x'])
)
shape: (2, 3)
┌────────────┬─────┬────────────┐
│ time       ┆ y   ┆ time_max   │
│ ---        ┆ --- ┆ ---        │
│ date       ┆ i64 ┆ date       │
╞════════════╪═════╪════════════╡
│ 2021-01-01 ┆ 50  ┆ 2021-02-19 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2021-02-01 ┆ 100 ┆ 2021-04-10 │
└────────────┴─────┴────────────┘

This matches the output of your code.

In steps

Let's add some extra information to our query, to elucidate how it works.

(
    df_2
    .sort("y")
    .join_asof(
        (
            df_1
            .sort("x")
            .select([
                pl.col("time").cummax().alias("time_max"),
                pl.col("x").alias("original_x"),
                (pl.col("x") + 1).alias("x"),
            ])
        ),
        left_on="y",
        right_on="x",
        strategy="backward",
    )
)
shape: (2, 5)
┌────────────┬─────┬────────────┬────────────┬─────┐
│ time       ┆ y   ┆ time_max   ┆ original_x ┆ x   │
│ ---        ┆ --- ┆ ---        ┆ ---        ┆ --- │
│ date       ┆ i64 ┆ date       ┆ i64        ┆ i64 │
╞════════════╪═════╪════════════╪════════════╪═════╡
│ 2021-01-01 ┆ 50  ┆ 2021-02-19 ┆ 49         ┆ 50  │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 2021-02-01 ┆ 100 ┆ 2021-04-10 ┆ 99         ┆ 100 │
└────────────┴─────┴────────────┴────────────┴─────┘

Getting the maximum date

Instead of attempting a "non-equi" join or sub-queries to obtain the maximum date for x or any lesser value of x, we can use a simpler approach: sort df_1 by x and calculate the cumulative maximum date for each x. That way, when we join, we can join to a single row in df_1 and be certain that for any x, we are getting the maximum date for that x and all lesser values of x. The cumulative maximum is displayed above as time_max.
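
To see why the running maximum gives the right answer even when the dates are not ordered by x, here is a tiny sketch with made-up values (the toy frame below is purely illustrative and is not part of the question's data):

import polars as pl
from datetime import date

# Toy frame where the largest date does NOT sit at the largest x.
toy = pl.DataFrame(
    {
        "x": [1, 2, 3],
        "time": [date(2021, 3, 1), date(2021, 1, 1), date(2021, 2, 1)],
    }
)

# After sorting by x, cummax carries the best date seen so far forward,
# so the row for x == 3 still reports 2021-03-01 (which occurred at x == 1).
print(
    toy.sort("x").select([
        pl.col("x"),
        pl.col("time").cummax().alias("time_max"),
    ])
)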

less-than (and not less-than-or-equal-to)

From the documentation for join_asof:

A “backward” search selects the last row in the right DataFrame whose ‘on’ key is less than or equal to the left’s key.

Since you want "less than" and not "less than or equal to", we can simply increase each value of x by 1. Since x and y are integers, this will work. The result above displays both the original value of x (original_x), and the adjusted value (x) used in the join_asof.

If x and y are floats, you can add an arbitrarily small amount to x (e.g., x + 0.000000001) to force the non-equality.
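
A minimal sketch of that float variant, assuming x and y were float columns (the frames and the epsilon value below are hypothetical; the epsilon must be smaller than any real gap between your x and y values):

import polars as pl
from datetime import date

# Hypothetical float-keyed frames, only to illustrate the epsilon trick.
df_f1 = pl.DataFrame(
    {
        "x": [0.5, 1.5, 2.5],
        "time": [date(2021, 1, 1), date(2021, 1, 2), date(2021, 1, 3)],
    }
)
df_f2 = pl.DataFrame({"y": [1.5, 2.5]})

epsilon = 1e-9

(
    df_f2
    .sort("y")
    .join_asof(
        df_f1
        .sort("x")
        .select([
            pl.col("time").cummax().alias("time_max"),
            (pl.col("x") + epsilon).alias("x"),  # forces strict "less than"
        ]),
        left_on="y",
        right_on="x",
        strategy="backward",
    )
)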

  • I should mention: the above algorithm assumes that the values of `x` in `df_1` are unique. If not, then we have to take an additional step before the `cummax` step: calculating the maximum value of `time` for each `x`, then running `cummax` (see the sketch after these comments). –  Aug 04 '22 at 15:48
  • I should also add: I’m assuming there are potential gaps in the `x` values, so that there may not be an exact match for (`y`-1) in `df_1`. If you’re guaranteed that all integer values of `x` are present, the `join_asof` can be reduced to a simple `join`. –  Aug 04 '22 at 16:15
  • Thanks. Is it possible to add in a second condition that the time in `df_1` must be greater than the time in `df_2`? I.e. if `df_2 = pl.DataFrame( { "time": pl.date_range( low=date(2021, 11, 1), high=date(2021, 12, 1), interval="1mo", ), "y": [50, 100], } )`, then performing the operation above would give `time_max` < `time` in the code above. The only way I can think of resolving this would be via a for loop, so I would be interested in whether there is an alternative? – Scout Aug 04 '22 at 16:20
  • Not sure I understand. In your example in the comment above, for `y` == 50, the maximum value of `time` in `df_1` (for `x` < `y`) is `2021-02-19`, which is less than `time` in `df_2` (`2021-11-01`). So what should happen then? What value are you looking for in that case? –  Aug 04 '22 at 18:12
  • Sorry, I just realised I have been unclear; I will edit the post to clarify. Thanks. – Scout Aug 04 '22 at 19:16
  • I've updated my answer to demonstrate using a "cross join". I steer clear of them when I can - but if you need them, they are available. –  Aug 04 '22 at 20:44
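
For completeness, here is a sketch of the additional step mentioned in the first comment above, assuming `df_1` could contain duplicate `x` values (the sample data in the question does not need this):

# Sketch only: collapse duplicate x values to the latest time per x,
# then build the running maximum and the shifted join key as before.
right = (
    df_1
    .groupby("x")
    .agg(pl.col("time").max())
    .sort("x")
    .select([
        pl.col("time").cummax().alias("time_max"),
        (pl.col("x") + 1).alias("x"),
    ])
)

# "right" can then be used as the right-hand frame of the join_asof shown earlier:
# df_2.sort("y").join_asof(right, left_on="y", right_on="x", strategy="backward")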