Edit: Generalizing a join
One somewhat-dangerous approach to generalizing a join (so that you can run any sub-queries and filters that you like) is to use a "cross" join.

I say "somewhat-dangerous" because the number of row combinations considered in a cross join is M × N, where M and N are the number of rows in your two DataFrames. So if your two DataFrames are 1 million rows each, (1 million × 1 million) row combinations are considered. This can exhaust your RAM or simply take a long time.
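The snippets below operate on the `df_1` and `df_2` from your question. If you want to run them stand-alone, here is a hypothetical pair of DataFrames that is consistent with the outputs shown in this answer (a reconstruction for illustration, not the original data):

```python
from datetime import date

import polars as pl

# Hypothetical stand-ins for the question's DataFrames, reverse-engineered
# to be consistent with the results shown in this answer.
df_1 = pl.DataFrame({
    "time": [date(2021, 2, 19), date(2021, 2, 20),
             date(2021, 4, 10), date(2021, 4, 11)],
    "x": [49, 60, 99, 120],
})
df_2 = pl.DataFrame({
    "time": [date(2021, 1, 1), date(2021, 2, 1)],
    "y": [50, 100],
})
```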
If you'd like to try it, here's how it would work (along with some arbitrary filters that I constructed, just to show the ultimate flexibility that a cross join creates).
```python
(
    df_2.lazy()
    .join(
        df_1.lazy(),
        how="cross",  # every row of df_2 paired with every row of df_1
    )
    # Prune unneeded row combinations as early as possible.
    .filter(pl.col('time_right') >= pl.col('time'))
    .groupby('y')
    .agg([
        pl.col('time').first(),
        pl.col('x')
        .filter(pl.col('y') > pl.col('x'))
        .max()
        .alias('max(x) for(y>x)'),
        pl.col('time_right')
        .filter(pl.col('y') > pl.col('x'))
        .max()
        .alias('max(time_right) for(y>x)'),
        pl.col('time_right')
        .filter(pl.col('y') <= pl.col('x'))
        .filter(pl.col('time_right') > pl.col('time'))
        .min()
        .alias('min(time_right) for(two filters)'),
    ])
    .collect()
)
```
```
shape: (2, 5)
┌─────┬────────────┬─────────────────┬──────────────────────────┬──────────────────────────────────┐
│ y   ┆ time       ┆ max(x) for(y>x) ┆ max(time_right) for(y>x) ┆ min(time_right) for(two filters) │
│ --- ┆ ---        ┆ ---             ┆ ---                      ┆ ---                              │
│ i64 ┆ date       ┆ i64             ┆ date                     ┆ date                             │
╞═════╪════════════╪═════════════════╪══════════════════════════╪══════════════════════════════════╡
│ 100 ┆ 2021-02-01 ┆ 99              ┆ 2021-04-10               ┆ 2021-04-11                       │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 50  ┆ 2021-01-01 ┆ 49              ┆ 2021-02-19               ┆ 2021-02-20                       │
└─────┴────────────┴─────────────────┴──────────────────────────┴──────────────────────────────────┘
```
A couple of suggestions:

- I strongly recommend running the cross join in Lazy mode.
- Try to filter directly after the cross join, to eliminate row combinations that you will never need. This reduces the burden on the later `groupby` step.
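Before running a cross join at all, it can be worth computing the number of row combinations up front. A minimal check, using the hypothetical frames above:

```python
# M x N: the number of row combinations the cross join must consider.
n_combinations = df_1.height * df_2.height
print(f"The cross join will consider {n_combinations:,} row combinations")
```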
Given the explosive potential of row combinations for cross joins, I tried to steer you toward a `join_asof` (which did solve the original sample question). But if you need flexibility beyond what a `join_asof` can provide, the cross join will provide ultimate flexibility -- at a cost.
`join_asof`

We can use a `join_asof` to accomplish this, with two wrinkles.
The Algorithm
```python
(
    df_2
    .sort("y")  # join_asof requires sorted join keys
    .join_asof(
        (
            df_1
            .sort("x")
            .select([
                pl.col("time").cummax().alias("time_max"),  # running max date
                (pl.col("x") + 1),  # shift so "backward" (<=) acts like a strict <
            ])
        ),
        left_on="y",
        right_on="x",
        strategy="backward",
    )
    .drop(['x'])
)
```
```
shape: (2, 3)
┌────────────┬─────┬────────────┐
│ time       ┆ y   ┆ time_max   │
│ ---        ┆ --- ┆ ---        │
│ date       ┆ i64 ┆ date       │
╞════════════╪═════╪════════════╡
│ 2021-01-01 ┆ 50  ┆ 2021-02-19 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2021-02-01 ┆ 100 ┆ 2021-04-10 │
└────────────┴─────┴────────────┘
```
This matches the output of your code.
In steps
Let's add some extra information to our query, to elucidate how it works.
```python
(
    df_2
    .sort("y")
    .join_asof(
        (
            df_1
            .sort("x")
            .select([
                pl.col("time").cummax().alias("time_max"),
                pl.col("x").alias("original_x"),  # kept only for display
                (pl.col("x") + 1).alias("x"),
            ])
        ),
        left_on="y",
        right_on="x",
        strategy="backward",
    )
)
```
```
shape: (2, 5)
┌────────────┬─────┬────────────┬────────────┬─────┐
│ time       ┆ y   ┆ time_max   ┆ original_x ┆ x   │
│ ---        ┆ --- ┆ ---        ┆ ---        ┆ --- │
│ date       ┆ i64 ┆ date       ┆ i64        ┆ i64 │
╞════════════╪═════╪════════════╪════════════╪═════╡
│ 2021-01-01 ┆ 50  ┆ 2021-02-19 ┆ 49         ┆ 50  │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 2021-02-01 ┆ 100 ┆ 2021-04-10 ┆ 99         ┆ 100 │
└────────────┴─────┴────────────┴────────────┴─────┘
```
Getting the maximum date
Instead of attempting a "non-equi" join or sub-queries to obtain the maximum date for `x` or any lesser value of `x`, we can use a simpler approach: sort `df_1` by `x` and calculate the cumulative maximum date for each `x`. That way, when we join, we can join to a single row in `df_1` and be certain that for any `x`, we are getting the maximum date for that `x` and all lesser values of `x`. The cumulative maximum is displayed above as `time_max`.
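To see this piece in isolation, here is the right-hand side of the join on its own (using the hypothetical `df_1` from above):

```python
# After sorting by x, the cumulative maximum of time gives, for each x,
# the latest date at that x or at any lesser x.
print(
    df_1
    .sort("x")
    .select([
        pl.col("x"),
        pl.col("time").cummax().alias("time_max"),
    ])
)
```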
Less-than (and not less-than-or-equal-to)

From the documentation for `join_asof`:

> A “backward” search selects the last row in the right DataFrame whose ‘on’ key is less than or equal to the left’s key.

Since you want "less than" and not "less than or equal to", we can simply increase each value of `x` by 1. Since `x` and `y` are integers, this will work. The result above displays both the original value of `x` (`original_x`) and the adjusted value (`x`) used in the `join_asof`.
If `x` and `y` are floats, you can add an arbitrarily small amount to `x` (e.g., `x + 0.000000001`) to force the non-equality.
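One caveat (my addition, not part of the approach above): a fixed epsilon can silently disappear when `x` is large, because of floating-point rounding. A more robust alternative is to bump each value to the next representable float using Python's `math.nextafter` (available in Python 3.9+):

```python
import math

x = 1e12
print(x + 0.000000001 == x)         # True: the tiny epsilon is lost to rounding
print(math.nextafter(x, math.inf))  # the smallest float strictly greater than x
```

You would apply this element-wise to the `x` column before the `join_asof`, just as the integer version adds 1.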