0

You have two tables, with Table 1 containing low-frequency data and Table 2 containing high-frequency data.

You want to fill the corresponding low-frequency data into the high-frequency data table.

The requirement is to match the dates in the event_date column of Table 2 with the dates between the start_date and end_date in Table 1.

If a date falls within this range, the corresponding row from the low-frequency data should be inserted into the corresponding row of the high-frequency data, ensuring that the type in both tables aligns.

How should I use Polars to do this?

Table 1:

shape: (21_803_347, 7)
┌─────────┬────────────┬──────┬─────────┬─────────┬────────────┬────────────┐
│ type    ┆ year       ┆ qtr  ┆ rating1 ┆ rating2 ┆ start_date ┆ end_date   │
│ ---     ┆ ---        ┆ ---  ┆ ---     ┆ ---     ┆ ---        ┆ ---        │
│ str     ┆ i16        ┆ i8   ┆ f64     ┆ f64     ┆ date       ┆ date       │
╞═════════╪════════════╪══════╪═════════╪═════════╪════════════╪════════════╡
│ 000001  ┆ 1995       ┆ 4    ┆ null    ┆ 0.04    ┆ 1996-03-14 ┆ 1996-08-28 │
│ 000001  ┆ 1996       ┆ 4    ┆ null    ┆ 0.04    ┆ 1997-04-21 ┆ 1997-08-27 │
│ 000001  ┆ 1997       ┆ 2    ┆ null    ┆ 0.03    ┆ 1997-08-28 ┆ 1998-03-11 │
│ 000001  ┆ 1997       ┆ 4    ┆ null    ┆ 0.02    ┆ 1998-03-12 ┆ 1998-08-25 │
│ 000001  ┆ 1998       ┆ 2    ┆ null    ┆ 0.05    ┆ 1998-08-26 ┆ 1999-04-22 │
│ …       ┆ …          ┆ …    ┆ …       ┆ …       ┆ …          ┆ …          │
│ 600883  ┆ 1998       ┆ 2    ┆ null    ┆ null    ┆ 1998-08-14 ┆ 1999-03-22 │
│ 000792  ┆ 2000       ┆ 2    ┆ null    ┆ null    ┆ 2000-08-22 ┆ 2001-03-13 │
│ 000973  ┆ 2023       ┆ 1    ┆ null    ┆ null    ┆ 2023-04-29 ┆ 2023-08-26 │
│ 300019  ┆ 2009       ┆ 4    ┆ null    ┆ null    ┆ 2010-02-10 ┆ 2010-04-25 │
│ 600919  ┆ 2016       ┆ 4    ┆ null    ┆ null    ┆ 2017-03-20 ┆ 2017-04-26 │
└─────────┴────────────┴──────┴─────────┴─────────┴────────────┴────────────┘

Table 2:

shape: (14_061_391, 5)
┌─────────┬────────────┬─────────┬─────────┬─────────┐
│ type    ┆ event_date ┆ rating1 ┆ rating2 ┆ rating3 │
│ ---     ┆ ---        ┆ ---     ┆ ---     ┆ ---     │
│ str     ┆ date       ┆ f64     ┆ f64     ┆ f64     │
╞═════════╪════════════╪═════════╪═════════╪═════════╡
│ 000001  ┆ 1991-06-26 ┆ null    ┆ 0.008   ┆ null    │
│ 000001  ┆ 1991-06-27 ┆ null    ┆ 0.008   ┆ null    │
│ 000001  ┆ 1991-06-28 ┆ null    ┆ 0.008   ┆ null    │
│ 000001  ┆ 1991-07-01 ┆ null    ┆ 0.008   ┆ null    │
│ 000001  ┆ 1991-07-02 ┆ null    ┆ 0.008   ┆ null    │
│ …       ┆ …          ┆ …       ┆ …       ┆ …       │
│ 900957  ┆ 2023-04-24 ┆ 0.259   ┆ 0.035   ┆ 0.015   │
│ 900957  ┆ 2023-04-25 ┆ 0.234   ┆ 0.035   ┆ 0.015   │
│ 900957  ┆ 2023-04-26 ┆ 0.255   ┆ 0.035   ┆ 0.015   │
│ 900957  ┆ 2023-04-27 ┆ 0.238   ┆ 0.035   ┆ 0.015   │
│ 900957  ┆ 2023-04-28 ┆ 0.234   ┆ 0.035   ┆ 0.015   │
└─────────┴────────────┴─────────┴─────────┴─────────┘
  • 3
    If it is a `1-to-1` mapping of `event_date -> interval`, you can `.join_asof`: https://stackoverflow.com/a/73177515 else it's a non-equi/range/interval join which polars doesn't have as of yet. In that case you can add a cross-join: https://stackoverflow.com/a/75442073 – jqurious May 17 '23 at 17:15

1 Answers1

2

It is probably easiest to do a left join on just the type column, and then apply a filter for the date handling. To make this efficient, you can use lazy mode:

df1.lazy().join(df2.lazy(), on="type", how="left").filter(pl.col("event_date").is_between("start_date", "end_date")).collect()
jvz
  • 1,183
  • 6
  • 13