2

I have a problem I'm trying to solve but can't figure it out.

I have something similar to this table:

data = pl.DataFrame(
    {'id': [1,1,1,1,2,2,2,3,3],
     'date': ['2023-01-01', '2023-02-01', '2023-03-01', '2023-05-01', '2023-02-01', '2023-03-01', '2023-03-01', '2023-01-01', '2023-02-01'],
     'sales': [10, 20, 30, 40, 50, 60.6, 70.2, 80.5, 90]
     }
)

which gives this dataframe:

shape: (9, 3)
┌─────┬────────────┬───────┐
│ id  ┆ date       ┆ sales │
│ --- ┆ ---        ┆ ---   │
│ i64 ┆ str        ┆ f64   │
╞═════╪════════════╪═══════╡
│ 1   ┆ 2023-01-01 ┆ 10.0  │
│ 1   ┆ 2023-02-01 ┆ 20.0  │
│ 1   ┆ 2023-03-01 ┆ 30.0  │
│ 1   ┆ 2023-05-01 ┆ 40.0  │
│ 2   ┆ 2023-02-01 ┆ 50.0  │
│ 2   ┆ 2023-03-01 ┆ 60.6  │
│ 2   ┆ 2023-04-01 ┆ 70.2  │
│ 3   ┆ 2023-01-01 ┆ 80.5  │
│ 3   ┆ 2023-02-01 ┆ 90.0  │
└─────┴────────────┴───────┘

as you can see, for each id i have dates and sales. I want to get for every id all dates from the minimum date in the data frame up to the maximum(included).

in addition, i want to fill in the sales column with 0 and the id column with the matching id, so it looks like this:

id date sales
1 2023-01-01 10
1 2023-02-01 20
1 2023-03-01 30
1 2023-04-01 0
1 2023-05-01 40
2 2023-01-01 0
2 2023-02-01 50
2 2023-03-01 60.6
2 2023-04-01 70.2
2 2023-05-01 0

so on and so forth.

i've tried to create a new dataframe by using the pl.date_range function and then to join it against the main data, by using outer or cross, but to no avail, since it doesn't compute against each id.

maybe you have any ideas on how to go about it?

many thanks in advance for any input!

nam0_0
  • 59
  • 4

2 Answers2

2

You can select the unique ids along with the date range:

df = df.with_columns(pl.col("date").str.to_date("%Y-%d-%m"))

(df.select(
    pl.col("id").unique(),
    pl.date_range(
       pl.col("date").min(),
       pl.col("date").max()
    )
)
.explode("date")
.join(df, how="left", on=["id", "date"])
)
shape: (16, 3)
┌─────┬────────────┬───────┐
│ id  ┆ date       ┆ sales │
│ --- ┆ ---        ┆ ---   │
│ i64 ┆ date       ┆ f64   │
╞═════╪════════════╪═══════╡
│ 1   ┆ 2023-01-01 ┆ 10.0  │
│ 1   ┆ 2023-01-02 ┆ 20.0  │
│ 1   ┆ 2023-01-03 ┆ 30.0  │
│ 1   ┆ 2023-01-04 ┆ null  │
│ 1   ┆ 2023-01-05 ┆ 40.0  │
│ 2   ┆ 2023-01-01 ┆ null  │
│ 2   ┆ 2023-01-02 ┆ 50.0  │
│ 2   ┆ 2023-01-03 ┆ 60.6  │
│ 2   ┆ 2023-01-03 ┆ 70.2  │
│ 2   ┆ 2023-01-04 ┆ null  │
│ 2   ┆ 2023-01-05 ┆ null  │
│ 3   ┆ 2023-01-01 ┆ 80.5  │
│ 3   ┆ 2023-01-02 ┆ 90.0  │
│ 3   ┆ 2023-01-03 ┆ null  │
│ 3   ┆ 2023-01-04 ┆ null  │
│ 3   ┆ 2023-01-05 ┆ null  │
└─────┴────────────┴───────┘
jqurious
  • 9,953
  • 1
  • 4
  • 14
  • what an elegant solution! for some reason, with my use case, it duplicated a lot of rows, when joining the two frames. just had to deduplicate it. many thanks! – nam0_0 Jul 24 '23 at 10:45
1

Try:

def fn(x, r):
    return (
        pl.DataFrame({"id": x["id"][0], "date": r})
        .join(x, on="date", how="left")[["id", "date", "sales"]]
        .with_columns(pl.col("sales").fill_null(strategy="zero"))
    )


# convert "date" to datetime
data = data.with_columns(pl.col("date").str.to_datetime("%Y-%d-%m"))

# get min,  max date
mn, mx = data["date"].min(), data["date"].max()

# construct the range
r = pl.date_range(mn, mx, "1d", eager=True)

# group by "id" and fill the missing dates
data = data.groupby("id", maintain_order=True).apply(lambda x: fn(x, r))

with pl.Config(tbl_rows=-1):
    print(data)

Prints:

shape: (15, 3)
┌─────┬─────────────────────┬───────┐
│ id  ┆ date                ┆ sales │
│ --- ┆ ---                 ┆ ---   │
│ i64 ┆ datetime[μs]        ┆ f64   │
╞═════╪═════════════════════╪═══════╡
│ 1   ┆ 2023-01-01 00:00:00 ┆ 10.0  │
│ 1   ┆ 2023-01-02 00:00:00 ┆ 20.0  │
│ 1   ┆ 2023-01-03 00:00:00 ┆ 30.0  │
│ 1   ┆ 2023-01-04 00:00:00 ┆ 0.0   │
│ 1   ┆ 2023-01-05 00:00:00 ┆ 40.0  │
│ 2   ┆ 2023-01-01 00:00:00 ┆ 0.0   │
│ 2   ┆ 2023-01-02 00:00:00 ┆ 50.0  │
│ 2   ┆ 2023-01-03 00:00:00 ┆ 60.6  │
│ 2   ┆ 2023-01-04 00:00:00 ┆ 70.2  │
│ 2   ┆ 2023-01-05 00:00:00 ┆ 0.0   │
│ 3   ┆ 2023-01-01 00:00:00 ┆ 80.5  │
│ 3   ┆ 2023-01-02 00:00:00 ┆ 90.0  │
│ 3   ┆ 2023-01-03 00:00:00 ┆ 0.0   │
│ 3   ┆ 2023-01-04 00:00:00 ┆ 0.0   │
│ 3   ┆ 2023-01-05 00:00:00 ┆ 0.0   │
└─────┴─────────────────────┴───────┘
Andrej Kesely
  • 168,389
  • 15
  • 48
  • 91
  • Thanks!, that's a nifty solution! I've some hard time to wrap my head around the x['id'][0] expression. intuitively, I thought it's going to spit out `1`, since `print(data['id'][0])` prints `1`. but when used with apply, it works like magic. any idea what am i missing? – nam0_0 Jul 24 '23 at 11:37
  • @nam0_0 The parameter `x` in function `fn` is a group (polars dataframe), so for each group the function `fn` is called with new dataframe. The `x['id'][0]` simply gets the first value of `id` in that group-dataframe and constructs new dataframe with correct data_range. Then it joins the values into that new dataframe. – Andrej Kesely Jul 24 '23 at 11:39
  • 1
    ohh ok, got you! very creative! many thanks for the explanation! – nam0_0 Jul 24 '23 at 11:59