Filter polars dataframe using given time intervals

Question

Say I have two dataframes:

discount_dates

product	discount_start_date	discount_end_date
Iphone	2012-12-31	2013-03-03
Iphone	2017-01-31	2019-12-03
Macbook	2013-02-22	2013-03-03
Iwatch	2012-12-31	2013-03-03
Iwatch	2013-12-31	2014-03-03
Iwatch	2014-12-31	2015-03-03

data_product

product	date	n_sales
Iphone	2013-03-01	1432
Iphone	2013-03-02	2314
Iphone	2013-03-03	3200
Iphone	2013-03-04	123
Iphone	2017-02-04	2230
Iwatch	2013-04-01	102
Iwatch	2013-04-02	12
Iwatch	2013-04-03	234

I want to build a third dataframe from data_product that keeps only the rows whose date is a discount date, i.e. falls between discount_start_date and discount_end_date (inclusive) of one of that product's rows in discount_dates.

result

product	date	n_sales
Iphone	2013-03-01	1432
Iphone	2013-03-02	2314
Iphone	2013-03-03	3200
Iphone	2017-02-04	2230

My approach in polars was the following:

import polars as pl

discount_dates = {
    "product": ["Iphone", "Iphone", "Macbook", "Iwatch", "Iwatch", "Iwatch"],
    "discount_start_date": ["2012-12-31", "2017-01-31", "2013-02-22", "2012-12-31", "2013-12-31", "2014-12-31"],
    "discount_end_date": ["2013-03-03", "2019-12-03", "2013-03-03", "2013-03-03", "2014-03-03", "2015-03-03"],
}

discount_dates = pl.DataFrame(discount_dates)

data_product = {
    "product": ["Iphone", "Iphone", "Iphone", "Iphone", "Iphone", "Iwatch", "Iwatch", "Iwatch"],
    "date": ["2013-03-01", "2013-03-02", "2013-03-03", "2013-03-04", "2017-02-04", "2013-04-01", "2013-04-02", "2013-04-03"],
    "n_sales": [1432, 2314, 3200, 123, 2230, 102, 12, 234],
}

data_product = pl.DataFrame(data_product)

discount_dates = discount_dates.groupby("product").agg(pl.col("discount_start_date").min(), pl.col("discount_end_date").max())
data_product = data_product.join(discount_dates, on="product")
promo_product = data_product.filter((pl.col("date").is_between(*["discount_start_date","discount_end_date"])))

Unfortunately, this ignores the gaps between the discount intervals, so in the example above promo_product ends up containing the same rows as data_product. Is there a smart way to tackle this in polars?

score 1 · Accepted Answer · answered Jul 24 '23 at 16:44

Your approach is right; you just don't need the groupby at all. Collapsing each product to a single min/max window is what hides the gaps. Instead, join every discount window to the sales rows and then filter.

I got the desired DataFrame from the code above simply by commenting out the discount_dates = ... groupby reassignment (after the filter you can select just the three original columns: product, date and n_sales).

Wayoshi

Wow! it was that simple actually. Thanks! – Roberto Landi Jul 25 '23 at 07:26

score 0 · Answer 2 · answered Jul 24 '23 at 18:09

You can try the following (pl_df1 holds the discount dates and pl_df2 holds the product data):


# convert columns to datetime (if necessary):
pl_df1 = pl_df1.with_columns(
    pl.col("discount_start_date").str.to_datetime("%Y-%m-%d"),
    pl.col("discount_end_date").str.to_datetime("%Y-%m-%d"),
)

pl_df2 = pl_df2.with_columns(pl.col("date").str.to_datetime("%Y-%m-%d"))


# build a dataframe with one row per product, holding a list of date ranges
# (one range of consecutive days per discount interval)
discount_days_agg = pl_df1.groupby("product").agg(
    pl.apply(
        ["discount_start_date", "discount_end_date"],
        lambda x: pl.Series(pl.date_range(s, e, eager=True) for s, e in zip(*x)),
    )
)

# keep only the rows of pl_df2 whose date appears in one of the
# date ranges stored for that product in `discount_days_agg`
filtered_df2 = pl_df2.filter(
        pl.struct(["product", "date", "n_sales"]).apply(
            lambda x: any(
                x["date"] in dr
                for dr in discount_days_agg.row(
                    by_predicate=(pl.col("product") == x["product"])
                )[1]
            )
        )
)

print(filtered_df2)

Prints:

shape: (4, 3)
┌─────────┬─────────────────────┬─────────┐
│ product ┆ date                ┆ n_sales │
│ ---     ┆ ---                 ┆ ---     │
│ str     ┆ datetime[μs]        ┆ i64     │
╞═════════╪═════════════════════╪═════════╡
│ Iphone  ┆ 2013-03-01 00:00:00 ┆ 1432    │
│ Iphone  ┆ 2013-03-02 00:00:00 ┆ 2314    │
│ Iphone  ┆ 2013-03-03 00:00:00 ┆ 3200    │
│ Iphone  ┆ 2017-02-04 00:00:00 ┆ 2230    │
└─────────┴─────────────────────┴─────────┘
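
For comparison, the same inclusive interval check can be written in plain Python, which may help as a sanity check on small inputs. This is only an illustrative sketch: the helper name in_any_window and the tuple/dictionary layout are my own and are not part of polars or of either answer.

```python
from collections import defaultdict
from datetime import date

# (product, discount_start_date, discount_end_date), copied from the question
windows = [
    ("Iphone", date(2012, 12, 31), date(2013, 3, 3)),
    ("Iphone", date(2017, 1, 31), date(2019, 12, 3)),
    ("Macbook", date(2013, 2, 22), date(2013, 3, 3)),
    ("Iwatch", date(2012, 12, 31), date(2013, 3, 3)),
    ("Iwatch", date(2013, 12, 31), date(2014, 3, 3)),
    ("Iwatch", date(2014, 12, 31), date(2015, 3, 3)),
]

# (product, date, n_sales), copied from the question
sales = [
    ("Iphone", date(2013, 3, 1), 1432),
    ("Iphone", date(2013, 3, 2), 2314),
    ("Iphone", date(2013, 3, 3), 3200),
    ("Iphone", date(2013, 3, 4), 123),
    ("Iphone", date(2017, 2, 4), 2230),
    ("Iwatch", date(2013, 4, 1), 102),
    ("Iwatch", date(2013, 4, 2), 12),
    ("Iwatch", date(2013, 4, 3), 234),
]

# Group the discount windows by product, keeping every window separate
# so that the gaps between them are preserved.
windows_by_product = defaultdict(list)
for product, start, end in windows:
    windows_by_product[product].append((start, end))


def in_any_window(product, day):
    """Hypothetical helper: True if `day` lies in at least one of the
    product's discount windows (both bounds inclusive)."""
    return any(
        start <= day <= end
        for start, end in windows_by_product.get(product, [])
    )


filtered = [row for row in sales if in_any_window(row[0], row[1])]

for row in filtered:
    print(row)
```

Unlike expanding each interval into a list of days, this only compares against the interval bounds, so the cost does not grow with the length of a discount period.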
