
Say I have two dataframes:

discount_dates

product  discount_start_date  discount_end_date
Iphone   2012-12-31           2013-03-03
Iphone   2017-01-31           2019-12-03
Macbook  2013-02-22           2013-03-03
Iwatch   2012-12-31           2013-03-03
Iwatch   2013-12-31           2014-03-03
Iwatch   2014-12-31           2015-03-03

data_product

product  date        n_sales
Iphone   2013-03-01  1432
Iphone   2013-03-02  2314
Iphone   2013-03-03  3200
Iphone   2013-03-04  123
Iphone   2017-02-04  2230
Iwatch   2013-04-01  102
Iwatch   2013-04-02  12
Iwatch   2013-04-03  234

I want to build a third dataframe from data_product that keeps only the rows whose date is a discount date, i.e. falls between discount_start_date and discount_end_date (inclusive) of some row in discount_dates for the same product.

result

product  date        n_sales
Iphone   2013-03-01  1432
Iphone   2013-03-02  2314
Iphone   2013-03-03  3200
Iphone   2017-02-04  2230

My approach in polars was the following:

import polars as pl

discount_dates = {
    "product": ["Iphone", "Iphone", "Macbook", "Iwatch", "Iwatch", "Iwatch"],
    "discount_start_date": ["2012-12-31", "2017-01-31", "2013-02-22", "2012-12-31", "2013-12-31", "2014-12-31"],
    "discount_end_date": ["2013-03-03", "2019-12-03", "2013-03-03", "2013-03-03", "2014-03-03", "2015-03-03"],
}

discount_dates = pl.DataFrame(discount_dates)

data_product = {
    "product": ["Iphone", "Iphone", "Iphone", "Iphone", "Iphone", "Iwatch", "Iwatch", "Iwatch"],
    "date": ["2013-03-01", "2013-03-02", "2013-03-03", "2013-03-04", "2017-02-04", "2013-04-01", "2013-04-02", "2013-04-03"],
    "n_sales": [1432, 2314, 3200, 123, 2230, 102, 12, 234],
}

data_product = pl.DataFrame(data_product)

discount_dates = discount_dates.groupby("product").agg(pl.col("discount_start_date").min(), pl.col("discount_end_date").max())
data_product = data_product.join(discount_dates, on="product")
promo_product = data_product.filter((pl.col("date").is_between(*["discount_start_date","discount_end_date"])))

Unfortunately, this ignores the gaps between the discount intervals: the groupby collapses each product's windows into a single range from the earliest start to the latest end, so in the example above promo_product ends up containing every row of data_product. Is there a smart way to tackle this in polars?

2 Answers


You have the right approach; you just don't need any groupby in this operation. You want to join against every possible discount window for the product, then filter.

I got the desired DataFrame from the code above simply by commenting out the discount_dates = ... groupby line. From there you can select just the three original columns.

Wayoshi

You can try the following (pl_df1 holds the discount dates and pl_df2 holds the product data):


# convert columns to datetime (if necessary):
pl_df1 = pl_df1.with_columns(
    pl.col("discount_start_date").str.to_datetime("%Y-%m-%d"),
    pl.col("discount_end_date").str.to_datetime("%Y-%m-%d"),
)

pl_df2 = pl_df2.with_columns(pl.col("date").str.to_datetime("%Y-%m-%d"))


# make new dataframe where we store all discount dates for each product in a list
discount_days_agg = pl_df1.groupby("product").agg(
    pl.apply(
        ["discount_start_date", "discount_end_date"],
        lambda x: pl.Series(pl.date_range(s, e, eager=True) for s, e in zip(*x)),
    )
)

# filter df2 with `discount_days_agg`
filtered_df2 = pl_df2.filter(
        pl.struct(["product", "date", "n_sales"]).apply(
            lambda x: any(
                x["date"] in dr
                for dr in discount_days_agg.row(
                    by_predicate=(pl.col("product") == x["product"])
                )[1]
            )
        )
)

print(filtered_df2)

Prints:

shape: (4, 3)
┌─────────┬─────────────────────┬─────────┐
│ product ┆ date                ┆ n_sales │
│ ---     ┆ ---                 ┆ ---     │
│ str     ┆ datetime[μs]        ┆ i64     │
╞═════════╪═════════════════════╪═════════╡
│ Iphone  ┆ 2013-03-01 00:00:00 ┆ 1432    │
│ Iphone  ┆ 2013-03-02 00:00:00 ┆ 2314    │
│ Iphone  ┆ 2013-03-03 00:00:00 ┆ 3200    │
│ Iphone  ┆ 2017-02-04 00:00:00 ┆ 2230    │
└─────────┴─────────────────────┴─────────┘
Andrej Kesely