1

I two dataframes: one for store operation dates and one for transactions.

The store hours dataframe is like so

┌───────────┬─────────────────────┬─────────────────────┐
│ vendorId  ┆ opening time        ┆ closing time        │
│ ---       ┆ ---                 ┆ ---                 │
│ str       ┆ datetime[μs]        ┆ datetime[μs]        │
╞═══════════╪═════════════════════╪═════════════════════╡
╞═══════════╪═════════════════════╪═════════════════════╡
│ 115       ┆ 2020-02-25 12:00:01 ┆ 2022-02-25 02:00:02 │
│ 1146      ┆ 2020-02-16 12:00:11 ┆ 2022-02-26 02:00:48 │
│ 18143     ┆ 2020-02-25 12:00:16 ┆ 2022-02-25 02:00:46 │
│ 19122     ┆ 2020-02-16 12:00:19 ┆ 2022-02-16 02:00:42 │
│ …         ┆ …                   ┆ …                   │
│ 7744      ┆ 2020-02-16 12:00:11 ┆ 2022-02-16 02:00:57 │
│ 09092     ┆ 2020-02-16 12:00:31 ┆ 2022-02-24 02:00:57 │
│ 5801      ┆ 2020-11-01 12:00:41 ┆ 2022-02-19 02:00:36 │
│ 2378      ┆ 2020-02-16 12:00:11 ┆ 2022-02-16 02:00:47 │
└───────────┴─────────────────────┴─────────────────────┘

The transactions df looks something like this:

┌────────┬───────┬─────────────────────┬───────────┬──────────┬───────────┐
│ vendor ┆ prod  ┆ saletime            ┆ value     ┆ count    ┆ quality   │
│ ---    ┆ ---   ┆ ---                 ┆ ---       ┆ ---      ┆ ---       │
│ str    ┆ str   ┆datetime             ┆ f64       ┆ i64      ┆ f64       │
╞════════╪═══════╪═════════════════════╪═══════════╪══════════╪═══════════╡
│ 7744   ┆ 12    ┆ 2020-02-25 12:10:01 ┆ 10.0      ┆ 6        ┆ 13.7      │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 7744   ┆ 13    ┆ 2020-02-25 12:15:01 ┆ 12.0      ┆ 7        ┆ 21.9      │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 7744   ┆ 11    ┆ 2023-02-25 12:17:01 ┆ 10.0      ┆ 4        ┆ 13.0      │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 2378   ┆ 20    ┆ 2020-02-16 12:18:01 ┆ 10.0      ┆ 2        ┆ 14.0      │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 2378   ┆ 20    ┆ 2022-02-25 12:20:01 ┆ 10.0      ┆ 2        ┆ 14.0      │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 115    ┆ 3     ┆ 2022-02-26 12:10:01 ┆ 12.0      ┆ 5        ┆ 19.0      │
└────────┴───────┴─────────────────────┴───────────┴──────────┴───────────┘

My goal is for each vendor, I want to return all the product transactions which had at least one transactions within the stores hours. For example, for vendor 7744 it would be the following.

┌────────┬───────┬─────────────────────┬───────────┬──────────┬───────────┐
│ vendor ┆ prod  ┆ saletime            ┆ value     ┆ count    ┆ quality   │
│ ---    ┆ ---   ┆ ---                 ┆ ---       ┆ ---      ┆ ---       │
│ str    ┆ str   ┆datetime             ┆ f64       ┆ i64      ┆ f64       │
╞════════╪═══════╪═════════════════════╪═══════════╪══════════╪═══════════╡
│ 7744   ┆ 12    ┆ 2020-02-25 12:10:01 ┆ 10.0      ┆ 6        ┆ 13.7      │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 7744   ┆ 13    ┆ 2020-02-25 12:15:01 ┆ 12.0      ┆ 7        ┆ 21.9      │
└────────┴───────┴─────────────────────┴───────────┴──────────┴───────────┘

These products were sold within the opening and closing of the store while product 11 occurred after the closing date.

On the other hand for vendor 2378 it should return:

┌────────┬───────┬─────────────────────┬───────────┬──────────┬───────────┐
│ vendor ┆ prod  ┆ saletime            ┆ value     ┆ count    ┆ quality   │
│ ---    ┆ ---   ┆ ---                 ┆ ---       ┆ ---      ┆ ---       │
│ str    ┆ str   ┆datetime             ┆ f64       ┆ i64      ┆ f64       │
╞════════╪═══════╪═════════════════════╪═══════════╪══════════╪═══════════╡
│ 2378   ┆ 20    ┆ 2020-02-16 12:18:01 ┆ 10.0      ┆ 2        ┆ 14.0      │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 2378   ┆ 20    ┆ 2022-02-25 12:20:01 ┆ 10.0      ┆ 2        ┆ 14.0      │
└────────┴───────┴─────────────────────┴───────────┴──────────┴───────────┘

Even though the second transaction occurred after the closing date, it is considered an ongoing transaction because part of the transaction occurred during the stores operation dates.

The expected final df is

┌────────┬───────┬─────────────────────┬───────────┬──────────┬───────────┐
│ vendor ┆ prod  ┆ saletime            ┆ value     ┆ count    ┆ quality   │
│ ---    ┆ ---   ┆ ---                 ┆ ---       ┆ ---      ┆ ---       │
│ str    ┆ str   ┆datetime             ┆ f64       ┆ i64      ┆ f64       │
╞════════╪═══════╪═════════════════════╪═══════════╪══════════╪═══════════╡
│ 7744   ┆ 12    ┆ 2020-02-25 12:10:01 ┆ 10.0      ┆ 6        ┆ 13.7      │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 7744   ┆ 13    ┆ 2020-02-25 12:15:01 ┆ 12.0      ┆ 7        ┆ 21.9      │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 2378   ┆ 20    ┆ 2020-02-16 12:18:01 ┆ 10.0      ┆ 2        ┆ 14.0      │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 2378   ┆ 20    ┆ 2022-02-25 12:20:01 ┆ 10.0      ┆ 2        ┆ 14.0      │
└────────┴───────┴─────────────────────┴───────────┴──────────┴───────────┘

Is there an efficient way to accomplish this? Any help is appreciated.

user99463
  • 25
  • 4
  • In your example for 2378 you say that the second transaction was after the closing date but `2020-02-25 12:20:01` is before `2022-02-16 02:00:47` – Dean MacGregor May 31 '23 at 17:56
  • @DeanMacGregor You're right. That was a mistake on my part, it should have been 2022. – user99463 Jun 01 '23 at 17:34

3 Answers3

1

You can do a left join to put every opening / closing hours of the vendor in the transactions df, then filter accordingly:

df_transactions.join(df_hours, left_on="vendor", right_on="vendorId").filter(
    pl.col("saletime").is_between(pl.col("opening time"), pl.col("closing time"))
)
shape: (4, 8)
┌────────┬──────┬─────────────────────┬───────┬───────┬─────────┬─────────────────────┬─────────────────────┐
│ vendor ┆ prod ┆ saletime            ┆ value ┆ count ┆ quality ┆ opening time        ┆ closing time        │
│ ---    ┆ ---  ┆ ---                 ┆ ---   ┆ ---   ┆ ---     ┆ ---                 ┆ ---                 │
│ str    ┆ str  ┆ datetime[μs]        ┆ f64   ┆ i64   ┆ f64     ┆ datetime[μs]        ┆ datetime[μs]        │
╞════════╪══════╪═════════════════════╪═══════╪═══════╪═════════╪═════════════════════╪═════════════════════╡
│ 7744   ┆ 12   ┆ 2020-02-25 12:10:01 ┆ 10.0  ┆ 6     ┆ 13.7    ┆ 2020-02-16 12:00:11 ┆ 2022-02-16 02:00:57 │
│ 7744   ┆ 13   ┆ 2020-02-25 12:15:01 ┆ 12.0  ┆ 7     ┆ 21.9    ┆ 2020-02-16 12:00:11 ┆ 2022-02-16 02:00:57 │
│ 2378   ┆ 20   ┆ 2020-02-16 12:18:01 ┆ 10.0  ┆ 2     ┆ 14.0    ┆ 2020-02-16 12:00:11 ┆ 2022-02-16 02:00:47 │
│ 2378   ┆ 20   ┆ 2020-02-25 12:20:01 ┆ 10.0  ┆ 2     ┆ 14.0    ┆ 2020-02-16 12:00:11 ┆ 2022-02-16 02:00:47 │
└────────┴──────┴─────────────────────┴───────┴───────┴─────────┴─────────────────────┴─────────────────────┘

(Can select out the opening/closing time columns from here if desired.)

Wayoshi
  • 1,688
  • 1
  • 7
0

In your question you say:

Even though the second transaction occurred after the closing date, it is considered an ongoing transaction because part of the transaction occurred during the stores operation dates.

so I think you meant for your transactions to look something more like this:

txns=pl.from_repr(
 """shape: (6, 6)
    ┌────────┬──────┬─────────────────────┬───────┬───────┬─────────┐
    │ vendor ┆ prod ┆ saletime            ┆ value ┆ count ┆ quality │
    │ ---    ┆ ---  ┆ ---                 ┆ ---   ┆ ---   ┆ ---     │
    │ str    ┆ str  ┆ datetime[μs]        ┆ f64   ┆ i64   ┆ f64     │
    ╞════════╪══════╪═════════════════════╪═══════╪═══════╪═════════╡
    │ 7744   ┆ 12   ┆ 2020-02-25 12:10:01 ┆ 10.0  ┆ 6     ┆ 13.7    │
    │ 7744   ┆ 13   ┆ 2020-02-25 12:15:01 ┆ 12.0  ┆ 7     ┆ 21.9    │
    │ 7744   ┆ 11   ┆ 2023-02-25 12:17:01 ┆ 10.0  ┆ 4     ┆ 13.0    │
    │ 2378   ┆ 20   ┆ 2020-02-16 12:18:01 ┆ 10.0  ┆ 2     ┆ 14.0    │
    │ 2378   ┆ 20   ┆ 2022-02-25 12:20:01 ┆ 10.0  ┆ 2     ┆ 14.0    │
    │ 115    ┆ 3    ┆ 2022-02-26 12:10:01 ┆ 12.0  ┆ 5     ┆ 19.0    │
    └────────┴──────┴─────────────────────┴───────┴───────┴─────────┘""")

where the second 2378 row has a transaction time after the closing time from the hours df.

hours=pl.from_repr(
     """┌───────────┬─────────────────────┬─────────────────────┐
        │ vendorId  ┆ opening time        ┆ closing time        │
        │ ---       ┆ ---                 ┆ ---                 │
        │ str       ┆ datetime[μs]        ┆ datetime[μs]        │
        ╞═══════════╪═════════════════════╪═════════════════════╡
        │ 115       ┆ 2020-02-25 12:00:01 ┆ 2022-02-25 02:00:02 │
        │ 1146      ┆ 2020-02-16 12:00:11 ┆ 2022-02-26 02:00:48 │
        │ 18143     ┆ 2020-02-25 12:00:16 ┆ 2022-02-25 02:00:46 │
        │ 19122     ┆ 2020-02-16 12:00:19 ┆ 2022-02-16 02:00:42 │
        │ …         ┆ …                   ┆ …                   │
        │ 7744      ┆ 2020-02-16 12:00:11 ┆ 2022-02-16 02:00:57 │
        │ 09092     ┆ 2020-02-16 12:00:31 ┆ 2022-02-24 02:00:57 │
        │ 5801      ┆ 2020-11-01 12:00:41 ┆ 2022-02-19 02:00:36 │
        │ 2378      ┆ 2020-02-16 12:00:11 ┆ 2022-02-16 02:00:47 │
        └───────────┴─────────────────────┴─────────────────────┘""")

With that, if you want to return all the transactions where at least one transaction/vendor pair are within the open/close time even if the particular row isn't then you would use a window function (ie. .over) in your filter.

(
    txns
        .join(hours,  left_on="vendor", right_on="vendorId")
        .filter(
            (
                pl.col('saletime')
                .is_between(pl.col("opening time"), pl.col("closing time"))
                .any()
            ).over(['vendor','prod']))
        .select(txns.columns)
)

shape: (4, 6)
┌────────┬──────┬─────────────────────┬───────┬───────┬─────────┐
│ vendor ┆ prod ┆ saletime            ┆ value ┆ count ┆ quality │
│ ---    ┆ ---  ┆ ---                 ┆ ---   ┆ ---   ┆ ---     │
│ str    ┆ str  ┆ datetime[μs]        ┆ f64   ┆ i64   ┆ f64     │
╞════════╪══════╪═════════════════════╪═══════╪═══════╪═════════╡
│ 7744   ┆ 12   ┆ 2020-02-25 12:10:01 ┆ 10.0  ┆ 6     ┆ 13.7    │
│ 7744   ┆ 13   ┆ 2020-02-25 12:15:01 ┆ 12.0  ┆ 7     ┆ 21.9    │
│ 2378   ┆ 20   ┆ 2020-02-16 12:18:01 ┆ 10.0  ┆ 2     ┆ 14.0    │
│ 2378   ┆ 20   ┆ 2022-02-25 12:20:01 ┆ 10.0  ┆ 2     ┆ 14.0    │
└────────┴──────┴─────────────────────┴───────┴───────┴─────────┘
Dean MacGregor
  • 11,847
  • 9
  • 34
  • 72
0

There is also .join_asof which is an optimized join for this type of task.

e.g. Python Polars join on column with greater or equal

df_transactions.join_asof(
   df_hours, 
   left_on="saletime", 
   right_on="opening time", 
   by_left="vendor", 
   by_right="vendorId",
).filter(
    pl.col("saletime")
      .is_between(pl.col("opening time"), pl.col("closing time"))
      .any()
      .over("vendor", "prod")
)
shape: (4, 8)
┌────────┬──────┬─────────────────────┬───────┬───────┬─────────┬─────────────────────┬─────────────────────┐
│ vendor ┆ prod ┆ saletime            ┆ value ┆ count ┆ quality ┆ opening time        ┆ closing time        │
│ ---    ┆ ---  ┆ ---                 ┆ ---   ┆ ---   ┆ ---     ┆ ---                 ┆ ---                 │
│ str    ┆ str  ┆ datetime[μs]        ┆ f64   ┆ i64   ┆ f64     ┆ datetime[μs]        ┆ datetime[μs]        │
╞════════╪══════╪═════════════════════╪═══════╪═══════╪═════════╪═════════════════════╪═════════════════════╡
│ 7744   ┆ 12   ┆ 2020-02-25 12:10:01 ┆ 10.0  ┆ 6     ┆ 13.7    ┆ 2020-02-16 12:00:11 ┆ 2022-02-16 02:00:57 │
│ 7744   ┆ 13   ┆ 2020-02-25 12:15:01 ┆ 12.0  ┆ 7     ┆ 21.9    ┆ 2020-02-16 12:00:11 ┆ 2022-02-16 02:00:57 │
│ 2378   ┆ 20   ┆ 2020-02-16 12:18:01 ┆ 10.0  ┆ 2     ┆ 14.0    ┆ 2020-02-16 12:00:11 ┆ 2022-02-16 02:00:47 │
│ 2378   ┆ 20   ┆ 2022-02-25 12:20:01 ┆ 10.0  ┆ 2     ┆ 14.0    ┆ 2020-02-16 12:00:11 ┆ 2022-02-16 02:00:47 │
└────────┴──────┴─────────────────────┴───────┴───────┴─────────┴─────────────────────┴─────────────────────┘
jqurious
  • 9,953
  • 1
  • 4
  • 14