Is it possible to create a Window function that can have multiple conditions in orderBy for rangeBetween or rowsBetween. Assume I have a data frame like below.
user_id timestamp date event
0040b5f0 2018-01-22 13:04:32 2018-01-22 1
0040b5f0 2018-01-22 13:04:35 2018-01-22 0
0040b5f0 2018-01-25 18:55:08 2018-01-25 1
0040b5f0 2018-01-25 18:56:17 2018-01-25 1
0040b5f0 2018-01-25 20:51:43 2018-01-25 1
0040b5f0 2018-01-31 07:48:43 2018-01-31 1
0040b5f0 2018-01-31 07:48:48 2018-01-31 0
0040b5f0 2018-02-02 09:40:58 2018-02-02 1
0040b5f0 2018-02-02 09:41:01 2018-02-02 0
0040b5f0 2018-02-05 14:03:27 2018-02-05 1
Per each row, I need the sum of event column values that have date no longer than 3 days. But I cannot take sum events that happened later on the same date. I can create a window function like:
days = lambda i: i * 86400
my_window = Window\
.partitionBy(["user_id"])\
.orderBy(F.col("date").cast("timestamp").cast("long"))\
.rangeBetween(-days(3), 0)
But this will include events that happened later on the same date. I need to create a window function that will act like (for a row with *):
user_id timestamp date event
0040b5f0 2018-01-22 13:04:32 2018-01-22 1----|==============|
0040b5f0 2018-01-22 13:04:35 2018-01-22 0 sum here all events
0040b5f0 2018-01-25 18:55:08 2018-01-25 1 only within 3 days
* 0040b5f0 2018-01-25 18:56:17 2018-01-25 1----| |
0040b5f0 2018-01-25 20:51:43 2018-01-25 1===================|
0040b5f0 2018-01-31 07:48:43 2018-01-31 1
0040b5f0 2018-01-31 07:48:48 2018-01-31 0
0040b5f0 2018-02-02 09:40:58 2018-02-02 1
0040b5f0 2018-02-02 09:41:01 2018-02-02 0
0040b5f0 2018-02-05 14:03:27 2018-02-05 1
I tried to create something like:
days = lambda i: i * 86400
my_window = Window\
.partitionBy(["user_id"])\
.orderBy(F.col("date").cast("timestamp").cast("long"))\
.rangeBetween(-days(3), Window.currentRow)\
.orderBy(F.col("t_stamp"))\
.rowsBetween(Window.unboundedPreceding, Window.currentRow)
But it only reflects the last orderBy.
The result table should look like:
user_id timestamp date event event_last_3d
0040b5f0 2018-01-22 13:04:32 2018-01-22 1 1
0040b5f0 2018-01-22 13:04:35 2018-01-22 0 1
0040b5f0 2018-01-25 18:55:08 2018-01-25 1 2
0040b5f0 2018-01-25 18:56:17 2018-01-25 1 3
0040b5f0 2018-01-25 20:51:43 2018-01-25 1 4
0040b5f0 2018-01-31 07:48:43 2018-01-31 1 1
0040b5f0 2018-01-31 07:48:48 2018-01-31 0 1
0040b5f0 2018-02-02 09:40:58 2018-02-02 1 2
0040b5f0 2018-02-02 09:41:01 2018-02-02 0 2
0040b5f0 2018-02-05 14:03:27 2018-02-05 1 2
I've been stuck on this one for some time, I would appreciate any advice on how to approach it.