
I have a data frame of customer digital visits over time, in the form:

|cust_id|datetime|
|---|---|
|1|2020-08-15 15:20|
|1|2020-08-15 16:20|
|1|2020-08-17 12:20|
|1|2020-08-19 14:20|
|1|2020-08-23 09:20|
|2|2020-08-24 08:00|
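
For reproducibility, the sample can be built as a PySpark DataFrame like this (a sketch, assuming `datetime` is stored as a string column):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# sample data from the table above; datetime kept as a plain string
df = spark.createDataFrame(
    [(1, "2020-08-15 15:20"), (1, "2020-08-15 16:20"), (1, "2020-08-17 12:20"),
     (1, "2020-08-19 14:20"), (1, "2020-08-23 09:20"), (2, "2020-08-24 08:00")],
    ["cust_id", "datetime"],
)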

I'd like to pick out strong signals, as in: customers who visit at least 3 times within 5 days.

My initial thought is that we have to compute ALL sliding windows for each customer.

In this example, let's take cust1:

  • 5-day window starting 2020-08-15, ending 2020-08-19: total visits is 4

  • 5-day window starting 2020-08-16, ending 2020-08-20: total visits is 2

  • 5-day window starting 2020-08-17, ending 2020-08-21: total visits is 2

etc.

The max count over all sliding windows is 4, so cust1 fits the criterion "visited at least 3 times in 5 days".
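
A brute-force sketch of this idea in plain Python, for cust1's dates from the table above, just to make the counting explicit (not meant as the final implementation):

from datetime import date, timedelta

# cust1's visit dates (time of day ignored)
visits = [date(2020, 8, 15), date(2020, 8, 15), date(2020, 8, 17),
          date(2020, 8, 19), date(2020, 8, 23)]

max_count = 0
start = min(visits)
# slide a 5-day window forward one day at a time and count visits inside it
while start <= max(visits):
    end = start + timedelta(days=4)  # 5 calendar days, inclusive
    count = sum(start <= v <= end for v in visits)
    max_count = max(max_count, count)
    start += timedelta(days=1)

print(max_count)       # 4
print(max_count >= 3)  # True -> cust1 fits the criterion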

This seems to be a costly operation.

How would you implement this efficiently? Any other ideas are welcome.

Kenny
  • how often should this be done? if it has to be done on a regular basis, you can aggregate and store the previous 5-day visits data every day. – Vamsi Prabhala Oct 09 '20 at 20:37
  • Both long and short duration basis. For historical insights, could be over a long time frame (and large customer base) to estimate the volume of each signal, let's say >=3 visits in the any 5 days, >= x visits over the any 3 weeks, 3-consecutive-day visits, etc. Then on production run, will be a shorter time frame, as in, today looking only at last 5 days, 3 weeks, etc. There's a caveat to analyze both "ANY 5 days window" for insights, and "LAST 5 days" for production. – Kenny Oct 09 '20 at 20:55
  • How do you want your output data to be? What fields do you want to display? – Cena Oct 09 '20 at 21:06

1 Answer


You can convert the datetime column to long and pass the number of seconds equivalent to 5 days to the rangeBetween() function.

from pyspark.sql.functions import *
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# strip the time part, keep only the date, then convert it to epoch seconds (long)
df = df.withColumn("date_long", to_date(substring(col("datetime"), 0, 10), "yyyy-MM-dd"))\
        .withColumn("date_long", unix_timestamp('date_long', 'yyyy-MM-dd'))

days = lambda i: i * 86400  # number of seconds in i days
# range window per customer: the current row's date plus days(5) seconds ahead
w = (Window.partitionBy('cust_id').orderBy("date_long").rangeBetween(0, days(5)))

df.withColumn('5_day_visit', F.count("*").over(w)).drop('date_long').show()
+-------+----------------+-----------+                                          
|cust_id|        datetime|5_day_visit|
+-------+----------------+-----------+
|      1|2020-08-15 15:20|          4|
|      1|2020-08-15 16:20|          4|
|      1|2020-08-17 12:20|          2|
|      1|2020-08-19 14:20|          2|
|      1|2020-08-23 09:20|          1|
|      2|2020-08-24 08:00|          1|
+-------+----------------+-----------+

To get the maximum number of 5-day visits for each customer, you can do:

df.withColumn('5_day_visit', F.count("*").over(w)).drop('date_long')\
    .groupBy('cust_id').agg(F.max('5_day_visit').alias('max_5_day_visits')).show()
+-------+----------------+                                                      
|cust_id|max_5_day_visits|
+-------+----------------+
|      1|               4|
|      2|               1|
+-------+----------------+
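
If the end goal is only the customers who meet the original ">= 3 visits in any 5 days" criterion, one possible follow-up (a sketch reusing `df` and the window `w` defined above, not part of the original answer) is to filter on that aggregate:

strong_signals = (df.withColumn('5_day_visit', F.count("*").over(w))
                    .groupBy('cust_id')
                    .agg(F.max('5_day_visit').alias('max_5_day_visits'))
                    .filter(F.col('max_5_day_visits') >= 3))
strong_signals.show()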
Cena
  • Great answer @Cena. 1. Seems like the purpose of your **days** lambda is to have second-precision "exactly 24h after this moment". In that case, how did the second row also get 4 count ? 2. Any reason for second-precision ? If you'd like to look at day only, how would you do it. Please keep this second-precision here, for educational purpose. – Kenny Oct 09 '20 at 21:24
  • 1
    I'm not considering the time of visit, just the day. That's what you want, right? The `to_date(substring())` in the beginning strips off the timestamp and converts it to a `yyyy-MM-dd` format. I then convert this date to long. So, the `date_long` column for the first 2 visits of `cust_id=1` will be the same – Cena Oct 09 '20 at 21:31
  • So it is not an "exactly 24h after this moment" kind of precision. We can get that kind of precision if we don't strip off the time from `datetime` column. But in your case, its appropriate to remove the time, I think. – Cena Oct 09 '20 at 21:35
  • Indeed date-only is what I want. Must we convert into long ? Is there any direct way to look at it as timeseries date window, in the form **rangeBetween(0,5)** ? – Kenny Oct 09 '20 at 21:36
  • something like `rangeBetween(0, "5 days")` cannot be used in `rangeBetween`. I'm converting it to `long` so that I can pass numerical data type as the boundary. You can read more about [rangeBetween() in this link](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.Window.rangeBetween) – Cena Oct 09 '20 at 22:22
  • Just for my knowledge : How would we adapt the answer to "all 5 days window, not just dates that exist in entries (15,17,19, 23-Aug)". I am thinking of a dictionary of all dates, put visit count for each dates then do a rolling window. `cust_id| date |visit_count, 1| 15-Aug | 2, 1|16-Aug | 0, 1|17-Aug|1, 1|16-Aug | 0`, etc. Result after rolling : `cust_id | date | 5_days_visit , 1|15_Aug | 4, 1|16-Aug|2, 1|17-Aug|2`, etc. – Kenny Oct 12 '20 at 16:15
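
A possible sketch of the densification idea from the last comment (the names `per_day`, `all_days`, `filled`, `w_days` are made up here; assumes Spark 2.4+ for `F.sequence`): count visits per customer per calendar date, fill in the missing dates with zero, then take a rolling sum over 5 calendar days.

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# 1) visits per customer per calendar date (time of day dropped)
per_day = (df.withColumn('date', F.to_date(F.col('datetime').substr(1, 10)))
             .groupBy('cust_id', 'date')
             .agg(F.count('*').alias('visit_count')))

# 2) generate every date between each customer's first and last visit, left-join the counts
bounds = per_day.groupBy('cust_id').agg(F.min('date').alias('min_d'), F.max('date').alias('max_d'))
all_days = (bounds.withColumn('date', F.explode(F.sequence('min_d', 'max_d')))
                  .select('cust_id', 'date'))
filled = (all_days.join(per_day, ['cust_id', 'date'], 'left')
                  .fillna(0, subset=['visit_count']))

# 3) rolling sum over the current date plus the next 4 days
#    (exactly 5 calendar days, as in the question's 15-Aug..19-Aug example)
w_days = (Window.partitionBy('cust_id')
                .orderBy(F.datediff('date', F.to_date(F.lit('1970-01-01'))))
                .rangeBetween(0, 4))
filled.withColumn('5_day_visits', F.sum('visit_count').over(w_days)).show()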