How can we generate a unique session id for click stream data by using Spark(Scala) dataframes with following two conditions?
- Session expires after 30 minutes of inactivity (Means no click stream data within 30 minutes)
- Session remains active for a total duration of 2 hours. After 2 hours, renew the session.
Input:
UserId | Click Time
-----------------------------
U1 | 2019-01-01T11:00:00Z
U1 | 2019-01-01T11:15:00Z
U1 | 2019-01-01T12:00:00Z
U1 | 2019-01-01T12:20:00Z
U1 | 2019-01-01T15:00:00Z
U2 | 2019-01-01T11:00:00Z
U2 | 2019-01-02T11:00:00Z
U2 | 2019-01-02T11:25:00Z
U2 | 2019-01-02T11:50:00Z
U2 | 2019-01-02T12:15:00Z
U2 | 2019-01-02T12:40:00Z
U2 | 2019-01-02T13:05:00Z
U2 | 2019-01-02T13:20:00Z
Expected Output
UserId | Click Time | SessionId
-----------------------------------------
U1 | 2019-01-01T11:00:00Z | Session1
U1 | 2019-01-01T11:15:00Z | Session1
U1 | 2019-01-01T12:00:00Z | Session2
U1 | 2019-01-01T12:20:00Z | Session2
U1 | 2019-01-01T15:00:00Z | Session3
U2 | 2019-01-01T11:00:00Z | Session4
U2 | 2019-01-02T11:00:00Z | Session5
U2 | 2019-01-02T11:25:00Z | Session5
U2 | 2019-01-02T11:50:00Z | Session5
U2 | 2019-01-02T12:15:00Z | Session5
U2 | 2019-01-02T12:40:00Z | Session5
U2 | 2019-01-02T13:05:00Z | Session6
U2 | 2019-01-02T13:20:00Z | Session6
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.col
val df = sc.parallelize(List(("U1","2019-01-01T11:00:00Z"),("U1","2019-01-01T11:15:00Z"),("U1","2019-01-01T12:00:00Z"),("U1","2019-01-01T12:20:00Z"),("U1","2019-01-01T15:00:00Z"),("U2","2019-01-01T11:00:00Z"),("U2","2019-01-02T11:00:00Z"),("U2","2019-01-02T11:25:00Z"),("U2","2019-01-02T11:50:00Z"),("U2","2019-01-02T12:15:00Z"),("U2","2019-01-02T12:40:00Z"),("U2","2019-01-02T13:05:00Z"),("U2","2019-01-02T13:20:00Z"))).toDF("UserId","time").withColumn("ClickTime", col("time").cast("timestamp")).drop(col("time"))
val windowSpec = Window.partitionBy("userid").orderBy("clicktime")
val lagWindow = lag(col("clicktime"), 1).over(windowSpec)
val df1 = df.select(col("userid"), col("clicktime"), lagWindow.alias("prevclicktime")).withColumn("timediff", (col("clicktime").cast("long") - col("prevclicktime").cast("long"))).na.fill(Map("timediff" -> 0)).drop(col("prevclicktime"))
df1.show(truncate = false)
I am unable to apply the 2nd condition as we need to accumulate the session duration and check accumulated value is less than 2 hours. If accumulated duration is greater than 2 hours then need to assign a new session.
Please help me on this.
Thanks in Advance...!