
How can we generate a unique session ID for clickstream data using Spark (Scala) DataFrames, subject to the following two conditions?

  1. A session expires after 30 minutes of inactivity (i.e., no clickstream events within 30 minutes).
  2. A session remains active for a total duration of at most 2 hours; after 2 hours, renew the session.

Input:

UserId |    Click Time
-----------------------------
  U1   | 2019-01-01T11:00:00Z
  U1   | 2019-01-01T11:15:00Z
  U1   | 2019-01-01T12:00:00Z
  U1   | 2019-01-01T12:20:00Z
  U1   | 2019-01-01T15:00:00Z
  U2   | 2019-01-01T11:00:00Z
  U2   | 2019-01-02T11:00:00Z
  U2   | 2019-01-02T11:25:00Z
  U2   | 2019-01-02T11:50:00Z
  U2   | 2019-01-02T12:15:00Z
  U2   | 2019-01-02T12:40:00Z
  U2   | 2019-01-02T13:05:00Z
  U2   | 2019-01-02T13:20:00Z

Expected Output

UserId |    Click Time        | SessionId
-----------------------------------------
  U1   | 2019-01-01T11:00:00Z | Session1
  U1   | 2019-01-01T11:15:00Z | Session1
  U1   | 2019-01-01T12:00:00Z | Session2
  U1   | 2019-01-01T12:20:00Z | Session2
  U1   | 2019-01-01T15:00:00Z | Session3
  U2   | 2019-01-01T11:00:00Z | Session4
  U2   | 2019-01-02T11:00:00Z | Session5
  U2   | 2019-01-02T11:25:00Z | Session5
  U2   | 2019-01-02T11:50:00Z | Session5
  U2   | 2019-01-02T12:15:00Z | Session5
  U2   | 2019-01-02T12:40:00Z | Session5
  U2   | 2019-01-02T13:05:00Z | Session6
  U2   | 2019-01-02T13:20:00Z | Session6
Here is what I have tried so far (it computes the time difference between consecutive clicks per user):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, lag}

val df = sc.parallelize(List(
    ("U1","2019-01-01T11:00:00Z"), ("U1","2019-01-01T11:15:00Z"), ("U1","2019-01-01T12:00:00Z"), ("U1","2019-01-01T12:20:00Z"),
    ("U1","2019-01-01T15:00:00Z"), ("U2","2019-01-01T11:00:00Z"), ("U2","2019-01-02T11:00:00Z"), ("U2","2019-01-02T11:25:00Z"),
    ("U2","2019-01-02T11:50:00Z"), ("U2","2019-01-02T12:15:00Z"), ("U2","2019-01-02T12:40:00Z"), ("U2","2019-01-02T13:05:00Z"),
    ("U2","2019-01-02T13:20:00Z"))).toDF("UserId", "time")
  .withColumn("ClickTime", col("time").cast("timestamp"))
  .drop(col("time"))
val windowSpec = Window.partitionBy("userid").orderBy("clicktime")
val lagWindow = lag(col("clicktime"), 1).over(windowSpec)
val df1 = df.select(col("userid"), col("clicktime"), lagWindow.alias("prevclicktime"))
  .withColumn("timediff", col("clicktime").cast("long") - col("prevclicktime").cast("long")) // seconds since previous click
  .na.fill(Map("timediff" -> 0))
  .drop(col("prevclicktime"))
df1.show(truncate = false)

I am unable to apply the 2nd condition, because it requires accumulating the session duration and checking whether the accumulated value is still under 2 hours; once the accumulated duration exceeds 2 hours, a new session needs to be assigned. Since the session's start time depends on the session assignment itself, I don't see how to express this with the window functions above.

Please help me with this.

Thanks in advance!

Nagendra

1 Answer

The idea is to sessionize in two passes: first apply the 30-minute inactivity rule (condition 1) using lag and a running sum of session-break flags, then split each resulting session into 2-hour buckets counted from its first click (condition 2):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val userWindow = Window.partitionBy("user_id").orderBy("click_time")

// assumes a pipe-delimited file with a user_id|click_time header
val df = spark.read.option("header", true).option("delimiter", "|").csv("dataset")
  .select(col("user_id"), col("click_time").cast("timestamp"))

df
  .withColumn("lag_click_time", lag("click_time", 1).over(userWindow))
  // gap since the previous click, in units of 30 minutes
  .withColumn("time_diff", (col("click_time").cast("long") - col("lag_click_time").cast("long")) / (60 * 30))
  .na.fill(0)
  // condition 1: a gap of more than 30 minutes opens a new session
  .withColumn("is_new_session", when(col("time_diff") > 1, 1).otherwise(0))
  .withColumn("temp_session_id", sum(col("is_new_session")).over(userWindow))
  // condition 2: split each inactivity session into 2-hour buckets from its first click
  .withColumn("first_click_time", first(col("click_time")).over(Window.partitionBy("user_id", "temp_session_id").orderBy("click_time")))
  .withColumn("time_diff2", ((col("click_time").cast("long") - col("first_click_time").cast("long")) / (60 * 60 * 2)).cast("int"))
  // renumber the (temp_session_id, time_diff2) pairs densely; simply summing the two
  // keys can collide, e.g. session 0 / bucket 1 vs. session 1 / bucket 0
  .withColumn("session_id", dense_rank().over(Window.partitionBy("user_id").orderBy("temp_session_id", "time_diff2")))
  .drop("lag_click_time", "time_diff", "is_new_session", "temp_session_id", "first_click_time", "time_diff2")
  .show()
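
This bucket-based split approximates condition 2 with fixed 2-hour windows measured from each session's first click, which matches the expected output above. If you also want the globally unique Session1, Session2, ... labels from the expected output rather than a per-user numeric id, one option is a final dense_rank across all users. A minimal sketch, assuming the pipeline above is assigned to a val named sessionized (a hypothetical name) instead of ending in .show(); note that a window without partitionBy pulls every row onto a single partition, which is fine for small outputs but not at scale:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, concat, dense_rank, lit}

// sessionized: the DataFrame produced by the pipeline above (hypothetical name);
// number the (user_id, session_id) pairs across all users, then prefix with "Session"
val labeled = sessionized
  .withColumn("session_num", dense_rank().over(Window.orderBy("user_id", "session_id")))
  .withColumn("SessionId", concat(lit("Session"), col("session_num").cast("string")))
  .drop("session_num")

labeled.show(truncate = false)
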
Biman