
Is there a recommended way to do the following type of transform:
Input:

userID  timestamp  action  
u1      100        edit1
u1      122        edit2
u1      135        edit3
u2      143        edit4
u2      158        edit5
u1      212        edit6
u1      241        edit7

Output:

userID  startTime  endTime  actions
u1      100        135      [edit1, edit2, edit3]
u2      143        158      [edit4, edit5]
u1      212        241      [edit6, edit7]

Grouping by userID results in the session information being lost, and the same happens with plain Window functions.

EDIT for clarity: note that user u1 appears twice in the result set, because u2's actions break u1's actions into two separate sessions.

U J
  • Thanks @jaceklaskowski. The other question helped. – U J Dec 18 '17 at 18:29
  • Another similar question answered: https://stackoverflow.com/questions/44020343/spark-dataframe-access-of-previous-calculated-row – U J Dec 18 '17 at 18:41

1 Answer


Aggregate all columns using good ol' SQL:

SELECT userID, min(timestamp) AS startTime, max(timestamp) AS endTime, collect_list(action) AS actions
FROM df GROUP BY userID
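
To run the query, the DataFrame has to be registered as a view under the name used in the FROM clause; a minimal sketch, assuming a SparkSession named spark:

df.createOrReplaceTempView("df")
val grouped = spark.sql(
  "SELECT userID, min(timestamp) AS startTime, max(timestamp) AS endTime, " +
  "collect_list(action) AS actions FROM df GROUP BY userID")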

or Dataset API:

df.groupBy("userID").agg(
    min("timestamp") as "startTime",
    max("timestamp") as "endTime",
    collect_list("actions") as "actions")
Jacek Laskowski
  • Sadly that would result in the following rows: `[(u1 100 241 [edit1, edit2, edit3, edit6, edit7]), (u2 143 158 [edit4, edit5] )]` and would not maintain separate sessions for the same user. u1 needs to have 2 rows which represent the two sessions. – U J Dec 15 '17 at 19:26
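
For reference (not part of the answer above): one common way to keep u1's two sessions separate is to detect session boundaries first and only then aggregate. A minimal sketch, assuming a session ends whenever the userID changes between consecutive rows ordered by timestamp, and using the column names from the question (the unpartitioned window pulls all rows onto one partition, so treat it as an illustration rather than a scalable solution):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{coalesce, col, collect_list, lag, lit, max, min, sum}

// Global ordering by timestamp; fine for a sketch, expensive on large data.
val byTime = Window.orderBy("timestamp")

val sessions = df
  // 1 when the userID differs from the previous row's userID (or there is no previous row).
  .withColumn("isNewSession",
    coalesce((lag("userID", 1).over(byTime) =!= col("userID")).cast("int"), lit(1)))
  // A running sum of the boundary flags numbers the sessions 1, 2, 3, ...
  .withColumn("sessionId", sum("isNewSession").over(byTime))

sessions
  .groupBy("userID", "sessionId")
  .agg(
    min("timestamp") as "startTime",
    max("timestamp") as "endTime",
    collect_list("action") as "actions")
  .drop("sessionId")

On the sample data this numbers the three blocks 1, 2 and 3, so u1 comes out as two rows (100–135 and 212–241), matching the desired output.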