
Is there a recommended way to do the following type of transform:
Input:

userID  timestamp  action  
u1      100        edit1
u1      122        edit2
u1      135        edit3
u2      143        edit4
u2      158        edit5
u1      212        edit6
u1      241        edit7

Output:

userID  startTime  endTime  actions
u1      100        135      [edit1, edit2, edit3]
u2      143        158      [edit4, edit5]
u1      212        241      [edit6, edit7]

Grouping by userID results in the session information being lost, and the same happens with plain Window functions.

EDIT for clarity: note that user u1 appears twice in the result set, because u2's actions break u1's actions into two separate sessions.

U J
  • Thanks @jaceklaskowski. The other question helped. – U J Dec 18 '17 at 18:29
  • Another similar question answered: https://stackoverflow.com/questions/44020343/spark-dataframe-access-of-previous-calculated-row – U J Dec 18 '17 at 18:41

1 Answer


Aggregate all columns using good ol' SQL:

SELECT userID, min(timestamp) AS startTime, max(timestamp) AS endTime, collect_list(action) AS actions
FROM df GROUP BY userID
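
To run the query, the DataFrame has to be registered as a view under the name used in the FROM clause; a minimal sketch, assuming a SparkSession named spark:

df.createOrReplaceTempView("df")
val grouped = spark.sql(
  "SELECT userID, min(timestamp) AS startTime, max(timestamp) AS endTime, " +
  "collect_list(action) AS actions FROM df GROUP BY userID")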

or Dataset API:

df.groupBy("userID").agg(
    min("timestamp") as "startTime",
    max("timestamp") as "endTime",
    collect_list("actions") as "actions")
Jacek Laskowski
  • Sadly that would result in the following rows: `[(u1 100 241 [edit1, edit2, edit3, edit6, edit7]), (u2 143 158 [edit4, edit5] )]` and would not maintain separate sessions for the same user. u1 needs to have 2 rows which represent the two sessions. – U J Dec 15 '17 at 19:26
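
For reference (not part of the answer above): one common way to keep u1's two sessions separate is to detect session boundaries first and only then aggregate. A minimal sketch, assuming a session ends whenever the userID changes between consecutive rows ordered by timestamp, and using the column names from the question (the unpartitioned window pulls all rows onto one partition, so treat it as an illustration rather than a scalable solution):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{coalesce, col, collect_list, lag, lit, max, min, sum}

// Global ordering by timestamp; fine for a sketch, expensive on large data.
val byTime = Window.orderBy("timestamp")

val sessions = df
  // 1 when the userID differs from the previous row's userID (or there is no previous row).
  .withColumn("isNewSession",
    coalesce((lag("userID", 1).over(byTime) =!= col("userID")).cast("int"), lit(1)))
  // A running sum of the boundary flags numbers the sessions 1, 2, 3, ...
  .withColumn("sessionId", sum("isNewSession").over(byTime))

sessions
  .groupBy("userID", "sessionId")
  .agg(
    min("timestamp") as "startTime",
    max("timestamp") as "endTime",
    collect_list("action") as "actions")
  .drop("sessionId")

On the sample data this numbers the three blocks 1, 2 and 3, so u1 comes out as two rows (100–135 and 212–241), matching the desired output.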