I am updating my code to switch from updateStateByKey to mapWithState in order to build users' sessions based on a time-out of 2 minutes (the value 2 is for testing purposes only). Each session should aggregate all the streaming data (JSON strings) received within that session before it times out.
This was my old code:
val membersSessions = stream.map[(String, (Long, Long, List[String]))](eventRecord => {
  val parsed = Utils.parseJSON(eventRecord)
  val member_id = parsed.getOrElse("member_id", "")
  val timestamp = parsed.getOrElse("timestamp", "").toLong
  // The timestamp is returned twice: the first copy is used as the start time and the second as the end time
  (member_id, (timestamp, timestamp, List(eventRecord)))
})
val latestSessionInfo = membersSessions.map[(String, (Long, Long, Long, List[String]))](a => {
  // transform to (member_id, (time, time, counter, events within session))
  (a._1, (a._2._1, a._2._2, 1, a._2._3))
}).reduceByKey((a, b) => {
  // transform to (member_id, (lowestStartTime, maxFinishTime, sumOfCounter, events within session))
  (Math.min(a._1, b._1), Math.max(a._2, b._2), a._3 + b._3, a._4 ++ b._4)
}).updateStateByKey(Utils.updateState)
The problems of updateStateByKey are nicely explained here. One of the key reasons I decided to use mapWithState is that updateStateByKey was unable to return finished sessions (the ones that have timed out) for further processing.
This is my first attempt to transform the old code to the new version:
val spec = StateSpec.function(updateState _).timeout(Minutes(1))
val latestSessionInfo = membersSessions.map[(String, (Long, Long, Long, List[String]))](a => {
  // transform to (member_id, (time, time, counter, events within session))
  (a._1, (a._2._1, a._2._2, 1, a._2._3))
})
val userSessionSnapshots = latestSessionInfo.mapWithState(spec).snapshotStream()
I am not sure what the content of updateState should be, because as far as I understand, the time-out should no longer be calculated manually (it was previously done in my function Utils.updateState) and .snapshotStream should return the timed-out sessions.
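To make the question concrete, here is a minimal sketch of what such an updateState might look like. It assumes the session tuple layout used above and the three-argument mapping function form of StateSpec.function. The object name SessionState, the helper merge, the Option return type, and the two-minute value are illustrative assumptions, not code from my project. It relies on mapWithState invoking the function one final time with no value and state.isTimingOut() set to true when a key has been idle for the configured time-out.

```scala
import org.apache.spark.streaming.{Minutes, State, StateSpec}

object SessionState {
  // (startTime, endTime, eventCount, events): same layout as latestSessionInfo
  type Session = (Long, Long, Long, List[String])

  // Hypothetical helper: combine two partial sessions of the same member
  def merge(a: Session, b: Session): Session =
    (math.min(a._1, b._1), math.max(a._2, b._2), a._3 + b._3, a._4 ++ b._4)

  // Assumed shape of the mapping function: emits Some(session) only when
  // the session is finished, and None while it is still open.
  def updateState(memberId: String,
                  value: Option[Session],
                  state: State[Session]): Option[(String, Session)] = {
    if (state.isTimingOut()) {
      // Final call for an idle key: value is None and the state is about
      // to be dropped by Spark, so it must not be updated here.
      state.getOption().map(session => (memberId, session))
    } else {
      // Regular call: fold the incoming record into the stored session
      value.foreach { incoming =>
        val merged = state.getOption().map(merge(_, incoming)).getOrElse(incoming)
        state.update(merged)
      }
      None
    }
  }

  // Idle time-out of 2 minutes, matching the testing value mentioned above
  val spec = StateSpec.function(updateState _).timeout(Minutes(2))
}
```

With a function of this shape, the finished sessions would come out of the mapped stream itself, for example latestSessionInfo.mapWithState(SessionState.spec).flatMap(_.toList), while the snapshot stream would carry the current state of the keys that are still open. Whether that is the intended way to use the API is exactly what I would like confirmed.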