I am updating my code to switch from updateStateByKey to mapWithState in order to build users' sessions based on a time-out of 2 minutes (2 is used for testing purposes only). Each session should aggregate all the streaming data (JSON strings) that arrive within the session before it times out.

This was my old code:

val membersSessions = stream.map[(String, (Long, Long, List[String]))](eventRecord => {
  val parsed = Utils.parseJSON(eventRecord)
  val member_id = parsed.getOrElse("member_id", "")
  val timestamp = parsed.getOrElse("timestamp", "").toLong
  //The timestamp is returned twice because the first one will be used as the start time and the second one as the end time
  (member_id, (timestamp, timestamp, List(eventRecord)))
})
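
For reference, `Utils.parseJSON` is not shown here; a minimal sketch of what it might look like (purely an assumption on my part, using json4s and assuming all JSON values arrive as strings) so the snippet above is self-contained:

import org.json4s._
import org.json4s.jackson.JsonMethods.parse

object Utils extends Serializable {
  implicit val formats: Formats = DefaultFormats

  // Hypothetical helper: parse a flat JSON object into a String-to-String map.
  def parseJSON(json: String): Map[String, String] =
    parse(json).extract[Map[String, String]]
}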

val latestSessionInfo = membersSessions.map[(String, (Long, Long, Long, List[String]))](a => {
  //transform to (member_id, (time, time, counter, events within session))
  (a._1, (a._2._1, a._2._2, 1, a._2._3))
}).
  reduceByKey((a, b) => {
    //transform to (member_id, (lowestStartTime, MaxFinishTime, sumOfCounter, events within session))
    (Math.min(a._1, b._1), Math.max(a._2, b._2), a._3 + b._3, a._4 ++ b._4)
  }).updateStateByKey(Utils.updateState)

The problems of updateStateByKey are nicely explained here. One of the key reasons why I decided to use mapWithState is because updateStateByKey was unable to return finished sessions (the ones that have timed out) for further processing.
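
To make that limitation concrete, here is a simplified sketch of the updateStateByKey contract (not the original Utils.updateState): the update function can only return the new state, and returning None deletes the key silently, so a finished session never reaches the output stream.

def updateState(
    newValues: Seq[(Long, Long, Long, List[String])],
    state: Option[(Long, Long, Long, List[String])]
): Option[(Long, Long, Long, List[String])] = {
  // Fold the batch's new values into the previous state.
  val merged = (state.toSeq ++ newValues).reduceOption { (a, b) =>
    (Math.min(a._1, b._1), Math.max(a._2, b._2), a._3 + b._3, a._4 ++ b._4)
  }
  // Returning None here would delete a timed-out session from the state,
  // but the deleted value is never emitted downstream: the output DStream
  // only ever contains the currently live states.
  merged
}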

This is my first attempt to transform the old code to the new version:

val spec = StateSpec.function(updateState _).timeout(Minutes(2))
val latestSessionInfo = membersSessions.map[(String, (Long, Long, Long, List[String]))](a => {
  //transform to (member_id, (time, time, counter, events within session))
  (a._1, (a._2._1, a._2._2, 1, a._2._3))
})
val userSessionSnapshots = latestSessionInfo.mapWithState(spec).stateSnapshots()

I don't quite understand what the content of updateState should be, because as far as I understand the time-out should no longer be calculated manually (previously it was done in my function Utils.updateState), and .stateSnapshots() should return the timed-out sessions.

duckertito
  • How do you know if a session completes prior to timing out? Or does it always only time out? – Yuval Itzchakov Nov 24 '16 at 14:34
  • You have to update the state of the sessions you want to keep, and return the timed-out sessions from the StateSpec function. The timeout option just removes stale data from the cache. – ImDarrenG Nov 24 '16 at 14:46

1 Answer

Assuming you're always waiting on a time-out of 2 minutes, you can make your mapWithState stream only output data once the time-out is triggered.

What would this mean for your code? It means you now need to watch for the time-out instead of outputting the tuple on each iteration. I would imagine your mapWithState function will look something along the lines of:

def updateState(key: String,
                value: Option[(Long, Long, Long, List[String])],
                state: State[(Long, Long, Long, List[String])]): Option[(Long, Long, Long, List[String])] = {
  // Merge two partial sessions: earliest start, latest end, summed counter, concatenated events.
  def reduce(first: (Long, Long, Long, List[String]), second: (Long, Long, Long, List[String])) = {
    (Math.min(first._1, second._1), Math.max(first._2, second._2), first._3 + second._3, first._4 ++ second._4)
  }

  value match {
    case Some(currentValue) =>
      // New data arrived for this key: fold it into the existing state and emit nothing.
      val result = state
        .getOption()
        .map(currentState => reduce(currentState, currentValue))
        .getOrElse(currentValue)
      state.update(result)
      None
    case _ if state.isTimingOut() =>
      // No new data and the time-out fired: emit the completed session.
      state.getOption()
    case _ =>
      // Defensive default to keep the match exhaustive.
      None
  }
}

This way, you only output something externally to the stream if the state has timed out, otherwise you aggregate it inside the state.

This means that your Spark DStream graph can filter out all values which aren't defined, and only keep those which are:

latestSessionInfo
 .mapWithState(spec)
 .filter(_.isDefined)

After filter, you'll only have states which have timed out.
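
Since the elements of the filtered stream are still Options, a small follow-up step (my sketch, not part of the original answer) unwraps them before further processing; flatMap does the filter and the unwrap in one pass:

import org.apache.spark.streaming.dstream.DStream

// Timed-out sessions only, with the Option layer removed
// (flatMap drops the Nones and unwraps the Somes via the
// standard Option-to-Iterable conversion).
val finishedSessions: DStream[(Long, Long, Long, List[String])] =
  latestSessionInfo
    .mapWithState(spec)
    .flatMap(option => option)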

Yuval Itzchakov
  • Answering your question in the comments: I assume the session has timed out if I don't receive any events for the member_id during the last N=2 minutes. The value of N is always fixed. If the user becomes active again after N=2 minutes, that counts as a new session. Of course, in real life N will be more like 40-50 minutes. – duckertito Nov 24 '16 at 15:20
  • Is `filter(_.isDefined)` the same as `.stateSnapshots()`? I just want to extract those sessions that have been inactive during the last 2 minutes (no streaming events) - this is what I call a time-out. – duckertito Nov 24 '16 at 15:21
  • @duckertito Ok, but does a user send a "I've finished my session" msg even though it hasn't timed out yet? – Yuval Itzchakov Nov 24 '16 at 15:23
  • @duckertito No, `filter(_.isDefined)` will filter the `DStream` and keep only the states that have timed out. If you look at the code, when we receive a value and pattern match it to `Some(currentValue)`, we reduce and update the state, and then return `None`. `stateSnapshots` will give you a stream of all the states that are currently in memory and have yet to time out - that's the opposite of what you want. – Yuval Itzchakov Nov 24 '16 at 15:24
  • Ok, let me test your solution for my case, and I will let you know the result. – duckertito Nov 24 '16 at 18:08
  • I have a `Task serialization` error triggered by `val spec = StateSpec.function(updateState _).timeout(Minutes(2))`. I also tried putting `updateState` into `object Utils extends Serializable { def updateState(...) }`. Do you have any idea why this happens? The content of `updateState` is the same as you published, and it does not use any non-serializable variable. Could it be a problem with `def reduce` nested inside `def updateState`? – duckertito Nov 28 '16 at 16:46
  • It also says `java.io.NotSerializableException: DStream checkpointing has been enabled but the DStreams with their functions are not serializable` for `ssc.checkpoint("~/checkpoint")` – duckertito Nov 28 '16 at 16:48
  • That doesn't seem related. Seems like you have something non-serializable inside one of your closures. – Yuval Itzchakov Nov 28 '16 at 16:56
  • I published the new question with the detailed code and error stacktrace. If you have time, I appreciate if you could take a look: http://stackoverflow.com/questions/40849850/task-serialization-error-when-using-mapwithstate – duckertito Nov 28 '16 at 17:12
  • @YuvalItzchakov: Hi, I am using your solution for the same task - grouping events by member_id once the session has timed out. However, I noticed quite strange behaviour: depending on the batch in which an event arrives, it either gets grouped into the same session or is interpreted as a new session. I cannot understand why this happens. If you find it interesting, you may take a look at this thread: http://stackoverflow.com/questions/41834596/issue-with-grouping-events-into-a-single-session-by-session-timeout – Dinosaurius Jan 24 '17 at 17:58
  • @Dinosaurius I see you deleted it, I'm assuming you worked it out. – Yuval Itzchakov Jan 24 '17 at 19:21
  • @YuvalItzchakov: Yes, I did some careful debugging and finally found that the issue was in another part of the code. Sorry about that. It's not so easy to debug Spark Streaming processes. – Dinosaurius Jan 24 '17 at 19:26
  • @Dinosaurius Definitely. Setting up a local environment helps though. – Yuval Itzchakov Jan 24 '17 at 19:27