
I am running a Spark Streaming application based on the mapWithState DStream function. The application transforms input records into sessions based on a session ID field inside the records.

A session is simply all of the records with the same ID. Then I perform some analytics at the session level to find an anomaly score.

I couldn't stabilize my application because a handful of sessions keep getting bigger at every batch time over an extended period (more than 1h). My understanding is that a single session (key-value pair) is always processed by a single core in Spark. I want to know if I am mistaken, and whether there is a solution to mitigate this issue and make the streaming application stable.

I am using Hadoop 2.7.2 and Spark 1.6.1 on YARN. Changing the batch time, block interval, number of partitions, number of executors and executor resources didn't solve the issue, as a single task always makes the application choke. However, filtering out those super long sessions solved the issue.
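
The filtering I mention is roughly the following, using the updateStreamingState function shown further below; maxSessionLength is just a hand-picked cap for illustration, not something that appears in the rest of the code:

// Rough sketch of the filtering that stabilizes the job: drop the handful of
// sessions whose accumulated record count exceeds a hand-picked cap.
val maxSessionLength = 5000 // placeholder value
val stableSessions = updateStreamingState(inputDstream)
  .filter { case (_, _, sessionRecords) => sessionRecords.size <= maxSessionLength }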

Below is the updateState function I am using:

val updateState = (batchTime: Time, key: String, value: Option[scala.collection.Map[String,Any]], state: State[Seq[scala.collection.Map[String,Any]]]) => {
  // Prepend the new record (if any) to the records already accumulated for this session
  val session = Seq(value.getOrElse(scala.collection.Map[String,Any]())) ++ state.getOption.getOrElse(Seq[scala.collection.Map[String,Any]]())
  if (state.isTimingOut()) {
    // The state is being removed because of the timeout and cannot be updated here
    Option(null)
  } else {
    state.update(session)
    Some((key, value, session))
  }
}

and the mapWithState call:

def updateStreamingState(inputDstream: DStream[scala.collection.Map[String,Any]]): DStream[(String, Option[scala.collection.Map[String,Any]], Seq[scala.collection.Map[String,Any]])] = {
  val spec = StateSpec.function(updateState)
  spec.timeout(Duration(sessionTimeout))
  spec.numPartitions(192)
  inputDstream.map(ds => (ds(sessionizationFieldName).toString, ds)).mapWithState(spec)
}

Finally, I apply the session feature computation to each session in a foreach over the DStream, using the function defined below (see the sketch after it for roughly how it is wired in):

def computeSessionFeatures(sessionId: String, sessionRecords: Seq[scala.collection.Map[String,Any]]): Session = {
  // Compute the per-session features from the accumulated records
  val features = Functions.getSessionFeatures(sessionizationFeatures, recordFeatures, sessionRecords)
  val resultSession = new Session(sessionId, sessionizationFieldName, sessionRecords)
  resultSession.features = features
  resultSession
}
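
For completeness, this is roughly how the feature computation is hooked to the sessionized DStream; the variable names here are placeholders rather than the exact ones in my job:

// The sessionized stream from updateStreamingState is consumed with foreachRDD,
// and each session is turned into a Session object carrying its features.
val sessionStream = updateStreamingState(inputDstream)
sessionStream.foreachRDD { rdd =>
  rdd.foreach { case (sessionId, _, sessionRecords) =>
    val session = computeSessionFeatures(sessionId, sessionRecords)
    // ... anomaly scoring / output for `session` goes here
  }
}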
  • Some code could be useful... If we don't know how exactly you use mapWithState it is hard to give you any advice. – zero323 Aug 03 '16 at 10:36
  • Yes, thanks, just added the three functions that seem the most relevant. – ZianyD Aug 03 '16 at 11:49
  • Thanks. Could you clarify a couple of things? I understand that the issue here is that the session is growing with each batch, right? It is not that the number of fields in the merged maps grows, and it is not that you have a large number of events for some sessions in each window. – zero323 Aug 03 '16 at 12:20
  • Yes, the problem is that some sessions (not many) are constantly getting many new records (at each batch), which makes computing the new features take more time than the batch time. From the Spark UI, just one or two tasks are taking too long and making the application unstable. Hope the answer is clear. – ZianyD Aug 03 '16 at 12:36
