I have a Spark Streaming application running that uses the mapWithState function to track state across RDDs. The application runs fine for a few minutes, but then crashes with:
org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 373
I observed that the memory usage of the Spark application increases linearly over time, even though I have set a timeout for the MapWithStateRDD. Please see the code snippet below and the memory usage:
val completedSess = sessionLines
  .mapWithState(StateSpec.function(trackStateFunction _)
    .numPartitions(80)
    .timeout(Minutes(5)))
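For context, here is a simplified sketch of how the stream is set up. The socket source, the toSessionPair helper, and the checkpoint path are placeholders rather than my real ingestion code; the checkpoint directory itself is required by mapWithState:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Placeholder key extraction: first token of the line is the session key.
def toSessionPair(line: String): (String, String) =
  (line.takeWhile(_ != ' '), line)

val conf = new SparkConf().setAppName("SessionTracker")
val ssc = new StreamingContext(conf, Seconds(10))
// mapWithState requires a checkpoint directory to be set.
ssc.checkpoint("/tmp/spark-checkpoint") // placeholder path

// sessionLines: DStream[(String, String)] of (sessionKey, line)
val sessionLines = ssc.socketTextStream("localhost", 9999).map(toSessionPair)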
Why does the memory increase linearly over time if there is an explicit timeout for the state in each RDD?
I have tried increasing the memory, but it does not help. What am I missing?
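In case it matters, the memory settings I experimented with were roughly along these lines (illustrative values, not my exact configuration; in practice I pass them at submit time):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("SessionTracker")
  .set("spark.executor.memory", "8g")                // tried several sizes
  .set("spark.yarn.executor.memoryOverhead", "2048") // MB of off-heap headroom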
Edit - Code for reference
import org.apache.spark.streaming.{State, Time}

// State per key: (sessionEnded, linesSoFar, lastUpdatedSec)
def trackStateFunction(batchTime: Time, key: String, value: Option[String],
                       state: State[(Boolean, List[String], Long)]): Option[(Boolean, List[String])] = {

  def updateSessions(newLine: String): Option[(Boolean, List[String])] = {
    val currentTime = System.currentTimeMillis() / 1000
    if (state.exists()) {
      val newLines = state.get()._2 :+ newLine
      // Check if the end of the session has been reached.
      // If yes, remove the state and return; else update the state.
      if (isEndOfSessionReached(newLine, state.get()._3)) {
        state.remove()
        Some((true, newLines))
      } else {
        val newState = (false, newLines, currentTime)
        state.update(newState)
        Some((newState._1, newState._2))
      }
    } else {
      // First line seen for this key: start a new session.
      val newState = (false, List(newLine), currentTime)
      state.update(newState)
      Some((newState._1, newState._2))
    }
  }

  value match {
    case Some(newLine) => updateSessions(newLine)
    // On timeout the entry is removed by Spark; emit what we have so far.
    case _ if state.isTimingOut() => Some((true, state.get()._2))
    case _ =>
      println("Not matched to any expression")
      None
  }
}
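isEndOfSessionReached is not shown above; a minimal sketch of the idea, assuming a session ends on an explicit marker or after five minutes of inactivity (the marker string is a placeholder):

// Placeholder logic; my real version is application-specific.
def isEndOfSessionReached(line: String, lastUpdatedSec: Long): Boolean = {
  val nowSec = System.currentTimeMillis() / 1000
  line.contains("END_OF_SESSION") || (nowSec - lastUpdatedSec) > 300
}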