In our cluster we run Kafka 0.10.1 and Spark 2.1.0. The Spark Streaming application works fine with the checkpointing mechanism (checkpoints on HDFS). However, we noticed that with checkpoints the streaming application does not restart if there is a code change.
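For context, our checkpointing setup is roughly like the sketch below (the function name createContext, the batch interval and the checkpoint path are placeholders, not our exact code):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Placeholder checkpoint directory on HDFS
val checkpointDir = "hdfs:///user/spark/streaming-checkpoints"

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("KafkaStreamingApp")
  val ssc = new StreamingContext(conf, Seconds(30))
  ssc.checkpoint(checkpointDir)
  // DStream setup (createDirectStream, transformations, outputs) goes here
  ssc
}

// Recover from the checkpoint if it exists, otherwise build a new context.
// A code change can invalidate the serialized DStream graph in the checkpoint,
// which is why the application fails to restart from it after a redeploy.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()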
Looking at the Spark Streaming documentation on storing offsets in Kafka itself (http://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html#kafka-itself), which shows:
stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // some time later, after outputs have completed
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}
Following this, I modified our code as follows:
val kafkaMap: Map[String, Object] = KafkaConfigs
val stream: InputDStream[ConsumerRecord[String, String]] =
  KafkaUtils.createDirectStream(ssc, PreferConsistent, Subscribe[String, String](Array("topicName"), kafkaMap))

stream.foreachRDD { rdd =>
  val offsetRanges: Array[OffsetRange] = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  // Filter out records with empty values and build tuples of type
  // (topicName, stringValue_read_from_kafka_topic)
  stream.map(x => ("topicName", x.value)).filter(x => !x._2.trim.isEmpty).foreachRDD(processRDD _)

  // Sometime later, after outputs have completed.
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}
def processRDD(rdd: RDD[(String, String)]) {
  // Process further to HDFS
}
Now, when I try to start the streaming application, it does not start, and looking at the logs this is what we see:
java.lang.IllegalStateException: Adding new inputs, transformations, and output operations after starting a context is not supported
at org.apache.spark.streaming.dstream.DStream.validateAtInit(DStream.scala:223)
at org.apache.spark.streaming.dstream.DStream.<init>(DStream.scala:65)
at org.apache.spark.streaming.dstream.MappedDStream.<init>(MappedDStream.scala:29)
at org.apache.spark.streaming.dstream.DStream$$anonfun$map$1.apply(DStream.scala:546)
at org.apache.spark.streaming.dstream.DStream$$anonfun$map$1.apply(DStream.scala:546)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.SparkContext.withScope(SparkContext.scala:701)
at org.apache.spark.streaming.StreamingContext.withScope(StreamingContext.scala:264)
at org.apache.spark.streaming.dstream.DStream.map(DStream.scala:545)
Can someone please point out what we are missing here?