In our cluster we run Kafka 0.10.1 and Spark 2.1.0. The Spark Streaming application works fine with the checkpointing mechanism (checkpoints on HDFS). However, we noticed that with checkpoints the streaming application does not restart if there is a code change.
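For context, our checkpointing setup is roughly like the sketch below (the function name createContext, the batch interval and the checkpoint path are placeholders, not our exact code):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Placeholder checkpoint directory on HDFS
val checkpointDir = "hdfs:///user/spark/streaming-checkpoints"

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("KafkaStreamingApp")
  val ssc = new StreamingContext(conf, Seconds(30))
  ssc.checkpoint(checkpointDir)
  // DStream setup (createDirectStream, transformations, outputs) goes here
  ssc
}

// Recover from the checkpoint if it exists, otherwise build a new context.
// A code change can invalidate the serialized DStream graph in the checkpoint,
// which is why the application fails to restart from it after a redeploy.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()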
Looking at the Spark Streaming documentation on storing offsets in Kafka itself (http://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html#kafka-itself), which shows:
stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // some time later, after outputs have completed
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}
Following this, I modified our code as follows:
val kafkaMap: Map[String, Object] = KafkaConfigs
val stream: InputDStream[ConsumerRecord[String, String]] =
  KafkaUtils.createDirectStream(ssc, PreferConsistent, Subscribe[String, String](Array("topicName"), kafkaMap))

stream.foreachRDD { rdd =>
  val offsetRanges: Array[OffsetRange] = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  // Filter out records with empty values and build tuples of type
  // (topicName, stringValue_read_from_kafka_topic)
  stream.map(x => ("topicName", x.value)).filter(x => !x._2.trim.isEmpty).foreachRDD(processRDD _)

  // Sometime later, after outputs have completed.
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}
def processRDD(rdd: RDD[(String, String)]) {
  // Process further to HDFS
}
Now, when I try to start the streaming application, it does not start, and looking at the logs this is what we see:
java.lang.IllegalStateException: Adding new inputs, transformations, and output operations after starting a context is not supported
at org.apache.spark.streaming.dstream.DStream.validateAtInit(DStream.scala:223)
at org.apache.spark.streaming.dstream.DStream.<init>(DStream.scala:65)
at org.apache.spark.streaming.dstream.MappedDStream.<init>(MappedDStream.scala:29)
at org.apache.spark.streaming.dstream.DStream$$anonfun$map$1.apply(DStream.scala:546)
at org.apache.spark.streaming.dstream.DStream$$anonfun$map$1.apply(DStream.scala:546)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.SparkContext.withScope(SparkContext.scala:701)
at org.apache.spark.streaming.StreamingContext.withScope(StreamingContext.scala:264)
at org.apache.spark.streaming.dstream.DStream.map(DStream.scala:545)
Can someone please point out what we are missing here?