We are building a fault-tolerant system that reads from Kafka and writes to HBase and HDFS. The batch runs every 5 seconds. Here is the scenario we were hoping to set up:
Start a new Spark Streaming process with checkpointing enabled, read from Kafka, process the data, and store it to HDFS and HBase
Kill the Spark Streaming job; messages continue to flow into Kafka
Restart the Spark Streaming job, and here is what we really want to happen: Spark Streaming reads the checkpoint data and restarts with the correct Kafka offsets, so no Kafka messages are skipped even though the job was killed and restarted
This does not seem to work: the Spark Streaming job does not start (the stack trace of the error is pasted below). The only way I can resubmit the job is to delete the checkpoint directory. That, of course, means all the checkpoint information is lost and the Spark job starts reading only the new Kafka messages.
Is this supposed to work? And if yes, do I need to do something specific to get it to work?
Here's the sample code:
1) I am on Spark 1.6.2. Here's how I create the streaming context:
val ddqSsc = StreamingContext.getOrCreate(checkpointDir, () =>
  createDDQStreamingContext(slideInterval.toLong, inputKafka, outputKafka, hbaseVerTableName,
    checkpointDir, baseRawHdfs, securityProtocol, groupID, zooKeeper, kafkaBrokers,
    hiveDBToLoad, hiveTableToLoad))
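For completeness, the driver then simply starts the context. This is a simplified sketch (the real main has more setup and error handling around it), but it corresponds to the StreamingContext.start call that appears near the bottom of the stack trace:

//Simplified sketch of the rest of the driver
ddqSsc.start()            //start the streaming computation (or resume it after checkpoint recovery)
ddqSsc.awaitTermination() //block the driver until the job is stopped or fails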
2) And here's the initial part of the function that getOrCreate calls:
def createDDQStreamingContext(slideInterval: Long, inputKafka: String, outputKafka: String,
    hbaseVerTableName: String, checkpointDir: String, baseRawHdfs: String, securityProtocol: String,
    groupID: String, zooKeeper: String, kafkaBrokers: String, hiveDBToLoad: String,
    hiveTableToLoad: String): StreamingContext = {
val sparkConf = new SparkConf()
val ssc = new StreamingContext(sparkConf, Seconds(slideInterval))
//val sqlContext = new SQLContext(sc)
val sqlContext = new HiveContext(ssc.sparkContext)
import sqlContext.implicits._
ssc.checkpoint(checkpointDir)
val kafkaTopics = Set(inputKafka)
//Kafka parameters
var kafkaParams = Map[String, String]()
kafkaParams += ("bootstrap.servers" -> kafkaBrokers)
kafkaParams += ("zookeeper.connect" -> zooKeeper)
//Need this in a kerberos environment
kafkaParams += ("security.protocol" -> securityProtocol)
kafkaParams += ("sasl.kerberos.service.name" -> "kafka")
//Kafka consumer group id
kafkaParams += ("group.id" -> groupID)
kafkaParams += ("key.deserializer" -> "org.apache.kafka.common.serialization.StringDeserializer")
kafkaParams += ("value.deserializer" -> "org.apache.kafka.common.serialization.StringDeserializer")
val inputDataDstream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, kafkaTopics)
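Not shown above: later in the same function the DStream is converted to a DataFrame inside a transform, using the HiveContext implicits imported earlier. That is the call the stack trace below points at (ddqKafkaDataProcessor.scala:97). A simplified sketch of that step is below; enrichedDstream and the column names are placeholders, not the real parsing logic:

val enrichedDstream = inputDataDstream.transform { rdd =>
  //rdd holds the (key, value) pairs of one micro-batch; toDF comes from
  //sqlContext.implicits._ and is where the NullPointerException is thrown
  //after the job is restarted from the checkpoint
  rdd.toDF("msgKey", "msgValue").rdd
}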
==================STACK TRACE====================
2017-04-03 11:27:27,047 ERROR [Driver] yarn.ApplicationMaster: User class threw exception: java.lang.NullPointerException
java.lang.NullPointerException
    at org.apache.spark.sql.SQLConf.getConf(SQLConf.scala:638)
    at org.apache.spark.sql.SQLConf.dataFrameEagerAnalysis(SQLConf.scala:573)
    at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:132)
    at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52)
    at org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:417)
    at org.apache.spark.sql.SQLImplicits.rddToDataFrameHolder(SQLImplicits.scala:155)
    at com.wellsfargo.eda.bigdata.dced.dataprocessor.ddqKafkaDataProcessor$$anonfun$createDDQStreamingContext$1.apply(ddqKafkaDataProcessor.scala:97)
    at com.wellsfargo.eda.bigdata.dced.dataprocessor.ddqKafkaDataProcessor$$anonfun$createDDQStreamingContext$1.apply(ddqKafkaDataProcessor.scala:73)
    at org.apache.spark.streaming.dstream.DStream$$anonfun$transform$1$$anonfun$apply$21.apply(DStream.scala:700)
    at org.apache.spark.streaming.dstream.DStream$$anonfun$transform$1$$anonfun$apply$21.apply(DStream.scala:700)
    at org.apache.spark.streaming.dstream.DStream$$anonfun$transform$2$$anonfun$5.apply(DStream.scala:714)
    at org.apache.spark.streaming.dstream.DStream$$anonfun$transform$2$$anonfun$5.apply(DStream.scala:712)
    at org.apache.spark.streaming.dstream.TransformedDStream.compute(TransformedDStream.scala:46)
    at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:352)
    at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:352)
    at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
    at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:351)
    at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:351)
    at org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:426)
    at org.apache.spark.streaming.dstream.TransformedDStream.createRDDWithLocalProperties(TransformedDStream.scala:65)
    at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:346)
    at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:344)
    at scala.Option.orElse(Option.scala:257)
    at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:341)
    at org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:35)
    at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:352)
    at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:352)
    at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
    at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:351)
    at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:351)
    at org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:426)
    at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:346)
    at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:344)
    at scala.Option.orElse(Option.scala:257)
    at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:341)
    at org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:47)
    at org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:115)
    at org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:114)
    at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
    at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
    at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
    at org.apache.spark.streaming.DStreamGraph.generateJobs(DStreamGraph.scala:114)
    at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$restart$4.apply(JobGenerator.scala:233)
    at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$restart$4.apply(JobGenerator.scala:228)
    at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
    at org.apache.spark.streaming.scheduler.JobGenerator.restart(JobGenerator.scala:228)
    at org.apache.spark.streaming.scheduler.JobGenerator.start(JobGenerator.scala:97)
    at org.apache.spark.streaming.scheduler.JobScheduler.start(JobScheduler.scala:83)
    at org.apache.spark.streaming.StreamingContext$$anonfun$liftedTree1$1$1.apply$mcV$sp(StreamingContext.scala:610)
    at org.apache.spark.streaming.StreamingContext$$anonfun$liftedTree1$1$1.apply(StreamingContext.scala:606)
    at org.apache.spark.streaming.StreamingContext$$anonfun$liftedTree1$1$1.apply(StreamingContext.scala:606)
    at ... run in separate thread using org.apache.spark.util.ThreadUtils ... ()
    at org.apache.spark.streaming.StreamingContext.liftedTree1$1(StreamingContext.scala:606)
    at org.apache.spark.streaming.StreamingContext.start(StreamingContext.scala:600)
    at com.wellsfargo.eda.bigdata.dced.dataprocessor.ddqKafkaDataProcessor$.main(ddqKafkaDataProcessor.scala:402)
    at com.wellsfargo.eda.bigdata.dced.dataprocessor.ddqKafkaDataProcessor.main(ddqKafkaDataProcessor.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:559)