
We are currently working on a system that uses Kafka, Spark Streaming, and Cassandra as the database. We use checkpointing as described in the [Spark Streaming programming guide](http://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing). Inside the function used to create the StreamingContext, we call createDirectStream to create our DStream, and from that point we execute several transformations and actions that end in calls to saveToCassandra on different RDDs.
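Roughly, the context-creating function looks like the following (a simplified sketch: the application name, broker list, topic, keyspace, table and column names are placeholders, not our real configuration):

```scala
import com.datastax.spark.connector._
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Simplified sketch of the context-creating function; all names and values are placeholders.
def createContext(checkpointDir: String): StreamingContext = {
  val sparkConf = new SparkConf().setAppName("kafka-to-cassandra")
  val ssc = new StreamingContext(sparkConf, Seconds(10))
  ssc.checkpoint(checkpointDir)

  val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")

  // Direct (receiver-less) stream from Kafka.
  val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
    ssc, kafkaParams, Set("events"))

  // Several transformations, ending in saveToCassandra on the resulting RDDs.
  stream
    .map { case (_, value) => value.split(',') }
    .map(fields => (fields(0), fields(1)))
    .foreachRDD { rdd =>
      rdd.saveToCassandra("ks", "events", SomeColumns("id", "payload"))
    }

  ssc
}
```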

We are running different tests to establish how the application should recover when a failure occurs. Some key points about our scenario are:

  • We are testing with a fixed number of records already in Kafka (between 10 and 20 million); that is, we consume from Kafka once and the application pulls in all of those records.

  • We are executing the application with --deploy-mode 'client' on one of the workers, which means that we stop and start the driver manually.

We are not sure how to handle exceptions raised after the DStreams have been created. For example, if all Cassandra nodes are down while writing, we get an exception that aborts the job; after re-submitting the application, that job is not re-scheduled, and the application keeps consuming from Kafka, producing multiple 'isEmpty' calls.
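For illustration, an empty-batch guard like the one below (continuing the placeholder sketch above, not our exact code) would produce repeated 'isEmpty' calls of this kind: with a fixed data set, every micro-batch after the initial pass is empty, so only the guard keeps running.

```scala
// Hypothetical empty-batch guard on the placeholder stream from the sketch above:
// once all the Kafka records have been consumed, each new batch is empty, so only
// this check runs and nothing more is written to Cassandra.
stream.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    rdd.saveToCassandra("ks", "events", SomeColumns("id", "payload"))
  }
}
```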

We ran a couple of tests using 'cache' on the repartitioned RDD (which did not help for any failure other than simply stopping and starting the driver), and we changed the parameters "query.retry.count", "query.retry.delay" and "spark.task.maxFailures", also without success: the job is still aborted after x failed attempts.
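For reference, this is roughly how those parameters can be set on the SparkConf (a sketch with arbitrary example values; the connector settings are passed with the spark.cassandra. prefix, and the exact value format for the retry delay should be checked against the connector documentation for your version):

```scala
import org.apache.spark.SparkConf

// Sketch of the retry-related settings we experimented with; the values are examples only.
val sparkConf = new SparkConf()
  .setAppName("kafka-to-cassandra")
  // Cassandra connector: number of retries per query before giving up.
  .set("spark.cassandra.query.retry.count", "10")
  // Cassandra connector: delay between retries (format depends on the connector version).
  .set("spark.cassandra.query.retry.delay", "1000")
  // Spark core: task failures tolerated before the job is aborted.
  .set("spark.task.maxFailures", "8")
```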

At this point we are confused about how we should use the checkpoint to get jobs re-scheduled after a failure.
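Our understanding of the checkpoint-recovery pattern from the programming guide is roughly the following (a sketch reusing the placeholder createContext function from above):

```scala
import org.apache.spark.streaming.StreamingContext

// Sketch of checkpoint-based recovery: getOrCreate rebuilds the StreamingContext from the
// checkpoint directory if one exists (the guide says incomplete batches are regenerated),
// otherwise it calls the creating function to build a fresh context.
val checkpointDir = "/path/on/fault-tolerant/storage"   // placeholder path

val ssc = StreamingContext.getOrCreate(checkpointDir, () => createContext(checkpointDir))
ssc.start()
ssc.awaitTermination()
```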

  • When the driver fails, all the executors on the worker nodes are killed as well, along with their data. To avoid losing received data, write-ahead logs should be enabled: set spark.streaming.receiver.writeAheadLog.enable to true (false by default) and set the checkpoint directory using streamingContext.checkpoint(path-to-directory). – nagendra Mar 16 '16 at 07:30
  • For more information about zero data loss, have a look at this [link](https://databricks.com/blog/2015/01/15/improved-driver-fault-tolerance-and-zero-data-loss-in-spark-streaming.html) – nagendra Mar 16 '16 at 07:33
  • We made the change in the configuration but the lost jobs were not regenerated. Talking with Lightbend support, we realized that we are creating a direct stream to connect with Kafka [(here)](http://spark.apache.org/docs/latest/streaming-kafka-integration.html#approach-2-direct-approach-no-receivers). It says: "This second approach eliminates the problem as there is no receiver, and hence no need for Write Ahead Logs." – naticos Mar 17 '16 at 16:16
  • Maybe we are doing something wrong when setting up the Spark Streaming configuration. This is how we create the streaming context: `val ssc = StreamingContext.getOrCreate(checkpointPath = appConfig.checkpointDir, creatingFunc = creatingFunction())` – naticos Mar 17 '16 at 16:21
  • And this is the creating function – naticos Mar 17 '16 at 16:21

    ```scala
    def creatingFunction(): () => StreamingContext = { () =>
      import appConfig._
      val streamingContext = new StreamingContext(sparkConf, Duration(appConfig.batchInterval.toMillis))
      // this processStream creates the DStream from kafka using the direct approach
      new SparkConsumerCassandra(streamingContext, appConfig).processStream()
      streamingContext.checkpoint(appConfig.checkpointDir)
      streamingContext
    }
    ```

    – naticos Mar 17 '16 at 16:25
