
I have a process (in Scala) running on a Spark cluster which processes some data, uploads the result, and updates the processing state. I want the upload and the state update to be an atomic operation, since the state is crucial for resuming the job and avoiding double processing. We regularly need to kill the running job and start a new one whenever we want to update the jar. While killing the job I want to preserve that atomicity and exit gracefully: either stop before the upload starts, or wait until both the upload and the state update are complete. How can this be achieved? If we use the YARN APIs to kill the application, it might exit abruptly and leave the state inconsistent. What is the best way to tackle this?
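For illustration, the critical section looks roughly like this (the names are placeholders for the real implementations):

object UploadAndCommit {
  // Placeholders for the real upload and state-update implementations.
  def uploadResult(result: String): Unit = println(s"uploaded: $result")
  def updateProcessingState(batchId: Long): Unit = println(s"state committed for batch $batchId")

  // These two steps form the critical section: they must either both run to
  // completion, or not start at all when the job is asked to shut down.
  def commit(batchId: Long, result: String): Unit = {
    uploadResult(result)
    updateProcessingState(batchId)
  }
}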

Sayantan Ghosh

1 Answer


You can enable graceful shutdown in your Spark configuration with

sparkConf.set("spark.streaming.stopGracefullyOnShutdown", "true")
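For context, here is a minimal sketch of where that setting would go; the app name, batch interval, and processing logic are placeholders:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object GracefulJob {
  def main(args: Array[String]): Unit = {
    // Ask Spark Streaming to finish the in-flight batch when the JVM
    // receives SIGTERM instead of stopping immediately.
    val sparkConf = new SparkConf()
      .setAppName("graceful-job") // hypothetical app name
      .set("spark.streaming.stopGracefullyOnShutdown", "true")

    val ssc = new StreamingContext(sparkConf, Seconds(30))

    // ... define the input stream and the upload / state-update logic here ...

    ssc.start()
    ssc.awaitTermination()
  }
}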

When your job runs on YARN, you now need to send a SIGTERM to the application. This is usually done through yarn application -kill <appID>. That command does send a SIGTERM to your driver, but it follows up with a SIGKILL almost immediately, after the delay configured by yarn.nodemanager.sleep-delay-before-sigkill.ms (default: 250 ms).

Therefore, you instead want to make sure that only a SIGTERM is sent, e.g. by running:

ps -ef | grep spark | grep <DriverProgramName> | awk '{print $2}' | xargs kill -SIGTERM

This answer is based on blogs 1 and 2, which give you more details.

One of the articles also describes how to shut your application down gracefully through a marker file.
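For illustration, a minimal sketch of that marker-file pattern, assuming you pick an HDFS path for the marker (the path and polling interval here are arbitrary):

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.streaming.StreamingContext

object MarkerFileShutdown {
  // Poll for a marker file; once it appears, stop the streaming context
  // gracefully so the in-flight batch (upload + state update) completes first.
  def awaitMarkerAndStop(ssc: StreamingContext, markerPath: String, pollMs: Long = 10000L): Unit = {
    val fs = FileSystem.get(ssc.sparkContext.hadoopConfiguration)
    val marker = new Path(markerPath)
    var done = false
    while (!done) {
      // awaitTerminationOrTimeout returns true if the context has already stopped on its own.
      if (ssc.awaitTerminationOrTimeout(pollMs)) {
        done = true
      } else if (fs.exists(marker)) {
        ssc.stop(stopSparkContext = true, stopGracefully = true)
        done = true
      }
    }
  }
}

You would call MarkerFileShutdown.awaitMarkerAndStop(ssc, "hdfs:///tmp/stop-my-job") in place of ssc.awaitTermination(), and create the marker file whenever you want the job to drain its current batch and exit.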

Michael Heil