
In my application, a Java Spark context is created with an unavailable master URL (you may assume the master is down for maintenance). Creating the Java Spark context ends up stopping the JVM that runs the Spark driver, with JVM exit code 50.

When I checked the logs, I found SparkUncaughtExceptionHandler calling System.exit. My program should run forever. How should I overcome this issue?

I tried this scenario on Spark versions 1.4.1 and 1.6.0.

My code is given below:

package test.mains;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class CheckJavaSparkContext {

    /**
     * @param args the command line arguments
     */
    public static void main(String[] args) {

        SparkConf conf = new SparkConf();
        conf.setAppName("test");
        conf.setMaster("spark://sunshine:7077");

        try {
            new JavaSparkContext(conf);
        } catch (Throwable e) {
            System.out.println("Caught an exception : " + e.getMessage());
            //e.printStackTrace();
        }

        System.out.println("Waiting to complete...");
        while (true) {
        }
    }

}

Part of the output log

16/03/04 18:02:24 INFO SparkDeploySchedulerBackend: Shutting down all executors
16/03/04 18:02:24 INFO SparkDeploySchedulerBackend: Asking each executor to shut down
16/03/04 18:02:24 WARN AppClient$ClientEndpoint: Drop UnregisterApplication(null) because has not yet connected to master
16/03/04 18:02:24 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[appclient-registration-retry-thread,5,main]
java.lang.InterruptedException
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1039)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
    at scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:208)
    at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)
    at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
    at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
    at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
    at scala.concurrent.Await$.result(package.scala:107)
    at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
    at org.apache.spark.deploy.client.AppClient.stop(AppClient.scala:290)
    at org.apache.spark.scheduler.cluster.SparkDeploySchedulerBackend.org$apache$spark$scheduler$cluster$SparkDeploySchedulerBackend$$stop(SparkDeploySchedulerBackend.scala:198)
    at org.apache.spark.scheduler.cluster.SparkDeploySchedulerBackend.stop(SparkDeploySchedulerBackend.scala:101)
    at org.apache.spark.scheduler.TaskSchedulerImpl.stop(TaskSchedulerImpl.scala:446)
    at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1582)
    at org.apache.spark.SparkContext$$anonfun$stop$7.apply$mcV$sp(SparkContext.scala:1731)
    at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1229)
    at org.apache.spark.SparkContext.stop(SparkContext.scala:1730)
    at org.apache.spark.scheduler.cluster.SparkDeploySchedulerBackend.dead(SparkDeploySchedulerBackend.scala:127)
    at org.apache.spark.deploy.client.AppClient$ClientEndpoint.markDead(AppClient.scala:264)
    at org.apache.spark.deploy.client.AppClient$ClientEndpoint$$anon$2$$anonfun$run$1.apply$mcV$sp(AppClient.scala:134)
    at org.apache.spark.util.Utils$.tryOrExit(Utils.scala:1163)
    at org.apache.spark.deploy.client.AppClient$ClientEndpoint$$anon$2.run(AppClient.scala:129)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
16/03/04 18:02:24 INFO DiskBlockManager: Shutdown hook called
16/03/04 18:02:24 INFO ShutdownHookManager: Shutdown hook called
16/03/04 18:02:24 INFO ShutdownHookManager: Deleting directory /tmp/spark-ea68a0fa-4f0d-4dbb-8407-cce90ef78a52
16/03/04 18:02:24 INFO ShutdownHookManager: Deleting directory /tmp/spark-ea68a0fa-4f0d-4dbb-8407-cce90ef78a52/userFiles-db548748-a55c-4406-adcb-c09e63b118bd
Java Result: 50
1 Answer


If the master is down, the application will by itself try to connect to it three times, with a 20-second timeout, and it looks like these parameters are hardcoded and not configurable. If the application fails to connect, there is nothing more you can do than resubmit it once the master is up again.
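Until the master is back, one workaround on the application side is to wrap context creation in your own retry loop. A minimal sketch, with the `new JavaSparkContext(conf)` call replaced by a hypothetical `Callable` so the retry logic stands alone; note the caveat in the comments, since the uncaught-exception handler can still kill the JVM from a background thread:

```java
import java.util.concurrent.Callable;

public class RetryUntilUp {

    // Try `action` up to `maxAttempts` times, sleeping `waitMillis` between
    // failed attempts; rethrow the last failure if all attempts are exhausted.
    static <T> T retry(Callable<T> action, int maxAttempts, long waitMillis)
            throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.call();
            } catch (Exception e) {
                last = e;
                System.out.println("Attempt " + attempt + " failed: " + e.getMessage());
                if (attempt < maxAttempts) {
                    Thread.sleep(waitMillis);
                }
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical stand-in for `new JavaSparkContext(conf)`. This only
        // helps if the failure surfaces as an exception in this thread;
        // SparkUncaughtExceptionHandler may still call System.exit from a
        // background thread before the catch block runs, which is exactly
        // the behaviour described in the question.
        String status = retry(() -> "connected", 3, 20_000L);
        System.out.println(status);
    }
}
```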

That is why you should configure your cluster in high-availability mode. Spark Standalone supports two different recovery modes:

  • Standby Masters with ZooKeeper
  • Single-Node Recovery with Local File System

where the ZooKeeper-based option is the one suitable for production and useful in the described scenario.
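One of Spark Standalone's recovery modes uses ZooKeeper-backed standby masters; it is enabled by passing recovery properties to each master process. A sketch of the relevant `spark-env.sh` entries, where the ZooKeeper host names are placeholders for your own quorum:

```shell
# In spark-env.sh on every master node (zk1/zk2/zk3 are placeholders):
export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER \
  -Dspark.deploy.zookeeper.url=zk1:2181,zk2:2181,zk3:2181 \
  -Dspark.deploy.zookeeper.dir=/spark"
```

The driver then lists all masters in the URL, e.g. `conf.setMaster("spark://master1:7077,master2:7077")`, so it can register with whichever master is currently the leader.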

zero323
  • What is the reason for calling System.exit with exit code 50? It's unacceptable to do such a thing in third-party API code. When I checked the SparkUncaughtExceptionHandler.scala code, it contains this bad JVM-kill code. – era Mar 06 '16 at 18:34