
I am a Spark/YARN newbie and ran into exitCode=13 when I submit a Spark job on a YARN cluster. When the Spark job runs in local mode, everything is fine.

The command I used is:

    /usr/hdp/current/spark-client/bin/spark-submit --class com.test.sparkTest --master yarn --deploy-mode cluster --num-executors 40 --executor-cores 4 --driver-memory 17g --executor-memory 22g --files /usr/hdp/current/spark-client/conf/hive-site.xml /home/user/sparkTest.jar

Spark Error Log:

16/04/12 17:59:30 INFO Client:
         client token: N/A
         diagnostics: Application application_1459460037715_23007 failed 2 times due to AM Container for appattempt_1459460037715_23007_000002 exited with  exitCode: 13
For more detailed output, check application tracking page: http://b-r06f2-prod.phx2.cpe.net:8088/cluster/app/application_1459460037715_23007 Then, click on links to logs of each attempt.
Diagnostics: Exception from container-launch.
Container id: container_e40_1459460037715_23007_02_000001
Exit code: 13
Stack trace: ExitCodeException exitCode=13:
        at org.apache.hadoop.util.Shell.runCommand(Shell.java:576)
        at org.apache.hadoop.util.Shell.run(Shell.java:487)
        at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:753)
        at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)


**Yarn logs**

    16/04/12 23:55:35 INFO mapreduce.TableInputFormatBase: Input split length: 977 M bytes.
    16/04/12 23:55:41 INFO yarn.ApplicationMaster: Waiting for spark context initialization ...
    16/04/12 23:55:51 INFO yarn.ApplicationMaster: Waiting for spark context initialization ...
    16/04/12 23:56:01 INFO yarn.ApplicationMaster: Waiting for spark context initialization ...
    16/04/12 23:56:11 INFO yarn.ApplicationMaster: Waiting for spark context initialization ...
    16/04/12 23:56:11 INFO client.ConnectionManager$HConnectionImplementation: Closing zookeeper sessionid=0x152f0b4fc0e7488
    16/04/12 23:56:11 INFO zookeeper.ZooKeeper: Session: 0x152f0b4fc0e7488 closed
    16/04/12 23:56:11 INFO zookeeper.ClientCnxn: EventThread shut down
    16/04/12 23:56:11 INFO executor.Executor: Finished task 0.0 in stage 1.0 (TID 2). 2003 bytes result sent to driver
    16/04/12 23:56:11 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 1.0 (TID 2) in 82134 ms on localhost (2/3)
    16/04/12 23:56:17 INFO client.ConnectionManager$HConnectionImplementation: Closing zookeeper sessionid=0x4508c270df09803
    16/04/12 23:56:17 INFO zookeeper.ZooKeeper: Session: 0x4508c270df09803 closed
    ...
    16/04/12 23:56:21 ERROR yarn.ApplicationMaster: SparkContext did not initialize after waiting for 100000 ms. Please check earlier log output for errors. Failing the application.
    16/04/12 23:56:21 INFO yarn.ApplicationMaster: Final app status: FAILED, exitCode: 13, (reason: Timed out waiting for SparkContext.)
    16/04/12 23:56:21 INFO spark.SparkContext: Invoking stop() from shutdown hook
user_not_found
  • Could you share the yarn logs as well (not the whole logs, just the error messages in yarn logs)? – user1314742 Apr 11 '16 at 12:09
  • 4
    You could get yarn logs: `$ yarn logs -applicationId application_1459460037715_18191` – user1314742 Apr 11 '16 at 12:10
  • Thanks for the response. So it turns out exit code 10 was caused by a ClassNotFound issue. After a quick fix for that, I encountered the new issue with exit code 13 when the Spark job runs on the YARN cluster. It works well in local mode. I have updated the question as well as the logs so it won't confuse people :) – user_not_found Apr 13 '16 at 00:30
  • 1
    Have you set the master in your code? like doing `SparkConf.setMaster("local[*]")` ? – user1314742 Apr 13 '16 at 12:31
  • 1
    You are totally right! :) Thanks a lot. I have made the same issue before in another place and the exit code was 15. So when it's 13 this time, I didn't even look back my code as the log, so dumm. – user_not_found Apr 13 '16 at 17:40
  • Good.. I ll put as an answer so you could mark your question as answered :) – user1314742 Apr 13 '16 at 17:44

5 Answers


It seems that you have set the master in your code to be local:

    SparkConf.setMaster("local[*]")

You have to leave the master unset in the code and set it later when you issue spark-submit:

    spark-submit --master yarn-client ...
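As a minimal sketch of what the driver code should look like (the object and app names below are placeholders, not taken from the question), the configuration is built without any `setMaster` call, so the master passed on the spark-submit command line wins:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SparkTest {
  def main(args: Array[String]): Unit = {
    // Do NOT call setMaster here; let spark-submit supply the master
    // (e.g. --master yarn with --deploy-mode cluster or client).
    val conf = new SparkConf().setAppName("sparkTest")
    val sc = new SparkContext(conf)
    // ... job logic ...
    sc.stop()
  }
}
```

A hard-coded `setMaster("local[*]")` overrides the command-line `--master yarn`, so the YARN ApplicationMaster never sees a SparkContext register with it and fails with exit code 13 after the wait timeout.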

user1314742

If it helps someone:

Another possible cause of this error is passing the `--class` parameter incorrectly.
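For illustration, reusing the command from the question: `--class` must be the fully qualified name of the main class as it exists inside the jar, or the AM fails in the same way.

```shell
# --class must exactly match the fully qualified main class in the jar;
# a typo or a missing package prefix here also surfaces as exitCode 13.
spark-submit \
  --class com.test.sparkTest \
  --master yarn \
  --deploy-mode cluster \
  /home/user/sparkTest.jar
```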


I had exactly the same problem, but the above answer didn't work for me. Alternatively, when I ran it with `spark-submit --deploy-mode client`, everything worked fine.

Sahas
  • Does anyone understand the reason for this? – Omkar Neogi Jan 30 '20 at 21:58
  • Yes, this solved my problem. I was using `spark-submit --deploy-mode cluster`, but when I changed it to `client`, it worked fine. In my case, I was executing SQL scripts using a python code, so my code was not "spark dependent", but I am not sure what will be the implications of doing this when you want multiprocessing. – Sajal Apr 12 '21 at 14:18
  • --deploy-mode client only runs spark on the master driver node. This does not use Spark's worker (core & task) nodes. Instead, use --deploy-mode cluster to distribute work across workers. – BeerIsGood Aug 25 '21 at 07:58
  • @BeerIsGood, that's only true of the single-threaded code you run. Any actual spark operations (reads, writes, maps, filters, etc) are distributed by the master node across the entire cluster, even in client mode. The difference between client and cluster modes is how the work gets submitted to the cluster and which nodes get used for what. – Nolan Barth May 16 '22 at 16:56
  • In case, it's not obvious, piecing several answers together, you can get this error when your jars aren't available on the worker nodes but try to access them there (via cluster mode). If they are available on the "master" node, then switching to client mode will work. It's a bad error message, made more confusing if you happen to be submitting "Steps" remotely to an AWS EMR cluster from another machine (or something other cloud provider/managed spark/hadoop service). Because from that perspective you're accessing the "master" node as a server with your AWS CLI client. – combinatorist Nov 29 '22 at 18:22

This exit code 13 is a tricky one...

For me it was a `SyntaxError: invalid syntax` in one of the scripts imported downstream of the spark-submit call.

When debugging this on AWS, if spark-submit did not initialize properly, you will not find the error on the Spark History Server; you have to find it in the container logs: EMR UI Console -> Summary -> Log URI -> Containers -> application_xxx_xxx -> container_yyy_yy_yy -> stdout.gz.

Stempler

I got this same error running a SparkSQL job in cluster mode. None of the other solutions worked for me, but looking in the job history server logs in Hadoop I found this stack trace.

20/02/05 23:01:24 INFO hive.metastore: Connected to metastore.
20/02/05 23:03:03 ERROR yarn.ApplicationMaster: Uncaught exception: 
java.util.concurrent.TimeoutException: Futures timed out after [100000 milliseconds]
    at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:223)
    at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:227)
    at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:220)
    at org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:468)
    at org.apache.spark.deploy.yarn.ApplicationMaster.org$apache$spark$deploy$yarn$ApplicationMaster$$runImpl(ApplicationMaster.scala:305)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply$mcV$sp(ApplicationMaster.scala:245)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply(ApplicationMaster.scala:245)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply(ApplicationMaster.scala:245)
...


Looking at the Spark source code, you'll find that the AM timed out waiting for the `spark.driver.port` property to be set by the thread executing the user class.
So it could either be a transient issue, or you should investigate your code for the cause of the timeout.
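If the timeout looks transient (e.g. a slow metastore connection delaying driver startup, as in the log above), the 100000 ms in the error corresponds to `spark.yarn.am.waitTime`, which applies in cluster mode and can be raised at submit time. A sketch, with the class and jar path reused from the question as placeholders:

```shell
# Give the YARN ApplicationMaster longer to wait for SparkContext
# initialization (cluster mode only; the default is 100s).
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.yarn.am.waitTime=300s \
  --class com.test.sparkTest \
  /home/user/sparkTest.jar
```

This only helps when initialization is genuinely slow; if the SparkContext never initializes at all (wrong master, missing class, syntax error), raising the wait time just delays the same failure.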

sbrk