
I am a Spark/YARN newbie and ran into exitCode=13 when I submit a Spark job on a YARN cluster. When the Spark job runs in local mode, everything is fine.

The command I used is:

    /usr/hdp/current/spark-client/bin/spark-submit --class com.test.sparkTest --master yarn --deploy-mode cluster --num-executors 40 --executor-cores 4 --driver-memory 17g --executor-memory 22g --files /usr/hdp/current/spark-client/conf/hive-site.xml /home/user/sparkTest.jar

Spark Error Log:

16/04/12 17:59:30 INFO Client:
         client token: N/A
         diagnostics: Application application_1459460037715_23007 failed 2 times due to AM Container for appattempt_1459460037715_23007_000002 exited with  exitCode: 13
For more detailed output, check application tracking page: http://b-r06f2-prod.phx2.cpe.net:8088/cluster/app/application_1459460037715_23007 Then, click on links to logs of each attempt.
Diagnostics: Exception from container-launch.
Container id: container_e40_1459460037715_23007_02_000001
Exit code: 13
Stack trace: ExitCodeException exitCode=13:
        at org.apache.hadoop.util.Shell.runCommand(Shell.java:576)
        at org.apache.hadoop.util.Shell.run(Shell.java:487)
        at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:753)
        at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)


**Yarn logs**

    16/04/12 23:55:35 INFO mapreduce.TableInputFormatBase: Input split length: 977 M bytes.
    16/04/12 23:55:41 INFO yarn.ApplicationMaster: Waiting for spark context initialization ...
    16/04/12 23:55:51 INFO yarn.ApplicationMaster: Waiting for spark context initialization ...
    16/04/12 23:56:01 INFO yarn.ApplicationMaster: Waiting for spark context initialization ...
    16/04/12 23:56:11 INFO yarn.ApplicationMaster: Waiting for spark context initialization ...
    16/04/12 23:56:11 INFO client.ConnectionManager$HConnectionImplementation: Closing zookeeper sessionid=0x152f0b4fc0e7488
    16/04/12 23:56:11 INFO zookeeper.ZooKeeper: Session: 0x152f0b4fc0e7488 closed
    16/04/12 23:56:11 INFO zookeeper.ClientCnxn: EventThread shut down
    16/04/12 23:56:11 INFO executor.Executor: Finished task 0.0 in stage 1.0 (TID 2). 2003 bytes result sent to driver
    16/04/12 23:56:11 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 1.0 (TID 2) in 82134 ms on localhost (2/3)
    16/04/12 23:56:17 INFO client.ConnectionManager$HConnectionImplementation: Closing zookeeper sessionid=0x4508c270df09803
    16/04/12 23:56:17 INFO zookeeper.ZooKeeper: Session: 0x4508c270df09803 closed
    ...
    16/04/12 23:56:21 ERROR yarn.ApplicationMaster: SparkContext did not initialize after waiting for 100000 ms. Please check earlier log output for errors. Failing the application.
    16/04/12 23:56:21 INFO yarn.ApplicationMaster: Final app status: FAILED, exitCode: 13, (reason: Timed out waiting for SparkContext.)
    16/04/12 23:56:21 INFO spark.SparkContext: Invoking stop() from shutdown hook
user_not_found
  • Could you share the yarn logs as well (not the whole logs, just the error messages in yarn logs)? – user1314742 Apr 11 '16 at 12:09
  • 4
    You could get yarn logs: `$ yarn logs -applicationId application_1459460037715_18191` – user1314742 Apr 11 '16 at 12:10
  • Thanks for the response. So it turns out exit code 10 was caused by a ClassNotFound issue. After a quick fix for that, I encountered the new issue with exit code 13 when the Spark job runs on the YARN cluster. It works well in local mode. I have updated the question as well as the logs so it won't confuse people :) – user_not_found Apr 13 '16 at 00:30
  • 1
    Have you set the master in your code? like doing `SparkConf.setMaster("local[*]")` ? – user1314742 Apr 13 '16 at 12:31
  • 1
    You are totally right! :) Thanks a lot. I have made the same issue before in another place and the exit code was 15. So when it's 13 this time, I didn't even look back my code as the log, so dumm. – user_not_found Apr 13 '16 at 17:40
  • Good.. I ll put as an answer so you could mark your question as answered :) – user1314742 Apr 13 '16 at 17:44

5 Answers


It seems that you have set the master in your code to be local:

    SparkConf.setMaster("local[*]")

You have to leave the master unset in the code and set it later when you issue spark-submit:

    spark-submit --master yarn-client ...
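As a minimal sketch of what the driver code should look like (the object and app names below are placeholders, not taken from the question), the configuration is built without any `setMaster` call, so the master passed on the spark-submit command line wins:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SparkTest {
  def main(args: Array[String]): Unit = {
    // Do NOT call setMaster here; let spark-submit supply the master
    // (e.g. --master yarn with --deploy-mode cluster or client).
    val conf = new SparkConf().setAppName("sparkTest")
    val sc = new SparkContext(conf)
    // ... job logic ...
    sc.stop()
  }
}
```

A hard-coded `setMaster("local[*]")` overrides the command-line `--master yarn`, so the YARN ApplicationMaster never sees a SparkContext register with it and fails with exit code 13 after the wait timeout.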

user1314742

If it helps someone:

Another possible cause of this error is passing the `--class` parameter incorrectly.
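For illustration, reusing the command from the question: `--class` must be the fully qualified name of the main class as it exists inside the jar, or the AM fails in the same way.

```shell
# --class must exactly match the fully qualified main class in the jar;
# a typo or a missing package prefix here also surfaces as exitCode 13.
spark-submit \
  --class com.test.sparkTest \
  --master yarn \
  --deploy-mode cluster \
  /home/user/sparkTest.jar
```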


I had exactly the same problem, but the above answer didn't work for me. Alternatively, when I ran it with `spark-submit --deploy-mode client`, everything worked fine.

Sahas
  • Does anyone understand the reason for this? – Omkar Neogi Jan 30 '20 at 21:58
  • Yes, this solved my problem. I was using `spark-submit --deploy-mode cluster`, but when I changed it to `client`, it worked fine. In my case, I was executing SQL scripts using a python code, so my code was not "spark dependent", but I am not sure what will be the implications of doing this when you want multiprocessing. – Sajal Apr 12 '21 at 14:18
  • --deploy-mode client only runs spark on the master driver node. This does not use Spark's worker (core & task) nodes. Instead, use --deploy-mode cluster to distribute work across workers. – BeerIsGood Aug 25 '21 at 07:58
  • @BeerIsGood, that's only true of the single-threaded code you run. Any actual spark operations (reads, writes, maps, filters, etc) are distributed by the master node across the entire cluster, even in client mode. The difference between client and cluster modes is how the work gets submitted to the cluster and which nodes get used for what. – Nolan Barth May 16 '22 at 16:56
  • In case, it's not obvious, piecing several answers together, you can get this error when your jars aren't available on the worker nodes but try to access them there (via cluster mode). If they are available on the "master" node, then switching to client mode will work. It's a bad error message, made more confusing if you happen to be submitting "Steps" remotely to an AWS EMR cluster from another machine (or something other cloud provider/managed spark/hadoop service). Because from that perspective you're accessing the "master" node as a server with your AWS CLI client. – combinatorist Nov 29 '22 at 18:22

This exit code 13 is a tricky one...

For me it was a `SyntaxError: invalid syntax` in one of the scripts imported downstream of the spark-submit call.

When debugging this on AWS, if spark-submit did not initialize properly, you will not find the error on the Spark History Server; you have to find it in the container logs: EMR UI Console -> Summary -> Log URI -> Containers -> application_xxx_xxx -> container_yyy_yy_yy -> stdout.gz.

Stempler

I got this same error running a SparkSQL job in cluster mode. None of the other solutions worked for me, but looking in the job history server logs in Hadoop I found this stack trace.

20/02/05 23:01:24 INFO hive.metastore: Connected to metastore.
20/02/05 23:03:03 ERROR yarn.ApplicationMaster: Uncaught exception: 
java.util.concurrent.TimeoutException: Futures timed out after [100000 milliseconds]
    at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:223)
    at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:227)
    at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:220)
    at org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:468)
    at org.apache.spark.deploy.yarn.ApplicationMaster.org$apache$spark$deploy$yarn$ApplicationMaster$$runImpl(ApplicationMaster.scala:305)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply$mcV$sp(ApplicationMaster.scala:245)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply(ApplicationMaster.scala:245)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply(ApplicationMaster.scala:245)
...


Looking at the Spark source code, you'll find that the AM timed out waiting for the `spark.driver.port` property to be set by the thread executing the user class.
So it could either be a transient issue, or you should investigate your code for the cause of the timeout.
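If the timeout looks transient (e.g. a slow metastore connection delaying driver startup, as in the log above), the 100000 ms in the error corresponds to `spark.yarn.am.waitTime`, which applies in cluster mode and can be raised at submit time. A sketch, with the class and jar path reused from the question as placeholders:

```shell
# Give the YARN ApplicationMaster longer to wait for SparkContext
# initialization (cluster mode only; the default is 100s).
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.yarn.am.waitTime=300s \
  --class com.test.sparkTest \
  /home/user/sparkTest.jar
```

This only helps when initialization is genuinely slow; if the SparkContext never initializes at all (wrong master, missing class, syntax error), raising the wait time just delays the same failure.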

sbrk