33

I'm trying to run a Spark application using bin/spark-submit. When I reference my application jar on my local filesystem, it works. However, when I copied my application jar to a directory in HDFS, I get the following exception:

Warning: Skip remote jar hdfs://localhost:9000/user/hdfs/jars/simple-project-1.0-SNAPSHOT.jar. java.lang.ClassNotFoundException: com.example.SimpleApp

Here's the command:

$ ./bin/spark-submit --class com.example.SimpleApp --master local hdfs://localhost:9000/user/hdfs/jars/simple-project-1.0-SNAPSHOT.jar

I'm using Hadoop 2.6.0 and Spark 1.2.1.

OneCricketeer
dilm
  • what did you finally decide here? Did you switch to YARN or find another workaround? Sanjiv, below, was pointing at a bug that seems peripherally relevant. Did you try --deploy-mode cluster? Thanks, interesting bug if it's really a bug, and it doesn't seem to have been directly submitted to JIRA. Perhaps check [this](https://issues.apache.org/jira/browse/SPARK-10643) – JimLohse Feb 23 '16 at 13:33

5 Answers

23

The only way it worked for me was when I used:

--master yarn-cluster
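
For context, a complete invocation along these lines might look like the sketch below; it reuses the class name and HDFS path from the question, and assumes a working YARN setup (HADOOP_CONF_DIR pointing at your cluster configuration):

# sketch: submit an application jar stored on HDFS in yarn-cluster mode (Spark 1.x syntax)
$ ./bin/spark-submit \
    --class com.example.SimpleApp \
    --master yarn-cluster \
    hdfs://localhost:9000/user/hdfs/jars/simple-project-1.0-SNAPSHOT.jar

In yarn-cluster mode the driver runs inside the cluster rather than on the submitting machine, which is why it can fetch the application jar from HDFS.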

Romain
    What if they don't want to use YARN? I see this is the accepted answer yet the OP was trying to use local[*]? Eeen-teresting. – JimLohse Feb 23 '16 at 13:32
  • --master yarn-cluster is not working for me. Following is my snippet of the logs: Apr 11, 2018 9:22:20 AM org.apache.spark.launcher.OutputRedirector redirect INFO: master yarn-cluster Apr 11, 2018 9:22:20 AM org.apache.spark.launcher.OutputRedirector redirect INFO: deployMode cluster Apr 11, 2018 9:22:20 AM org.apache.spark.launcher.OutputRedirector redirect INFO: Warning: Skip remote jar hdfs://locahlost/user/MyUser/Sample-1.0.1Manish-SNAPSHOT.jar. – Bay Max Apr 11 '18 at 04:08
10

To make a jar on HDFS accessible to your Spark job, you have to run the job in cluster mode:

$SPARK_HOME/bin/spark-submit \
--deploy-mode cluster \
--class <main_class> \
--master yarn-cluster \
hdfs://myhost:8020/user/root/myjar.jar

Also, there is a Spark JIRA raised for client mode, which is not supported yet:

SPARK-10643: Support HDFS application download in client mode spark submit
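
As a side note, --master yarn-cluster is deprecated from Spark 2.0 onward; the equivalent invocation (same placeholders as the command above) splits the master and the deploy mode:

# sketch: Spark 2.0+ equivalent of the yarn-cluster master string
$SPARK_HOME/bin/spark-submit \
--master yarn \
--deploy-mode cluster \
--class <main_class> \
hdfs://myhost:8020/user/root/myjar.jar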

enrique-carbonell
Sanjiv
  • Nice answer to me this should be accepted :) but you are not showing cluster mode, you are showing yarn, you need `--deploy-mode cluster` and `--master spark://yourmaster:7077` instead of `--master yarn-cluster`? If the OP said he's using YARN I missed it, though I guess HDFS is a good clue. I think, as stated, the OP is trying to use the Spark job manager and finding a bug with local mode? – JimLohse Feb 23 '16 at 13:42
1

There is a workaround. You could mount the directory in HDFS (which contains your application jar) as a local directory.

I did the same (with Azure Blob Storage, but it should be similar for HDFS).

Example command for Azure wasb:

sudo mount -t cifs //{storageAccountName}.file.core.windows.net/{directoryName} {local directory path} -o vers=3.0,username={storageAccountName},password={storageAccountKey},dir_mode=0777,file_mode=0777

Now, in your spark-submit command, you provide the path from the command above:

$ ./bin/spark-submit --class com.example.SimpleApp --master local {local directory path}/simple-project-1.0-SNAPSHOT.jar
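
If you go this route, a couple of sanity checks help before submitting; this is only a sketch, and /mnt/jars is a hypothetical stand-in for {local directory path}:

# the mount point must exist before you run the mount command above
sudo mkdir -p /mnt/jars
# after mounting, the jar should be listable through the local path
ls -l /mnt/jars/simple-project-1.0-SNAPSHOT.jar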

OneCricketeer
0

spark-submit --master spark://kssr-virtual-machine:7077 --deploy-mode client --executor-memory 1g hdfs://localhost:9000/user/wordcount.py

For me it is working. I am using Hadoop 3.3.1 and Spark 3.2.1, and I am able to read the file from HDFS.
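
If you want to rule out a path problem first, you can check that the file is actually present on HDFS before submitting (a quick sketch reusing the path from the command above):

# list the application file through the same HDFS URI used in spark-submit
hdfs dfs -ls hdfs://localhost:9000/user/wordcount.py

That this works in client mode on Spark 3.2.1 is consistent with newer releases handling remote application paths that the 1.x versions discussed above did not.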

Kumar Sanu
-2

Yes, it has to be a local file. I think that's simply the answer.

Sean Owen
    But in the [official documentation](https://spark.apache.org/docs/1.2.1/submitting-applications.html), it stated there that: "application-jar: Path to a bundled jar including your application and all dependencies. The URL must be globally visible inside of your cluster, for instance, an **hdfs:// path** or a file:// path that is present on all nodes." – dilm Feb 27 '15 at 01:30
  • @dilm good point. That is worth a question to the user@ mailing list. From skimming the code, it looks like it specifically only allows local files – Sean Owen Feb 27 '15 at 01:58
  • Thanks. I'll try the mailing list for now. – dilm Feb 27 '15 at 02:02
    Was there an answer on the mailing lists? – Michael Lloyd Lee mlk Oct 21 '15 at 10:17
  • You have to use --master yarn-cluster in your spark-submit, provided that you use YARN as your cluster manager. – dilm Nov 04 '15 at 00:57
  • The mailing list is not that useful, when there's an answer it's great but so many questions go unanswered! They need gamification like SO, really seems to work. Meanwhile the answer from Sanjiv seems like it has identified [SPARK-10643](https://issues.apache.org/jira/browse/SPARK-10643) which deals with this, so you must use --deploy-mode cluster explicitly. Of course local[*] won't work with that. But that bug, now that I look at it, doesn't seem to deal with this directly. – JimLohse Feb 23 '16 at 13:31