
I've been trying to submit a simple Python script to run on a cluster with YARN. When I execute the job locally there's no problem, everything works fine, but when I run it on the cluster it fails.

I submitted the job with the following command:

spark-submit --master yarn --deploy-mode cluster test.py

The error log I'm receiving is the following:

17/11/07 13:02:48 INFO yarn.Client: Application report for application_1510046813642_0010 (state: ACCEPTED)
17/11/07 13:02:49 INFO yarn.Client: Application report for application_1510046813642_0010 (state: ACCEPTED)
17/11/07 13:02:50 INFO yarn.Client: Application report for application_1510046813642_0010 (state: FAILED)
17/11/07 13:02:50 INFO yarn.Client: 
     client token: N/A
     diagnostics: Application application_1510046813642_0010 failed 2 times due to AM Container for appattempt_1510046813642_0010_000002 exited with  exitCode: -1000
For more detailed output, check application tracking page:http://myserver:8088/proxy/application_1510046813642_0010/Then, click on links to logs of each attempt.
**Diagnostics: File does not exist: hdfs://myserver:8020/user/josholsan/.sparkStaging/application_1510046813642_0010/test.py**
java.io.FileNotFoundException: File does not exist: hdfs://myserver:8020/user/josholsan/.sparkStaging/application_1510046813642_0010/test.py
    at org.apache.hadoop.hdfs.DistributedFileSystem$20.doCall(DistributedFileSystem.java:1266)
    at org.apache.hadoop.hdfs.DistributedFileSystem$20.doCall(DistributedFileSystem.java:1258)
    at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
    at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1258)
    at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:251)
    at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:61)
    at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:359)
    at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:357)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1917)
    at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:356)
    at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:60)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

Failing this attempt. Failing the application.
     ApplicationMaster host: N/A
     ApplicationMaster RPC port: -1
     queue: root.users.josholsan
     start time: 1510056155796
     final status: FAILED
     tracking URL: http://myserver:8088/cluster/app/application_1510046813642_0010
     user: josholsan
Exception in thread "main" org.apache.spark.SparkException: Application application_1510046813642_0010 finished with failed status
    at org.apache.spark.deploy.yarn.Client.run(Client.scala:1025)
    at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1072)
    at org.apache.spark.deploy.yarn.Client.main(Client.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:730)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
17/11/07 13:02:50 INFO util.ShutdownHookManager: Shutdown hook called
17/11/07 13:02:50 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-5cc8bf5e-216b-4d9e-b66d-9dc01a94e851

I paid special attention to this line:

Diagnostics: File does not exist: hdfs://myserver:8020/user/josholsan/.sparkStaging/application_1510046813642_0010/test.py

I don't know why it can't find test.py. I also tried putting it in HDFS, under the directory of the user executing the job: /user/josholsan/
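
In case it's relevant, this is roughly how I copied it there (just a plain HDFS put into my user directory):

hdfs dfs -put test.py /user/josholsan/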

To finish my post, I would also like to share my test.py script:

from pyspark import SparkContext

file="/user/josholsan/concepts_copy.csv"
sc = SparkContext("local","Test app")
textFile = sc.textFile(file).cache()

linesWithOMOP=textFile.filter(lambda line: "OMOP" in line).count()
linesWithICD=textFile.filter(lambda line: "ICD" in line).count()

print("Lines with OMOP: %i, lines with ICD9: %i" % (linesWithOMOP,linesWithICD))

Could the error also be in here?

sc = SparkContext("local","Test app")

Thank you so much in advance for your help.

  • Is the path visible to the whole cluster? In Mesos you use the `--py-files` argument to submit additional Python resources, and to run the `test.py` file you need to give a path that every worker can access and see, e.g. `spark-submit --master yarn --deploy-mode cluster http://myfiles/test.py` – mkaran Nov 07 '17 at 12:29
  • Also, what version of PySpark are you using? There is another way to instantiate the SparkContext (also, afaik, `"local"` should not be specified because it overrides command-line settings) – mkaran Nov 07 '17 at 12:33
  • I'm using Spark 1.6 with Python 2.7. You are right. First of all I connected each machine in the cluster with all the others for the first time, and then launched the job again. This time it also failed, but with a different error (without much information). Then I edited my script to do `sc = SparkContext()` and this time everything worked fine. One last question: I'm pretty new to these things, so how can I retrieve the result of the job, since it's just a print? Or should I save it to a file or something like that to have a result? Thank you so much again. – Jose LHS Nov 07 '17 at 12:50
  • Btw, can you post your comment as an answer so I can mark it as the solution? – Jose LHS Nov 07 '17 at 12:52
  • I'm glad it helped, I will put it as an answer shortly :) As for the results, I am not very familiar with running on `yarn`, but after a search I found that there is a command, `yarn logs`, that will display the output of the job. More [here](https://stackoverflow.com/a/26322577/3433323) – mkaran Nov 07 '17 at 13:16

1 Answer


Transferred from the comments section:

  • sc = SparkContext("local","Test app"): having "local" here will override any command-line settings (see the sketch after this list); from the docs:

    Any values specified as flags or in the properties file will be passed on to the application and merged with those specified through SparkConf. Properties set directly on the SparkConf take highest precedence, then flags passed to spark-submit or spark-shell, then options in the spark-defaults.conf file.

  • The test.py file must be placed somewhere where it is visible throughout the whole cluster. E.g. spark-submit --master yarn --deploy-mode cluster http://somewhere/accessible/to/master/and/workers/test.py
  • Any additional files and resources can be specified using the --py-files argument (tested in mesos, not in yarn unfortunately), e.g. --py-files http://somewhere/accessible/to/all/extra_python_code_my_code_uses.zip

    Edit: as @desertnaut commented, this argument should be used before the script to be executed (see the combined example after this list).

  • yarn logs -applicationId <app ID> will give you the output of your submitted job.
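
Putting the points above together, the submit command would look something along these lines (the http://somewhere/... paths are just placeholders for a location every node can reach):

spark-submit --master yarn --deploy-mode cluster --py-files http://somewhere/accessible/to/all/extra_python_code_my_code_uses.zip http://somewhere/accessible/to/master/and/workers/test.py

And a minimal sketch of how the script itself could create its context without hardcoding "local", so that whatever you pass to spark-submit (or set in spark-defaults.conf) actually takes effect; the app name here is just an example:

from pyspark import SparkConf, SparkContext

# Don't set the master in code; let spark-submit / spark-defaults.conf decide it.
conf = SparkConf().setAppName("Test app")
sc = SparkContext(conf=conf)

For the failed run shown above, the logs command would be yarn logs -applicationId application_1510046813642_0010.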

Hope this helps, good luck!

  • Good advice (+1, added the docs link & excerpt). Just to add that any `--py-files` argument in the command line should go before the script to be executed: https://stackoverflow.com/questions/47101375/how-to-use-external-custom-package-in-pyspark/47102599#47102599 – desertnaut Nov 07 '17 at 13:48
  • @desertnaut Thanks! Edited and added your comment in the body of the answer. – mkaran Nov 07 '17 at 14:12