
I tried the sc.addFile option (it works without any issues) and the --files option from the command line (it fails).

Run 1 : spark_distro.py

from pyspark import SparkContext, SparkConf
from pyspark import SparkFiles

def import_my_special_package(x):
    from external_package import external
    ext = external()
    return ext.fun(x)

conf = SparkConf().setAppName("Using External Library")
sc = SparkContext(conf=conf)
sc.addFile("/local-path/readme.txt")
with open(SparkFiles.get('readme.txt')) as test_file:
    lines = [line.strip() for line in test_file]
print(lines)
int_rdd = sc.parallelize([1, 2, 4, 3])
mod_rdd = sorted(int_rdd.filter(lambda z: z % 2 == 1).map(lambda x: import_my_special_package(x)).collect())

External package: external_package.py

class external(object):
    def __init__(self):
        pass
    def fun(self, input):
        return input*2

readme.txt

MY TEXT HERE

spark-submit command

spark-submit \
  --master yarn-client \
  --py-files /path to local codelib/external_package.py  \
  /local-pgm-path/spark_distro.py  \
  1000

Output: Working as expected

['MY TEXT HERE']

But if I try to pass the file (readme.txt) from the command line using the --files option (instead of sc.addFile), it fails, as shown below.

Run 2 : spark_distro.py

from pyspark import SparkContext, SparkConf
from pyspark import SparkFiles

def import_my_special_package(x):
    from external_package import external
    ext = external()
    return ext.fun(x)

conf = SparkConf().setAppName("Using External Library")
sc = SparkContext(conf=conf)
with open(SparkFiles.get('readme.txt')) as test_file:
    lines = [line.strip() for line in test_file]
print(lines)
int_rdd = sc.parallelize([1, 2, 4, 3])
mod_rdd = sorted(int_rdd.filter(lambda z: z % 2 == 1).map(lambda x: import_my_special_package(x)).collect())

external_package.py: same as above

spark-submit command

spark-submit \
  --master yarn-client \
  --py-files /path to local codelib/external_package.py  \
  --files /local-path/readme.txt#readme.txt  \
  /local-pgm-path/spark_distro.py  \
  1000

Output:

Traceback (most recent call last):
  File "/local-pgm-path/spark_distro.py", line 31, in <module>
    with open(SparkFiles.get('readme.txt')) as test_file:
IOError: [Errno 2] No such file or directory: u'/tmp/spark-42dff0d7-c52f-46a8-8323-08bccb412cd6/userFiles-8bd16297-1291-4a37-b080-bbc3836cb512/readme.txt'

Are sc.addFile and --files used for the same purpose? Can someone please share your thoughts?

goks
  • Just out of curiosity, why do you keep including `1000` in `spark-submit`? True, it is used in the examples, but only because there it is indeed an expected argument of [`SparkPi.scala`](https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/SparkPi.scala) and [`pi.py`](https://github.com/apache/spark/blob/master/examples/src/main/python/pi.py), which is not the case with your `spark_distro.py`... – desertnaut Nov 09 '17 at 13:00
  • @desertnaut The error message above says it is looking for the 'readme.txt' file in the driver path "u'/tmp/spark-42dff0d7-c52f-46a8-8323-08bccb412cd6/userFiles-8bd16297-1291-4a37-b080-bbc3836cb512/readme.txt'", and in the logs my local file ('readme.txt') is getting copied to the driver node location (using the '--files' option). The spark-submit command above is using 'yarn-client'. Would changing this to 'yarn-cluster' make any difference? – goks Nov 09 '17 at 17:23
  • Does not (checked it already) – desertnaut Nov 09 '17 at 17:24
  • I have started suspecting that the two methods (`sc.addFile` & `--files`) are not equivalent - in any case, I have not managed to make `--files` work, nor can I find any reference online from someone who has in the past... – desertnaut Nov 09 '17 at 17:30
  • @desertnaut What problem have you faced with `--files`? – Thang Nguyen Nov 10 '17 at 01:01
  • @cue It simply doesn't work. I can see the file uploaded in the `.sparkStaging` dir, but I always get the error reported above – desertnaut Nov 10 '17 at 11:23
  • @desertnaut 'SparkFiles.get' is referring to the master node, but the actual file is copied to the default staging path. If we somehow point to the staging path, I think we can resolve the issue. I will try to use 'os.environ' and update this thread. – goks Nov 10 '17 at 13:21
  • @GokulkrishnaSurapureddy Stand by - I have figured it out and am finishing my answer... – desertnaut Nov 10 '17 at 13:32

1 Answer


I have finally figured out the issue, and it is a very subtle one indeed.

As suspected, the two options (sc.addFile and --files) are not equivalent, and this is (admittedly very subtly) hinted at in the documentation (emphasis added):

addFile(path, recursive=False)
Add a file to be downloaded with this Spark job on every node.

--files FILES
Comma-separated list of files to be placed in the working directory of each executor.

In plain English, while files added with sc.addFile are available to both the executors and the driver, files added with --files are available only to the executors; hence, when trying to access them from the driver (as is the case in the OP), we get a No such file or directory error.

Let's confirm this (getting rid of all the irrelevant --py-files and 1000 stuff in the OP):

test_fail.py:

from pyspark import SparkContext, SparkConf
from pyspark import SparkFiles

conf = SparkConf().setAppName("Use External File")
sc = SparkContext(conf=conf)
with open(SparkFiles.get('readme.txt')) as test_file:  
    lines = [line.strip() for line in test_file]
print(lines)

Test:

spark-submit --master yarn \
             --deploy-mode client \
             --files /home/ctsats/readme.txt \
             /home/ctsats/scripts/SO/test_fail.py

Result:

[...]
17/11/10 15:05:39 INFO yarn.Client: Uploading resource file:/home/ctsats/readme.txt -> hdfs://host-hd-01.corp.nodalpoint.com:8020/user/ctsats/.sparkStaging/application_1507295423401_0047/readme.txt
[...]
Traceback (most recent call last):
  File "/home/ctsats/scripts/SO/test_fail.py", line 6, in <module>
    with open(SparkFiles.get('readme.txt')) as test_file:
IOError: [Errno 2] No such file or directory: u'/tmp/spark-8715b4d9-a23b-4002-a1f0-63a1e9d3e00e/userFiles-60053a41-472e-4844-a587-6d10ed769e1a/readme.txt'

In the above script test_fail.py, it is the driver program that requests access to the file readme.txt; let's change the script, so that access is requested for the executors (test_success.py):

from pyspark import SparkContext, SparkConf

conf = SparkConf().setAppName("Use External File")
sc = SparkContext(conf=conf)

lines = sc.textFile("readme.txt") # run in the executors
print(lines.collect())

Test:

spark-submit --master yarn \
             --deploy-mode client \
             --files /home/ctsats/readme.txt \
             /home/ctsats/scripts/SO/test_success.py

Result:

[...]
17/11/10 15:16:05 INFO yarn.Client: Uploading resource file:/home/ctsats/readme.txt -> hdfs://host-hd-01.corp.nodalpoint.com:8020/user/ctsats/.sparkStaging/application_1507295423401_0049/readme.txt
[...]
[u'MY TEXT HERE']

Notice also that here we don't need SparkFiles.get - the file is readily accessible.
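
As a further illustration (a sketch only, not verified in the runs above), the same file shipped with --files could also be opened directly from inside the tasks, relying on the documented fact that --files places it in the working directory of each executor; the app name and driver logic below are illustrative:

from pyspark import SparkContext, SparkConf

conf = SparkConf().setAppName("Use External File")
sc = SparkContext(conf=conf)

def read_from_working_dir(_):
    # 'readme.txt' is expected in the executor's working directory,
    # because it was shipped with --files
    with open('readme.txt') as f:
        return [line.strip() for line in f]

# the read happens inside the executors, not in the driver
print(sc.parallelize([0]).flatMap(read_from_working_dir).collect())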

As said above, sc.addFile will work in both cases, i.e. when access is requested either by the driver or by the executors (tested but not shown here).
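
For reference, a minimal sketch of such a test might look as follows (the local path is illustrative, and the snippet is not part of the runs shown above):

from pyspark import SparkContext, SparkConf
from pyspark import SparkFiles

conf = SparkConf().setAppName("Use External File")
sc = SparkContext(conf=conf)
sc.addFile("/home/ctsats/readme.txt")  # illustrative local path

# access from the driver:
with open(SparkFiles.get('readme.txt')) as test_file:
    print([line.strip() for line in test_file])

# access from the executors:
def read_in_task(_):
    with open(SparkFiles.get('readme.txt')) as test_file:
        return [line.strip() for line in test_file]

print(sc.parallelize([0]).flatMap(read_in_task).collect())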

Regarding the order of the command-line options: as I have argued elsewhere, all Spark-related arguments must come before the script to be executed; arguably, the relative order of --files and --py-files is irrelevant (left as an exercise).

Tested with both Spark 1.6.0 & 2.2.0.

UPDATE (after the comments): Seems that my fs.defaultFS setting points to HDFS, too:

$ hdfs getconf -confKey fs.defaultFS
hdfs://host-hd-01.corp.nodalpoint.com:8020

But let me focus on the forest here (instead of the trees, that is), and explain why this whole discussion is of academic interest only:

Passing files to be processed with the --files flag is bad practice; in hindsight, I can now see why I could find almost no usage references online - probably nobody uses it in practice, and with good reason.

(Notice that I am not talking about --py-files, which serves a different, legitimate role.)

Since Spark is a distributed processing framework, running over a cluster and a distributed file system (HDFS), the best thing to do is to have all the files to be processed already in HDFS - period. The "natural" place for files to be processed by Spark is HDFS, not the local FS - although there are some toy examples using the local FS for demonstration purposes only. What's more, if at some point in the future you want to change the deploy mode to cluster, you'll discover that the cluster, by default, knows nothing of local paths and files, and rightfully so...
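
To make this last point concrete, here is a minimal sketch of the recommended approach, assuming readme.txt has already been uploaded to HDFS beforehand (e.g. with hdfs dfs -put; the HDFS path below is illustrative):

from pyspark import SparkContext, SparkConf

conf = SparkConf().setAppName("Use External File")
sc = SparkContext(conf=conf)

# the file already lives in HDFS - no --files and no sc.addFile needed
lines = sc.textFile("hdfs:///user/ctsats/readme.txt")
print(lines.collect())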

desertnaut
  • Super awesome :) – goks Nov 10 '17 at 13:39
  • --files /home/ctsats/readme.txt is copying the file to the staging directory hdfs://host-hd-01.corp.nodalpoint.com:8020/user/ctsats/.sparkStaging/application_1507295423401_0049/readme.txt, but lines = sc.textFile("readme.txt") expects the file to be in the hdfs://host-hd-01.corp.nodalpoint.com:8020/user/ctsats/ directory, so it's not working for me – goks Nov 10 '17 at 18:05
  • @goks Probably your `fs.defaultFS` setting in your `core-site.xml` conf file is set to HDFS, while mine must be the [default value](https://hadoop.apache.org/docs/r2.6.0/hadoop-project-dist/hadoop-common/core-default.xml) `file:///`. Will check it on Monday (don't have access now) - in the meantime, see http://mail-archives.us.apache.org/mod_mbox/spark-user/201402.mbox/%3C0305B4C9-4B0C-4C29-82E9-A38B03BF1A13@gmail.com%3E – desertnaut Nov 11 '17 at 10:37
  • Yes, "fs.defaultFS" is pointing to HDFS. I think if I can access the value of "fs.defaultFS" (need to find a way) inside the pyspark script (spark_distro.py), then I can access the file. `stg_path = str(fs.defaultFS) + "/user/" + str(os.environ['USER']) + "/.sparkStaging/" + str(sc.applicationId) + "/" lines = sc.textFile(os.path.join(stg_path,'readme.txt')) print(lines.collect())` – goks Nov 11 '17 at 17:11
  • I had this problem come up when initializing my SparkSession. The problem for me was that I had left the files out of `.config('spark.files', files)`. Per https://spark.apache.org/docs/latest/configuration.html, spark.files is a "Comma-separated list of files to be placed in the working directory of each executor. Globs are allowed." Add this so the file is detectable in the working dir, and get the path to list all files in the dir using getRootDirectory() – suhprano Jan 29 '20 at 23:44
  • @Fizi if you have a new issue, please open a new question (you can link here, if necessary); many details may have changed in Spark since 2017, and comments are not the right place for such follow-up issues. – desertnaut Jan 27 '21 at 15:59
  • That's fair! I opened a new question here - https://stackoverflow.com/questions/65922476/spark-execution-a-single-way-to-access-file-contents-in-both-the-driver-and-ex – Fizi Jan 27 '21 at 17:32