I tried sc.addFile
option (working without any issues) and --files
option from the command line (failed).
Run 1 : spark_distro.py
from pyspark import SparkContext, SparkConf
from pyspark import SparkFiles
def import_my_special_package(x):
from external_package import external
ext = external()
return ext.fun(x)
conf = SparkConf().setAppName("Using External Library")
sc = SparkContext(conf=conf)
sc.addFile("/local-path/readme.txt")
with open(SparkFiles.get('readme.txt')) as test_file:
lines = [line.strip() for line in test_file]
print(lines)
int_rdd = sc.parallelize([1, 2, 4, 3])
mod_rdd = sorted(int_rdd.filter(lambda z: z%2 == 1).map(lambda x:import_my_special_package(x)))
external package: external_package.py
class external(object):
def __init__(self):
pass
def fun(self,input):
return input*2
readme.txt
MY TEXT HERE
spark-submit command
spark-submit \
--master yarn-client \
--py-files /path to local codelib/external_package.py \
/local-pgm-path/spark_distro.py \
1000
Output: Working as expected
['MY TEXT HERE']
But if i try to pass the file(readme.txt) from command line using --files (instead of sc.addFile)option it is failing. Like below.
Run 2 : spark_distro.py
from pyspark import SparkContext, SparkConf
from pyspark import SparkFiles
def import_my_special_package(x):
from external_package import external
ext = external()
return ext.fun(x)
conf = SparkConf().setAppName("Using External Library")
sc = SparkContext(conf=conf)
with open(SparkFiles.get('readme.txt')) as test_file:
lines = [line.strip() for line in test_file]
print(lines)
int_rdd = sc.parallelize([1, 2, 4, 3])
mod_rdd = sorted(int_rdd.filter(lambda z: z%2 == 1).map(lambda x: import_my_special_package(x)))
external_package.py Same as above
spark submit
spark-submit \
--master yarn-client \
--py-files /path to local codelib/external_package.py \
--files /local-path/readme.txt#readme.txt \
/local-pgm-path/spark_distro.py \
1000
Output:
Traceback (most recent call last):
File "/local-pgm-path/spark_distro.py", line 31, in <module>
with open(SparkFiles.get('readme.txt')) as test_file:
IOError: [Errno 2] No such file or directory: u'/tmp/spark-42dff0d7-c52f-46a8-8323-08bccb412cd6/userFiles-8bd16297-1291-4a37-b080-bbc3836cb512/readme.txt'
Is sc.addFile
and --file
used for same purpose? Can someone please share your thoughts.