
According to this question (--files option in pyspark not working), sc.addFile should make files accessible on both the driver and the executors. But I cannot get it to work on the executors.

test.py

from pyspark import SparkContext, SparkConf
from pyspark import SparkFiles

conf = SparkConf().setAppName("File access test")
sc = SparkContext(conf=conf)
sc.addFile("file:///home/hadoop/uploads/readme.txt")

with open(SparkFiles.get('readme.txt')) as test_file:
    lines = [line.strip() for line in test_file]
print(lines) # this works
print('********************')
lines = sc.textFile(SparkFiles.get('readme.txt')) # runs on the executors; this errors
print(lines.collect())

command

spark-submit --master yarn --deploy-mode client test.py

readme.txt is under /home/hadoop/uploads on the master node.

I see the following in logs

21/01/27 15:03:30 INFO SparkContext: Added file file:///home/hadoop/uploads/readme.txt at spark://ip-10-133-70-121.sysco.net:44401/files/readme.txt with timestamp 1611759810247
21/01/27 15:03:30 INFO Utils: Copying /home/hadoop/uploads/readme.txt to /mnt/tmp/spark-f929a1e2-e7e8-401e-8e2e-dcd1def3ee7b/userFiles-fed4d5bf-3e31-4e1e-b2ae-3d4782ca265c/readme.txt

So it's copying the file to some Spark directory under a mount (I am still relatively new to the Spark world). If I use the --files flag and pass the file, it is also copied to an hdfs:// path that the executors can read.

Is this because addFile requires the file to also be present locally on the executors? Currently readme.txt is only on the master node. If so, is there a way to propagate it to the executors from the master?

I am trying to find one uniform way of accessing the file. I am able to push the file from my local machine to the master node. In the Spark code, however, I would like a single way of reading the contents of a file, whether on the driver or on an executor.
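For example, this is the kind of uniform pattern I am after: opening the addFile-shipped copy inside a task via SparkFiles.get, instead of going through sc.textFile (a sketch on my part; it assumes SparkFiles.get resolves to the executor-local copy when called inside a task):

from pyspark import SparkFiles

def read_lines(_):
    # SparkFiles.get should resolve to the executor-local copy shipped by sc.addFile
    with open(SparkFiles.get('readme.txt')) as f:
        return [line.strip() for line in f]

# run the same open() logic inside a single-partition task on an executor
print(sc.parallelize([0], 1).mapPartitions(read_lines).collect())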

Currently, for the executor part of the code to work, I also have to pass the file via the --files flag (spark-submit --master yarn --deploy-mode client --files uploads/readme.txt test.py) and use something like the following:

path = f'hdfs://{sc.getConf().get("spark.driver.host")}:8020/user/hadoop/.sparkStaging/{sc.getConf().get("spark.app.id")}/readme.txt'
lines = sc.textFile(path)
Fizi

3 Answers


One way you can do this is by putting the code files in an S3 bucket and then pointing to the file locations in your spark-submit. That way, all the worker nodes fetch the same file from S3.

Make sure that your EMR nodes have access to that S3 bucket.
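For example, something like this (a sketch; the bucket and key names are placeholders, and it assumes the EMR nodes' S3 connector can read them):

spark-submit --master yarn --deploy-mode cluster --files s3://your-bucket/uploads/readme.txt s3://your-bucket/scripts/test.py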

nimish

If you are using Jupyter notebooks, you can use the code snippet below to write an artifact into HDFS from a path visible to your local Spark executor.

from pyspark import SparkContext 

sc = SparkContext.getOrCreate()
filesystem = sc._jvm.org.apache.hadoop.fs.FileSystem
fs = filesystem.get(sc._jsc.hadoopConfiguration())
Path = sc._jvm.org.apache.hadoop.fs.Path

doc_name = "readme.txt"

# Copying Executor -> HDFS
fs.copyFromLocalFile(
    False, # Don't delete local file
    True,  # Overwrite dest file
    Path(doc_name), # src
    Path(doc_name)  # dst
)
print("My HDFS file path is...\n", fs.getWorkingDirectory() + "/" + doc_name, "\n");

Then copy from HDFS into a path visible to the Jupyter server using the following CLI command:

%%bash
# Copy HDFS to Local FS
hdfs dfs -copyToLocal -f "hdfs://<name_node>:8020/user/<user>/readme.txt" .

I haven't tested this on EMR, but it works fine with a YARN cluster in a local setup.
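Once the file is in HDFS, reading it is uniform for the driver and executors (a sketch; <name_node> and <user> are placeholders as above):

lines = sc.textFile("hdfs://<name_node>:8020/user/<user>/readme.txt")
print(lines.collect())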

Govinnage Rasika Perera

You can use --archives to share your files across the driver and executors.

Keep your archive in the below format in S3:

references.zip 
 |_file1.txt
 |_file2.txt
 |_reference.ini
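One way to build such a flat archive locally before uploading it (a sketch; run it in the directory containing the files):

zip references.zip file1.txt file2.txt reference.ini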


spark-submit --deploy-mode cluster --master yarn --archives s3://bucket/references.zip#references s3://bucket/spark_script.py

Using #references here will unzip all the files under the references/ directory.

You can access the files like this inside the executors/driver:

with open('references/file1.txt') as f:
    data1 = f.read()

and

import configparser

config = configparser.ConfigParser()
config.read('references/reference.ini')
sakthi srinivas