Apache Spark spark-submit read files from --files parameter

Question

I have a spark-submit script as follows:

spark-submit \
  --name daily_job \
  --class com.test.Bootstrapper \
  --files /home/user/*.csv \
  --conf spark.executor.memory=2g \
  --conf spark.executor.cores=2 \
  --master spark://172.17.0.4:7077 \
  --deploy-mode client \
  --packages com.typesafe:config:1.3.1 \
  file:///home/user/workspace/spark-test/target/spark-test-0.1-SNAPSHOT.jar

Cluster configuration: a master and 2 workers, running in separate containers.

After the job starts, I can see that the CSV files are copied to:

Worker:

/usr/local/spark-2.0.2-bin-hadoop2.7/work/app-20170116160937-0036/0/test.csv

Driver:

/tmp/spark-f65b2466-e419-49bd-8da7-9f2b94cbf870/userFiles-abb14b33-58b1-47d6-935e-6c2943e3d55c/test.csv

The question is: how do I properly read this file? Currently I am doing the following:

private var initial: DataFrame = spark.sqlContext.read
    .option("mode", "DROPMALFORMED")
    .option("delimiter", conf.delimiter)
    .option("dateFormat", conf.dateFormat)
    .schema(conf.schema)
    .csv("file:///*.csv")

This results in a FileNotFoundException.

score 0 · Answer 1 · answered Jan 16 '17 at 16:57

If you use --files, the files are placed in the working directory of each executor. You can access them using the same path you specified in the submission command, provided that path is also readable on every node, since a file:// path is resolved locally by each executor:

var initial = spark.read
    .option("mode", "DROPMALFORMED")
    .option("delimiter", conf.delimiter)
    .option("dateFormat", conf.dateFormat)
    .schema(conf.schema)
    .csv("file:///home/user/*.csv")

Alternatively, you could use SparkContext.addFile() to distribute the file and SparkFiles.get() to resolve its local path by file name on each node.
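
A minimal sketch of the addFile()/SparkFiles.get() route, in case it helps. The path /home/user/test.csv is an assumption based on the file name in the question, and SparkFilesSketch and countLines are hypothetical names made up for illustration. It also assumes the job is submitted without --files, so that addFile() is what ships the file:

```scala
import org.apache.spark.SparkFiles
import org.apache.spark.sql.SparkSession

import scala.io.Source

object SparkFilesSketch {

  // Hypothetical helper: counts the lines of a local file.
  // It reads from the local disk of whichever JVM calls it (driver or executor).
  def countLines(path: String): Int = {
    val src = Source.fromFile(path)
    try src.getLines().size finally src.close()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("daily_job").getOrCreate()
    val sc = spark.sparkContext

    // Ship the file to every node that runs tasks for this application.
    // Assumed path: it must exist on the machine running the driver.
    sc.addFile("/home/user/test.csv")

    // SparkFiles.get takes the bare file name, not the original path,
    // and returns the local copy's absolute path on the calling node.
    println(s"Driver copy: ${SparkFiles.get("test.csv")}")

    // Inside a task, the same call resolves to the executor's own copy,
    // so each task reads from its local working directory.
    val counts = sc.parallelize(1 to 2, 2)
      .map(_ => countLines(SparkFiles.get("test.csv")))
      .collect()

    println(s"Line counts seen by tasks: ${counts.mkString(", ")}")
    spark.stop()
  }
}
```

Note that SparkFiles.get() returns a different absolute path on the driver and on each executor (compare the two locations listed in the question), so it is safest to call it in the same place where the file is actually opened. This pattern suits small side files read inside tasks; it is not a replacement for loading a large dataset from shared storage.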
