
I'm running a spark-submit like this:

spark-submit --deploy-mode client \
             --master yarn \
             --conf spark.files.overwrite=true \
             --conf spark.local.dir='/my/other/tmp/with/more/space' \
             --conf spark.executor.extraJavaOptions='-Djava.io.tmpdir=/my/other/tmp/with/more/space' \
             --conf spark.driver.extraJavaOptions='-Djava.io.tmpdir=/my/other/tmp/with/more/space' \
             --files hdfs:///a_big_file.binary,hdfs:///another_big_file.binary \
             ... etc.

I need to distribute these two binary files to the nodes this way, since they are parsed by an external *.dll/*.so in the workers which can only process local files.
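
For reference, the worker side would pick up the local copy of a --files entry roughly like this; this is only a sketch, and parseWithNativeLib is a made-up placeholder for the external *.dll/*.so call:

    import org.apache.spark.SparkFiles
    import org.apache.spark.sql.SparkSession

    object NativeParseJob {
      // Placeholder for the external *.dll/*.so entry point; it only illustrates
      // that the library is handed a plain local filesystem path.
      def parseWithNativeLib(localPath: String): Unit = ()

      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("native-parse").getOrCreate()
        val sc = spark.sparkContext

        sc.parallelize(0 until sc.defaultParallelism).foreachPartition { _ =>
          // Absolute local path of the file shipped via --files / SparkContext.addFile.
          val localPath = SparkFiles.get("a_big_file.binary")
          parseWithNativeLib(localPath)
        }

        spark.stop()
      }
    }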

Now, running with --master yarn and --deploy-mode client, my node hosts the driver and therefore pulls the files from HDFS into the /tmp directory. Since these files are pretty big, they fill up my limited /tmp directory quite fast.

Can anybody point out the setting that changes this download path from /tmp to /my/other/tmp/with/more/space? I have already set spark.local.dir, spark.executor.extraJavaOptions and spark.driver.extraJavaOptions as shown above.
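
To see which setting actually wins, I can print on the driver where Spark stages the added files and which tmp settings the JVM picked up; a small diagnostic sketch (standard properties only, nothing cluster-specific):

    import org.apache.spark.SparkFiles
    import org.apache.spark.sql.SparkSession

    object WhereDoFilesGo {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("where-do-files-go").getOrCreate()

        // Temp directory the driver JVM is actually using.
        println(s"java.io.tmpdir      = ${System.getProperty("java.io.tmpdir")}")
        // Scratch directory Spark was configured with, if any.
        println(s"spark.local.dir     = ${spark.sparkContext.getConf.getOption("spark.local.dir").getOrElse("<unset>")}")
        // Root directory holding the files added via --files on this JVM.
        println(s"SparkFiles root dir = ${SparkFiles.getRootDirectory()}")

        spark.stop()
      }
    }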

Thank you, Maffe


1 Answer


If you already have those files on HDFS, you should not pass them as a --files argument. --files should be used to create a local copy of some static data on each executor node. In your case, you should pass the file locations as Spark job arguments and access them later.
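
For illustration, a minimal sketch of that idea, assuming the application receives the hdfs:// locations as plain arguments and copies them itself into a directory with enough space (the directory and object names here are placeholders):

    import java.net.URI
    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.spark.sql.SparkSession

    object LocalizeFromHdfs {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("localize-from-hdfs").getOrCreate()
        val hadoopConf = spark.sparkContext.hadoopConfiguration

        // HDFS locations passed as ordinary job arguments instead of --files.
        val hdfsPaths = args.toSeq
        val localDir  = "/my/other/tmp/with/more/space"

        hdfsPaths.foreach { p =>
          val src = new Path(p)
          val dst = new Path(localDir, src.getName)
          val fs  = FileSystem.get(new URI(p), hadoopConf)
          // Copy into a directory you control instead of Spark's default download dir.
          fs.copyToLocalFile(src, dst)
          println(s"copied $src -> $dst")
        }

        spark.stop()
      }
    }

If every executor needs its own copy, the same copy logic could also run inside a foreachPartition on the workers (rebuilding the Hadoop Configuration there, since it is not serializable), so each node localizes the files into the chosen directory.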

  • Hi, that is exactly what I pointed out. I need these files locally since the third-party tool can't load them from HDFS, only from a local file. – maffe Nov 27 '18 at 13:20