
I have multiple jobs that I want to run in parallel, each appending daily data into the same path using dynamic partitioning.

The problem I am facing is with the temporary path that Spark creates during job execution. Multiple jobs end up sharing the same temp folder, which causes conflicts: one job can delete temp files, and another job then fails with an error saying an expected temp file doesn't exist.

Can we change the temporary path for an individual job, or is there any alternative way to avoid this issue?
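
For reference, each job performs a write along these lines (a minimal sketch using the DataFrame API; the paths, partition column, and app name are illustrative assumptions, not taken from the actual jobs):

```scala
import org.apache.spark.sql.SparkSession

object DailyAppendJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("daily-append")
      .getOrCreate()

    // Illustrative input; each real job reads its own daily source data.
    val daily = spark.read.parquet("/data/incoming/2019-03-26")

    // Append into a shared output path, partitioned dynamically by a date column.
    // Several jobs doing this in parallel end up sharing the same temp location.
    daily.write
      .mode("append")
      .partitionBy("event_date")
      .parquet("/data/warehouse/events")

    spark.stop()
  }
}
```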

techie

1 Answer


To change the temp location you can do this:

/opt/spark/bin/spark-shell --conf "spark.local.dir=/local/spark-temp"

spark.local.dir changes where all temp files are read and written. I would advise creating this directory and setting its permissions from the command line before the first session with this argument is run.
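
If you would rather set it per application than on the shell command line, the same property can go on the SparkSession builder before the context starts. A minimal sketch (the path and app name are assumptions; note that on a cluster manager such as YARN the node manager's local dirs may override spark.local.dir):

```scala
import org.apache.spark.sql.SparkSession

object FinanceJob {
  def main(args: Array[String]): Unit = {
    // spark.local.dir must be set before the SparkContext is created;
    // a cluster manager (e.g. YARN) may override it with its own local dirs.
    val spark = SparkSession.builder()
      .appName("finance-run")
      .config("spark.local.dir", "/tmp/finance/spark-temp") // job-specific temp location (assumed path)
      .getOrCreate()

    // ... job logic ...

    spark.stop()
  }
}
```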

afeldman
  • but changing this will change the location for all runs. I am running jobs via an Oozie workflow, and I need a separate path for every job run – techie Mar 26 '19 at 17:53
  • This will not assist with the Oozie workflow manager, however is this the idea you are looking for? Each independent Spark job reads and writes to its own temporary folder: `/opt/spark/bin/spark-submit --driver-memory 5g --executor-memory 2g --conf "spark.local.dir=/tmp/finance/spark-logs" --class org.apache.spark.company.driver /folder/to/jar/driver.jar financial_run` and `/opt/spark/bin/spark-submit --driver-memory 5g --executor-memory 2g --conf "spark.local.dir=/tmp/procurement/spark-logs" --class org.apache.spark.company.driver /folder/to/jar/driver.jar procurement_run` – afeldman Mar 26 '19 at 19:02