3

1st Question: I have a 2-node virtual cluster with Hadoop. I have a jar that runs a Spark job, and this jar accepts as a CLI argument a path to a commands.txt file which tells it which commands to run.

I run the job with spark-submit, and I noticed that my slave node wasn't running the job because it couldn't find the commands.txt file, which only exists locally on the master.

This is the command I used to run it:

./spark-1.6.1-bin-hadoop2.6/bin/spark-submit \
    --class univ.bigdata.course.MainRunner \
    --master yarn \
    --deploy-mode cluster \
    --executor-memory 1g \
    --num-executors 4 \
    final-project-1.0-SNAPSHOT.jar commands commands.txt

Do I need to upload commands.txt to HDFS and pass the HDFS path instead, as follows?

hdfs://master:9000/user/vagrant/commands.txt

2nd Question: How do I write to a file in the current working directory on the driver machine? I used a normal Scala FileWriter to write the output to queries_out.txt, and it worked fine when using spark-submit with

--master local[*]

But when running in

--master yarn

I can't find the file. No exceptions are thrown, but I just can't locate it; it doesn't exist, as if it was never written. Is there a way to write the results to a local file on the driver machine, or should I only write results to HDFS?
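For reference, the write itself is nothing special; simplified, it looks roughly like this (in the real job the lines come from the query results):

import java.io.PrintWriter

// simplified stand-in for the real query results
val results: Seq[String] = Seq("query 1: ...", "query 2: ...")

val out = new PrintWriter("queries_out.txt")
results.foreach(line => out.println(line))
out.close()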

Thanks.

Ethan

3 Answers

3

Question 1: Yes, uploading it to HDFS, or to any other network-accessible file system, is how you solve your problem.
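For example, with the paths from your question (the exact layout is just an assumption, and this only works if the jar opens the file through the Hadoop/Spark APIs rather than plain java.io):

hdfs dfs -put commands.txt hdfs://master:9000/user/vagrant/commands.txt

./spark-1.6.1-bin-hadoop2.6/bin/spark-submit \
    --class univ.bigdata.course.MainRunner \
    --master yarn \
    --deploy-mode cluster \
    --executor-memory 1g \
    --num-executors 4 \
    final-project-1.0-SNAPSHOT.jar commands hdfs://master:9000/user/vagrant/commands.txt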

Question 2:

This is a bit tricky. Assuming your results are in an RDD, you could call collect(), which aggregates all the data on your driver process. You then have a standard collection in your hands, which you can simply write to disk. Note that you should give your driver process enough memory to hold all the results, and do not forget to also increase the maximum result size. The parameters are:

--driver-memory 16G --conf "spark.driver.maxResultSize=15g"

This has very poor scaling behaviour in both communication cost and memory (both grow with the size of the result RDD). It is the easiest way, and perfectly fine for a toy project or when the data set is always small; in all other cases it will certainly blow up at some point.
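A minimal Scala sketch of that collect-and-write approach (the RDD and file names here are only illustrative):

import java.io.PrintWriter
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("collect-example"))

// stand-in for whatever RDD your computation actually produces
val resultsRdd = sc.parallelize(Seq("result 1", "result 2"))

// collect() pulls the entire RDD into the driver process -- only viable for small results
val lines: Array[String] = resultsRdd.collect()

// write on the driver's local disk, in its current working directory
val out = new PrintWriter("queries_out.txt")
lines.foreach(line => out.println(line))
out.close()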

The better way, as you already hinted, is to use the built-in "saveAs" methods (e.g. saveAsTextFile) to write to, for example, HDFS or another storage system. You can check the documentation for that: http://spark.apache.org/docs/latest/programming-guide.html#actions
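For example, reusing the resultsRdd name from the sketch above (the output path is just an assumption):

// writes one part-file per partition into the given directory on HDFS
resultsRdd.saveAsTextFile("hdfs://master:9000/user/vagrant/queries_out")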

Note that if you only want to persist the RDD because you are reusing it in several computations (like cache, but holding it on disk instead of in memory), there is also a persist method on RDDs.
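For instance:

import org.apache.spark.storage.StorageLevel

// keep the RDD on disk between actions instead of recomputing it
resultsRdd.persist(StorageLevel.DISK_ONLY)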

uberwach
  • I have used collect / take when I needed to write things to a file, and it worked when running in local[*], but as soon as I ran it on the cluster it just creates no file at all. – Ethan Jun 30 '16 at 12:16
  • Is your driver on the cluster as well? Note that the driver is not necessarily the master. The driver is the computer where you run spark-submit or spark-shell. – uberwach Jun 30 '16 at 12:33
  • Yes, I log into the master and run spark-submit from there. – Ethan Jun 30 '16 at 12:41
0

The solution was very simple: I changed --deploy-mode cluster to --deploy-mode client, and then the file writes were done correctly on the machine where I ran the driver.
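For reference, this is the submit command from the question with just that one option changed:

./spark-1.6.1-bin-hadoop2.6/bin/spark-submit \
    --class univ.bigdata.course.MainRunner \
    --master yarn \
    --deploy-mode client \
    --executor-memory 1g \
    --num-executors 4 \
    final-project-1.0-SNAPSHOT.jar commands commands.txt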

Ethan
0

Answer to Question 1: Submitting the Spark job with the --files flag followed by the path to a local file distributes that file from the machine that submits the job to the working directory of every worker container, so it can then be accessed simply by its file name.
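A sketch of how that might look with the command from the question (everything else unchanged):

./spark-1.6.1-bin-hadoop2.6/bin/spark-submit \
    --class univ.bigdata.course.MainRunner \
    --master yarn \
    --deploy-mode cluster \
    --executor-memory 1g \
    --num-executors 4 \
    --files commands.txt \
    final-project-1.0-SNAPSHOT.jar commands commands.txt

Inside the job the file can then be opened simply as commands.txt, because it is copied into each container's working directory.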