I'm setting up a somewhat ad-hoc cluster of Spark workers: a couple of lab machines I have sitting around. However, I've run into a problem when I try to start the cluster with start-all.sh: Spark is installed in different directories on the various workers, yet the master invokes $SPARK_HOME/sbin/start-all.sh on each one using the master's definition of $SPARK_HOME, even though the path is different on each worker.

Assuming I can't install Spark at the same path on every worker as on the master, how can I get the master to recognize the different worker paths?
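Roughly, the failure mode looks like this (a simplified sketch, not the actual sbin scripts; the conf/slaves file and the start-slave.sh invocation are my assumptions about what the standalone launch scripts do, and the exact script names and arguments vary by version):

# simplified sketch: the loop runs on the master, so "$SPARK_HOME"
# expands to the *master's* install path before the ssh happens
for worker in $(cat "$SPARK_HOME/conf/slaves"); do
  ssh "$worker" "$SPARK_HOME/sbin/start-slave.sh spark://master:7077"
done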

EDIT #1: Hmm, I found this thread in the Spark mailing list, which strongly suggests that this is the current implementation: $SPARK_HOME is assumed to be the same for all workers.

Magsol
  • Would you mind taking a look at my reply to this mailing list thread? I have a question about configuring a different `log4j.properties` per worker that I can't seem to overcome. This isn't what I'd use in reality, but for mucking around and understanding what's going on it would be a help. – Brad Feb 13 '15 at 17:08

3 Answers

I'm playing around with Spark on Windows (my laptop) and have two worker nodes running by starting them manually, using a script that contains the following:

REM per-worker start script: each copy points at its own Spark install
set SPARK_HOME=C:\dev\programs\spark-1.2.0-worker1
set SPARK_MASTER_IP=master.brad.com
spark-class org.apache.spark.deploy.worker.Worker spark://master.brad.com:7077

I then create a copy of this script, with a different SPARK_HOME defined, to run my second worker. When I kick off a spark-submit, I see this on Worker_1:

15/02/13 16:42:10 INFO ExecutorRunner: Launch command: ...C:\dev\programs\spark-1.2.0-worker1\bin...

and this on Worker_2:

15/02/13 16:42:10 INFO ExecutorRunner: Launch command: ...C:\dev\programs\spark-1.2.0-worker2\bin...

So it works: in my case I duplicated the Spark installation directory for each worker, but you may be able to avoid that.
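For reference, the second worker's script is identical apart from SPARK_HOME; a sketch of that copy (the worker2 path is taken from the Worker_2 log line above):

REM hypothetical copy of the script above for the second worker;
REM only SPARK_HOME changes
set SPARK_HOME=C:\dev\programs\spark-1.2.0-worker2
set SPARK_MASTER_IP=master.brad.com
spark-class org.apache.spark.deploy.worker.Worker spark://master.brad.com:7077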

Brad

You might want to consider assigning the worker directory by changing the SPARK_WORKER_DIR line in the spark-env.sh file on each worker.
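For illustration, a minimal entry in conf/spark-env.sh on a worker might look like this (the path is a placeholder; note that SPARK_WORKER_DIR controls the worker's scratch/work directory rather than the install location):

# conf/spark-env.sh on the worker (placeholder path)
export SPARK_WORKER_DIR=/data/spark/work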

payamf1

A similar question was asked here.

The solution I used was to create a symbolic link on each worker node that mimics the master node's installation path, so that when start-all.sh on the master node SSHes into a worker, it sees the same path and can run the worker scripts.

For example, in my case I had two Macs and one Linux machine. Both Macs had Spark installed under /Users/<user>/spark, while the Linux machine had it under /home/<user>/spark. One of the Macs was the master node, so running start-all.sh errored each time on the Linux machine because of the path mismatch (error: /Users/<user>/spark does not exist).

The simple solution was to mimic the Mac's pathing on the Linux machine using a symbolic link:

# open a terminal on the Linux machine, then:
cd /                   # go to the root of the drive
sudo ln -s home Users  # create a symlink "Users" pointing to the actual "home" directory
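To sanity-check the link afterwards (illustrative commands; <user> is a placeholder, as above):

ls -ld /Users                  # should show: /Users -> home
ls /Users/<user>/spark/sbin    # now resolves to the same files as /home/<user>/spark/sbin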

Chris Smith