
For example, my Spark cluster has 100 nodes (workers). When I run a job, I want it to run on only 10 specific nodes. How can I achieve this? By the way, I'm using Spark standalone mode.

Why I need this:

One of my Spark jobs needs to access NFS, but only 10 of the nodes are permitted to access it. If the job were distributed across all 100 worker nodes, access-denied exceptions would be thrown and the job would fail.
Jack

2 Answers


Spark workers are "chosen" by data locality: only the workers on nodes where the data resides will do the work. So one way to achieve this is simply to store your files on those 10 nodes only. There is no direct way of choosing specific nodes for a job in standalone mode. Moreover, this would imply that the job always needs to start by moving large amounts of data between nodes, which is not very efficient.

z-star
  • Thanks. If I use YARN or Mesos, is it possible to target specific nodes for a job? – Jack May 29 '16 at 19:22

You can use the documentation here.
These instructions and the files below are present when you install the cluster using the bootstrap node. You first need to add MESOS_ATTRIBUTES as described here.
Just add the following line on the nodes you want, under /var/lib/dcos/mesos-slave-common (or the file matching your node type (slave|master|public)), and restart the agent service: systemctl restart dcos-mesos-slave.service

TIP: you can check which environment files are loaded in the unit file /etc/systemd/system/dcos-mesos-<mesos-node-type>.service
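For example, a quick way to see which environment files the unit loads (assuming a private agent, so the default unit name dcos-mesos-slave.service) is:

# Print the agent's unit file, including the EnvironmentFile= directives it loads
systemctl cat dcos-mesos-slave.service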

MESOS_ATTRIBUTES=<attribute>:<value>,<attribute>:<value> ... 
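As a minimal sketch, assuming a hypothetical attribute named nfs with value true that marks the 10 NFS-capable agents (the attribute name is illustrative, not taken from the documentation):

# Run on each of the 10 agents that are allowed to access NFS
# (append the attribute, or merge it into an existing MESOS_ATTRIBUTES line)
echo 'MESOS_ATTRIBUTES=nfs:true' | sudo tee -a /var/lib/dcos/mesos-slave-common
sudo systemctl restart dcos-mesos-slave.service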

Then, following the documentation, you can submit your Spark job:

docker run mesosphere/spark:2.3.1-2.2.1-2-hadoop-2.6 /opt/spark/dist/bin/spark-submit \
  --deploy-mode cluster ... \
  --conf spark.mesos.constraints="<attribute>:<value>" \
  --conf spark.mesos.driver.constraints="<attribute>:<value>" ...

Keep in mind that:

spark.mesos.constraints applies to the executors
spark.mesos.driver.constraints applies to the driver

Use one or both depending on whether you need the driver or the executors to access the data; the Docker containers will then be started only on nodes carrying the attributes you specified.
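
With the example attribute from earlier, a concrete submission would look like the following sketch (the image tag is taken from the command above; nfs:true is the hypothetical attribute defined earlier, and the elided flags are left as ...):

docker run mesosphere/spark:2.3.1-2.2.1-2-hadoop-2.6 /opt/spark/dist/bin/spark-submit \
  --deploy-mode cluster ... \
  --conf spark.mesos.constraints="nfs:true" \
  --conf spark.mesos.driver.constraints="nfs:true" ...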