
I'm trying to host locally a spark standalone cluster. I have two heterogeneous machines connected on a LAN. Each piece of the architecture listed below is running on docker. I have the following configuration

  • master on machine 1 (port 7077 exposed)
  • worker on machine 1
  • driver on machine 2

I use a test application that opens a file and counts its lines. The application works when the file is replicated on all workers and I use SparkContext.textFile()

But when the file is only present on the worker and I use SparkContext.parallelize() to access it on the workers, I get the following output:

```
INFO StandaloneSchedulerBackend: Granted executor ID app-20180116210619-0007/4 on hostPort 172.17.0.3:6598 with 4 cores, 1024.0 MB RAM
INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20180116210619-0007/4 is now RUNNING
INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20180116210619-0007/4 is now EXITED (Command exited with code 1)
INFO StandaloneSchedulerBackend: Executor app-20180116210619-0007/4 removed: Command exited with code 1
INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20180116210619-0007/5 on worker-20180116205132-172.17.0.3-6598 (172.17.0.3:6598) with 4 cores
```

which repeats over and over without the application ever actually computing anything.

This works when I put the driver on the same PC as the worker, so I guess some kind of connection between the two across the network needs to be permitted. Are you aware of a way to do that (which ports to open, which addresses to add to /etc/hosts, ...)?

Matthias Beaupère

1 Answer


TL;DR Make sure that `spark.driver.host`:`spark.driver.port` can be accessed from each node in the cluster.

In general you have to ensure that all nodes (both executors and master) can reach the driver.

  • In cluster mode, where the driver runs on one of the executors, this is satisfied by default, as long as no ports are closed to these connections (see below).
  • In client mode, the machine on which the driver has been started has to be accessible from the cluster. It means that `spark.driver.host` has to resolve to a publicly reachable address.

In both cases you have to keep in mind that, by default, the driver runs on a random port. It is possible to use a fixed one by setting `spark.driver.port`. Obviously this doesn't work that well if you want to submit multiple applications at the same time.
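A minimal sketch of such a configuration, assuming hypothetical addresses (`192.168.1.10` for the master on machine 1, `192.168.1.20` for the driver machine) and arbitrarily chosen fixed ports:

```python
from pyspark import SparkConf, SparkContext

conf = (
    SparkConf()
    .setAppName("line-count")
    # Standalone master on machine 1 (address is an example).
    .setMaster("spark://192.168.1.10:7077")
    # Address the executors connect back to; it must be reachable from
    # every node, i.e. the LAN address of machine 2, not a
    # Docker-internal 172.17.x.x address.
    .set("spark.driver.host", "192.168.1.20")
    # Pin the driver's otherwise-random ports so they can be opened in
    # the firewall / published from the driver's container.
    .set("spark.driver.port", "7078")
    .set("spark.driver.blockManager.port", "7079")
)

sc = SparkContext(conf=conf)
```

With the ports fixed like this, the corresponding container ports have to be published (or the containers run with host networking) so the executors can actually reach them.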

Furthermore:

when the file is only present on worker

won't work. All inputs have to be accessible from the driver, as well as from each executor node.
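The difference between the two access patterns can be sketched as follows, assuming an existing SparkContext `sc` and an example path:

```python
# 1) textFile: each executor opens the path itself, so the file must
#    exist at that path on every worker node (or live on shared or
#    distributed storage such as NFS or HDFS).
count_a = sc.textFile("/data/input.txt").count()

# 2) parallelize: the data is read here, on the driver, and then
#    shipped to the executors -- so the file only needs to be readable
#    on the driver machine, but it is never read from the workers.
with open("/data/input.txt") as f:
    lines = f.read().splitlines()
count_b = sc.parallelize(lines).count()
```

Neither variant lets you read a file that exists on a single worker only: `parallelize` reads on the driver, and `textFile` expects the path on all executors.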

zero323
  • Thanks to your answer, I think the issue is with communication between docker containers across the network. – Matthias Beaupère Jan 17 '18 at 14:01
  • I finally started my driver and worker containers with the `--net=host` option and it worked. This is only a temporary solution as it doesn't give me network isolation. I'm now studying docker to find a proper solution. – Matthias Beaupère Jan 22 '18 at 09:31
  • Hi @user6910411 - for some reason when I start the sparkContext from my local machine to a remote spark cluster, the value for spark.driver.host is the local IP of my machine which is obviously not reachable from the worker node. How can this be resolved? – nEO Jan 03 '20 at 05:24
  • @nEO `spark.driver.host` (and optionally `spark.driver.bindAddress`). – user10938362 Jan 04 '20 at 11:58