
I set up a small Spark environment on two machines. One runs a master and a worker, the other runs only a worker. I can use this cluster from the Spark shell like this:

spark-shell --master spark://mymaster.example.internal:7077

I can run computations in there that get distributed to the nodes correctly, so everything runs fine.

However, I am having trouble when using the spark-jobserver.

My first try was to start the Docker container with the environment variable SPARK_MASTER pointing to the correct master URL. When a job was started, the worker it was pushed to complained that it could not connect back to 172.18.x.y:nnnn. That was understandable, because this is the internal IP address of the Docker container the jobserver runs in.
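For reference, that first attempt looked roughly like this (the image name is just a placeholder for the jobserver image I use; 8090 is the jobserver's REST port):

docker run -d -p 8090:8090 -e SPARK_MASTER=spark://mymaster.example.internal:7077 my-spark-jobserver-image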

So I ran the jobserver container again with --network host so that it attached itself to the host network. However, starting the job led to a Connection refused again, this time saying it couldn't connect to 172.30.10.10:nnnn. 172.30.10.10 is the IP address of the host I want to run the jobserver on, and it IS reachable from both the worker and the master nodes (the Spark instances run in Docker containers too, but they are also attached to the host network).
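Concretely, the second attempt was roughly this (same placeholder image name; with --network host the explicit port mapping is no longer needed):

docker run -d --network host -e SPARK_MASTER=spark://mymaster.example.internal:7077 my-spark-jobserver-image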

Digging deeper, I started a Docker container that contains just a JVM and Spark, ran it with --network host as well, and launched a Spark job from inside it. This worked.
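As a sketch of that test, submitting e.g. the bundled SparkPi example from inside such a container looks roughly like this (image name and the path to the examples jar are placeholders and depend on the Spark build inside the image):

docker run --rm --network host my-plain-spark-image \
  spark-submit --master spark://mymaster.example.internal:7077 \
  --class org.apache.spark.examples.SparkPi \
  /opt/spark/examples/jars/spark-examples.jar 100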

What might I be missing?

rabejens

1 Answer


It turned out that I had missed starting the shuffle service. I had configured my custom jobserver container to use dynamic allocation, and dynamic allocation requires the external shuffle service to be running on the workers.
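For a standalone cluster this comes down to two pieces of configuration (property names as in the Spark documentation; where exactly they go depends on how the images are built): the external shuffle service has to be enabled on every worker, and the jobserver's context configuration needs dynamic allocation plus the shuffle service switched on.

# on each worker, e.g. in conf/spark-defaults.conf or via -Dspark.shuffle.service.enabled=true in SPARK_WORKER_OPTS, then restart the worker
spark.shuffle.service.enabled true

# in the Spark configuration the jobserver passes to its contexts
spark.dynamicAllocation.enabled true
spark.shuffle.service.enabled true

Alternatively, the shuffle service can be run on its own with sbin/start-shuffle-service.sh from the Spark distribution.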

rabejens