
I'm learning Spark and wanted to run the simplest possible cluster consisting of two physical machines. I've done all the basic setup and it seems to be fine. The output of the automatic start script looks as follows:

[username@localhost sbin]$ ./start-all.sh 
starting org.apache.spark.deploy.master.Master, logging to /home/username/spark-1.6.0-bin-hadoop2.6/logs/spark-username-org.apache.spark.deploy.master.Master-1-localhost.out
localhost: starting org.apache.spark.deploy.worker.Worker, logging to /home/username/spark-1.6.0-bin-hadoop2.6/logs/spark-username-org.apache.spark.deploy.worker.Worker-1-localhost.out
username@192.168.???.??: starting org.apache.spark.deploy.worker.Worker, logging to /home/username/spark-1.6.0-bin-hadoop2.6/logs/spark-username-org.apache.spark.deploy.worker.Worker-1-localhost.localdomain.out

so no errors here, and it seems that a Master node is running as well as two Worker nodes. However, when I open the web UI at 192.168.???.??:8080, it only lists one worker: the local one. My issue is similar to the one described here: Spark Clusters: worker info doesn't show on web UI, but there's nothing unusual in my /etc/hosts file. All it contains is:

127.0.0.1 localhost.localdomain localhost
::1 localhost6.localdomain6 localhost6 

What am I missing? Both machines are running Fedora Workstation x86_64.

Krzysiek Setlak
  • The simplest possible cluster is a standalone cluster. You might want to start by reading the following [documentation](http://spark.apache.org/docs/latest/spark-standalone.html). – eliasah Feb 16 '16 at 14:10
  • @eliasah Spark Standalone refers to the cluster manager (as opposed to YARN/Mesos); it has nothing to do with the number of nodes, and this is stated at the very beginning of the documentation you linked. Please don't join the discussion if you have nothing to offer; it hurts the readability of the thread. – Krzysiek Setlak Feb 16 '16 at 14:29
  • @Sumit 1. downloading the precompiled Spark 1.6 with Hadoop 2.6 support, 2. setting up passwordless ssh access from the master machine to the slave one, 3. adding the slave machine to the conf/slaves file, 4. running the start scripts. I have done nothing else yet. – Krzysiek Setlak Feb 16 '16 at 14:30
  • Could you start with describing your network configuration and adding logs? Also if you have some useful details to add just [edit](https://stackoverflow.com/posts/35434270/edit) the question. – zero323 Feb 16 '16 at 15:34
  • @zero323 Naturally, the logs are here: Master: [link](http://pastebin.com/SXSnGMJx), local worker: [link](http://pastebin.com/GCucwFfx), remote worker: [link](http://pastebin.com/JKLxgY3P). As to the network configuration, please tell me what to provide. I'm using a corporate LAN here, which is a black box for me, but ssh and all that kind of stuff works fine. The logs show that the remote worker stubbornly looks for a Master locally even though SPARK_LOCAL_IP is set and spark.master in spark-defaults.conf is also defined. – Krzysiek Setlak Feb 17 '16 at 10:28
  • OK, so the problem is the master configuration. Since its `/etc/hosts` provides only a localhost configuration, this information is passed to the remote worker. It tries to connect to the master on localhost (which is visible in its logs) and obviously fails. – zero323 Feb 17 '16 at 10:35
  • You either have to make your master reachable from the remote worker and update the configuration, or you can try to forward all required ports over ssh. – zero323 Feb 17 '16 at 10:37
  • @zero323 Great! Could you please explain in more detail what you mean by "You have to make your master reachable from the remote worker and update the configuration"? – Krzysiek Setlak Feb 17 '16 at 14:30
  • Either configure SPARK_MASTER_IP so that it points to an address that is reachable by the worker, or provide an entry in /etc/hosts that maps the hostname to a reachable (not localhost) IP; a sketch of this follows after the comments. This should be enough. – zero323 Feb 17 '16 at 15:13
  • Thanks! I will try that tomorrow as soon as I get access to the hardware. I think your suggestion will be eligible as the final answer to the question. – Krzysiek Setlak Feb 17 '16 at 18:30
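
To make the second option from zero323's comment above concrete, here is a minimal sketch, assuming the master box is given a real (non-localhost) hostname, for which spark-master is just a placeholder, and assuming 192.168.128.224 (the LAN address reported in the master log quoted further down) is its reachable address:

# on the master (Fedora uses systemd, so hostnamectl is available);
# spark-master is a made-up hostname
sudo hostnamectl set-hostname spark-master

# /etc/hosts on the master (and ideally on the worker as well):
# map that hostname to the reachable LAN address, not to the loopback one
127.0.0.1        localhost.localdomain localhost
::1              localhost6.localdomain6 localhost6
192.168.128.224  spark-master

With something like that in place, the master should stop advertising itself as localhost to the remote worker.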

3 Answers


Basically, the source of the problem is that the master hostname resolves to localhost. This is visible both in the console output:

starting org.apache.spark.deploy.master.Master, logging to 
/home/.../spark-username-org.apache.spark.deploy.master.Master-1-localhost.out

where the last part corresponds to the hostname. You can see the same behavior in the master log:

16/02/17 11:13:54 WARN Utils: Your hostname, localhost resolves to a loopback address: 127.0.0.1; using 192.168.128.224 instead (on interface eno1)

and in the remote worker logs:

16/02/17 11:13:58 WARN Worker: Failed to connect to master localhost:7077
java.io.IOException: Failed to connect to localhost/127.0.0.1:7077
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:216)
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:167)
    at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:200)
    at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:187)
    at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:183)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.ConnectException: Connection refused: localhost/127.0.0.1:7077

This means that the remote worker tries to access the master on localhost and obviously fails. Even if the worker were able to connect to the master, it wouldn't work in the reverse direction for the same reason.

Some ways to solve this problem:

  • provide a proper network configuration for both workers and master to ensure that hostnames used by each machine can be properly resolved to the corresponding IP addresses.
  • use ssh tunnels to forward all required ports between the remote workers and the master (a rough sketch follows below).
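
A rough sketch of the tunnelling option only, not a definitive recipe: it forwards just the master RPC port (7077 by default) from the remote worker box, while in practice the worker's own ports would also have to be forwarded in the opposite direction so the master can reach the worker back. The username and 192.168.128.224 (the master's LAN address from the log above) are stand-ins:

# run on the remote worker box: forward the master's RPC port over ssh
ssh -N -L 7077:localhost:7077 username@192.168.128.224 &

# then register the worker against the forwarded port
${SPARK_HOME}/sbin/start-slave.sh spark://localhost:7077
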
zero323

It seems like Spark is very picky about IPs and machine names. So, when starting your master, it will use your machine name to register the Spark master. If that name is not reachable from your workers, it will be almost impossible to reach.

A workaround is to start your master like this:

SPARK_MASTER_IP=YOUR_SPARK_MASTER_IP ${SPARK_HOME}/sbin/start-master.sh

Then you will be able to connect your slaves like this:

${SPARK_HOME}/sbin/start-slave.sh spark://YOUR_SPARK_MASTER_IP:PORT

And there you go!
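
As a side note (see also the comment below), on later Spark releases (roughly 2.x onward) SPARK_MASTER_IP was deprecated in favour of SPARK_MASTER_HOST, so a sketch of the equivalent command, with the same placeholder address and assuming conf/spark-env.sh doesn't already set the variable, would be:

SPARK_MASTER_HOST=YOUR_SPARK_MASTER_IP ${SPARK_HOME}/sbin/start-master.sh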

dsncode
  • Thank you so much! Just to note that `SPARK_MASTER_IP` is now deprecated, we should use `SPARK_MASTER_HOST` instead. – mkaran Dec 12 '18 at 11:54

I had a similar issue, which got resolved by setting SPARK_MASTER_IP in $SPARK_HOME/conf/spark-env.sh. spark-env.sh essentially sets the environment variable SPARK_MASTER_IP, which points to the IP the Master should be bound to. start-master.sh then reads this variable and binds the Master to it, so the Master becomes reachable from outside the box it is running on.
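
For illustration, a minimal sketch of such a spark-env.sh, reusing the LAN address reported in the master log above as a stand-in for the address the Master should be reachable at:

# $SPARK_HOME/conf/spark-env.sh on the master machine
# 192.168.128.224 stands in for the master's reachable (non-loopback) address
export SPARK_MASTER_IP=192.168.128.224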

Salim