
I have two machines on one network. On the master-machine I am running ./sbin/start-master.sh, and on the other one (let's call it the slave-machine) I am running ./bin/spark-shell --master spark://master-machine:7077. But I get the following errors on the slave-machine when running the spark-shell script:

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel).
16/09/07 18:56:37 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/09/07 18:56:38 WARN Utils: Your hostname, <HOSTNAME> resolves to a loopback address: 127.0.0.1; using 192.168.0.68 instead (on interface wlan0)
16/09/07 18:56:38 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
16/09/07 18:56:39 WARN StandaloneAppClient$ClientEndpoint: Failed to connect to master <master-machine's IP>:7077
org.apache.spark.SparkException: Exception thrown in awaitResult
    at org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77)
    at org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:75)
    at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
    at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
    at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
    at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167)
    at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
    at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:88)
    at org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:96)
    at org.apache.spark.deploy.client.StandaloneAppClient$ClientEndpoint$$anonfun$tryRegisterAllMasters$1$$anon$1.run(StandaloneAppClient.scala:109)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Failed to connect to /192.168.43.27:7077
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:228)
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:179)
    at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:197)
    at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:191)
    at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:187)
    ... 4 more
Caused by: java.net.ConnectException: Connection refused: /<master-machine's IP>:7077
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
    at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:224)
    at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:289)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
    ... 1 more

I tried the master-machine's IP as well, but it did not work.
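A quick way to confirm from the slave-machine whether the port is reachable at all (a generic network check, not Spark-specific; the IP below is the one from the stack trace above):

    # Run on the slave-machine. "Connection refused" here corresponds to
    # the java.net.ConnectException in the trace above.
    nc -zv 192.168.43.27 7077

    # Without nc installed, plain bash can do the same check:
    (echo > /dev/tcp/192.168.43.27/7077) && echo open || echo closed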

On the master-machine, netstat -nltu shows the following:

Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address            Foreign Address         State      
tcp        0      0 0.0.0.0:22               0.0.0.0:*               LISTEN     
tcp6       0      0 :::8080                  :::*                    LISTEN     
tcp6       0      0 127.0.1.1:6066           :::*                    LISTEN     
tcp6       0      0 :::22                    :::*                    LISTEN     
tcp6       0      0 127.0.1.1:7077           :::*                    LISTEN     
udp        0      0 0.0.0.0:27535            0.0.0.0:*                          
udp        0      0 0.0.0.0:68               0.0.0.0:*                          
udp6       0      0 :::54678                 :::*    

As the output shows, port 7077 is listening only on the loopback address (127.0.1.1), so it accepts connections only from localhost, not from other machines on the network.
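The 127.0.1.1 address is a clue: on Debian/Ubuntu systems, /etc/hosts conventionally maps the machine's own hostname to 127.0.1.1, and Spark binds to whatever the hostname resolves to. A way to check this (assuming a Debian-style /etc/hosts; "master-machine" is a placeholder hostname):

    # On the master-machine: see what the hostname resolves to.
    getent hosts $(hostname)
    # Typical Debian/Ubuntu output that causes the binding above:
    # 127.0.1.1    master-machine

    # The corresponding /etc/hosts entries usually look like:
    # 127.0.0.1    localhost
    # 127.0.1.1    master-machine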

I tried most of the solutions online, but none of them worked until I set SPARK_MASTER_HOST to the IP of the master-machine. After that, netstat -nltu shows:

Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address            Foreign Address         State      
tcp        0      0 0.0.0.0:22               0.0.0.0:*               LISTEN     
tcp6       0      0 :::8080                  :::*                    LISTEN     
tcp6       0      0 <master-machine IP>:6066 :::*                    LISTEN     
tcp6       0      0 :::22                    :::*                    LISTEN     
tcp6       0      0 <master-machine IP>:7077 :::*                    LISTEN     
udp        0      0 0.0.0.0:27535            0.0.0.0:*                          
udp        0      0 0.0.0.0:68               0.0.0.0:*                          
udp6       0      0 :::54678                 :::* 
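For reference, here is roughly what that change looks like; the variable name and file are from Spark's standalone-mode configuration, and the IP is the master-machine's address from the trace above:

    # conf/spark-env.sh on the master-machine:
    # bind the standalone master to the machine's network address
    # instead of the hostname's 127.0.1.1 loopback resolution.
    export SPARK_MASTER_HOST=192.168.43.27

    # Restart the master so the setting takes effect:
    # ./sbin/stop-master.sh && ./sbin/start-master.sh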

The problem is now solved, but what is the best practice for solving it? What seems weird to me is that you would reasonably want port 7077 to accept connections not only from localhost but from other machines as well.
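One alternative, if you prefer not to edit spark-env.sh: the standalone master script also accepts the bind address on the command line (a sketch using the --host option documented for the Spark 2.x standalone master):

    # Start the master bound to a specific address (on the master-machine):
    ./sbin/start-master.sh --host 192.168.43.27

    # The shell on the slave-machine then connects to that same address:
    ./bin/spark-shell --master spark://192.168.43.27:7077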

PS: The Spark version is spark-2.0.0-bin-hadoop2.7.

  • By default, the master is bound to its local address; you need to explicitly configure the master's IP binding. You can probably get help from here: http://stackoverflow.com/questions/37190934/spark-cluster-master-ip-address-not-binding-to-floating-ip – Abhi Sep 07 '16 at 20:48
  • I know that, and I have already seen the link you sent. My question is why it is bound to the local address in the first place, and why there is no documentation saying you should bind the master to its network IP instead of the loopback address. As mentioned above, I am concerned because the main intention of Spark is to work across several machines, yet the loopback address is used for port 7077 and there are no instructions describing the procedure. – Arsinux Sep 08 '16 at 08:03

0 Answers