13

I want to install Spark in Standalone mode on a cluster with my two virtual machines.
With spark-0.9.1-bin-hadoop1, I can execute spark-shell successfully in each VM. I followed the official document to make one VM (ip: xx.xx.xx.223) both Master and Worker, and to make the other (ip: xx.xx.xx.224) a Worker only.
But the 224-ip VM cannot connect to the 223-ip VM. Below is 223's (Master) master log:

[@tc-52-223 logs]# tail -100f spark-root-org.apache.spark.deploy.master.Master-1-tc-52-223.out
Spark Command: /usr/local/jdk/bin/java -cp :/data/test/spark-0.9.1-bin-hadoop1/conf:/data/test/spark-0.9.1-bin-hadoop1/assembly/target/scala-2.10/spark-assembly_2.10-0.9.1-hadoop1.0.4.jar -Dspark.akka.logLifecycleEvents=true -Djava.library.path= -Xms512m -Xmx512m org.apache.spark.deploy.master.Master --ip 10.11.52.223 --port 7077 --webui-port 8080

log4j:WARN No appenders could be found for logger (akka.event.slf4j.Slf4jLogger).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
14/04/14 22:17:03 INFO Master: Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
14/04/14 22:17:03 INFO Master: Starting Spark master at spark://10.11.52.223:7077
14/04/14 22:17:03 INFO MasterWebUI: Started Master web UI at http://tc-52-223:8080
14/04/14 22:17:03 INFO Master: I have been elected leader! New state: ALIVE
14/04/14 22:17:06 INFO Master: Registering worker tc-52-223:20599 with 1 cores, 4.0 GB RAM
14/04/14 22:17:06 INFO Master: Registering worker tc_52_224:21371 with 1 cores, 4.0 GB RAM
14/04/14 22:17:06 INFO RemoteActorRefProvider$RemoteDeadLetterActorRef: Message [org.apache.spark.deploy.DeployMessages$RegisteredWorker] from Actor[akka://sparkMaster/user/Master#1972530850] to Actor[akka://sparkMaster/deadLetters] was not delivered. [1] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
14/04/14 22:17:26 INFO Master: Registering worker tc_52_224:21371 with 1 cores, 4.0 GB RAM
14/04/14 22:17:26 INFO RemoteActorRefProvider$RemoteDeadLetterActorRef: Message [org.apache.spark.deploy.DeployMessages$RegisterWorkerFailed] from Actor[akka://sparkMaster/user/Master#1972530850] to Actor[akka://sparkMaster/deadLetters] was not delivered. [2] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
14/04/14 22:17:46 INFO Master: Registering worker tc_52_224:21371 with 1 cores, 4.0 GB RAM
14/04/14 22:17:46 INFO RemoteActorRefProvider$RemoteDeadLetterActorRef: Message [org.apache.spark.deploy.DeployMessages$RegisterWorkerFailed] from Actor[akka://sparkMaster/user/Master#1972530850] to Actor[akka://sparkMaster/deadLetters] was not delivered. [3] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
14/04/14 22:18:06 INFO Master: akka.tcp://sparkWorker@tc_52_224:21371 got disassociated, removing it.
14/04/14 22:18:06 INFO Master: akka.tcp://sparkWorker@tc_52_224:21371 got disassociated, removing it.
14/04/14 22:18:06 INFO LocalActorRef: Message [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from Actor[akka://sparkMaster/deadLetters] to Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.11.52.224%3A61550-1#646150938] was not delivered. [4] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
14/04/14 22:18:06 INFO Master: akka.tcp://sparkWorker@tc_52_224:21371 got disassociated, removing it.
14/04/14 22:18:06 ERROR EndpointWriter: AssociationError [akka.tcp://sparkMaster@10.11.52.223:7077] -> [akka.tcp://sparkWorker@tc_52_224:21371]: Error [Association failed with [akka.tcp://sparkWorker@tc_52_224:21371]] [
akka.remote.EndpointAssociationException: Association failed with [akka.tcp://sparkWorker@tc_52_224:21371]
Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: tc_52_224/10.11.52.224:21371
]
14/04/14 22:18:06 INFO Master: akka.tcp://sparkWorker@tc_52_224:21371 got disassociated, removing it.
14/04/14 22:18:06 ERROR EndpointWriter: AssociationError [akka.tcp://sparkMaster@10.11.52.223:7077] -> [akka.tcp://sparkWorker@tc_52_224:21371]: Error [Association failed with [akka.tcp://sparkWorker@tc_52_224:21371]] [
akka.remote.EndpointAssociationException: Association failed with [akka.tcp://sparkWorker@tc_52_224:21371]
Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: tc_52_224/10.11.52.224:21371
]
14/04/14 22:18:06 ERROR EndpointWriter: AssociationError [akka.tcp://sparkMaster@10.11.52.223:7077] -> [akka.tcp://sparkWorker@tc_52_224:21371]: Error [Association failed with [akka.tcp://sparkWorker@tc_52_224:21371]] [
akka.remote.EndpointAssociationException: Association failed with [akka.tcp://sparkWorker@tc_52_224:21371]
Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: tc_52_224/10.11.52.224:21371
]
14/04/14 22:18:06 INFO Master: akka.tcp://sparkWorker@tc_52_224:21371 got disassociated, removing it.
14/04/14 22:19:03 WARN Master: Removing worker-20140414221705-tc_52_224-21371 because we got no heartbeat in 60 seconds
14/04/14 22:19:03 INFO Master: Removing worker worker-20140414221705-tc_52_224-21371 on tc_52_224:21371  

Below is 223's (Worker) worker log:

14/04/14 22:17:06 INFO Worker: Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
14/04/14 22:17:06 INFO Worker: Starting Spark worker tc-52-223:20599 with 1 cores, 4.0 GB RAM
14/04/14 22:17:06 INFO Worker: Spark home: /data/test/spark-0.9.1-bin-hadoop1
14/04/14 22:17:06 INFO WorkerWebUI: Started Worker web UI at http://tc-52-223:8081
14/04/14 22:17:06 INFO Worker: Connecting to master spark://xx.xx.52.223:7077...
14/04/14 22:17:06 INFO Worker: Successfully registered with master spark://xx.xx.52.223:7077

Below is 224's (Worker) worker log:

Spark Command: /usr/local/jdk/bin/java -cp :/data/test/spark-0.9.1-bin-hadoop1/conf:/data/test/spark-0.9.1-bin-hadoop1/assembly/target/scala-2.10/spark-assembly_2.10-0.9.1-hadoop1.0.4.jar -Dspark.akka.logLifecycleEvents=true -Djava.library.path= -Xms512m -Xmx512m org.apache.spark.deploy.worker.Worker spark://10.11.52.223:7077 --webui-port 8081
========================================

log4j:WARN No appenders could be found for logger (akka.event.slf4j.Slf4jLogger).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
14/04/14 22:17:06 INFO Worker: Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
14/04/14 22:17:06 INFO Worker: Starting Spark worker tc_52_224:21371 with 1 cores, 4.0 GB RAM
14/04/14 22:17:06 INFO Worker: Spark home: /data/test/spark-0.9.1-bin-hadoop1
14/04/14 22:17:06 INFO WorkerWebUI: Started Worker web UI at http://tc_52_224:8081
14/04/14 22:17:06 INFO Worker: Connecting to master spark://xx.xx.52.223:7077...
14/04/14 22:17:26 INFO Worker: Connecting to master spark://xx.xx.52.223:7077...
14/04/14 22:17:46 INFO Worker: Connecting to master spark://xx.xx.52.223:7077...
14/04/14 22:18:06 ERROR Worker: All masters are unresponsive! Giving up.

Below is my spark-env.sh:

JAVA_HOME=/usr/local/jdk
export SPARK_MASTER_IP=tc-52-223
export SPARK_WORKER_CORES=1
export SPARK_WORKER_INSTANCES=1
export SPARK_MASTER_PORT=7077
export SPARK_WORKER_MEMORY=4g
export MASTER=spark://${SPARK_MASTER_IP}:${SPARK_MASTER_PORT}
export SPARK_LOCAL_IP=tc-52-223

I have googled for many solutions, but none of them work. Please help me.

JimLohse
FatGhosta

7 Answers

9

I'm not sure if this is the same issue I encountered, but you may want to try setting SPARK_MASTER_IP to the same address Spark binds to. In your example it looks like that would be 10.11.52.223, not tc-52-223.

It should be the same as what you see when you visit the master node's web UI on port 8080, something like: Spark Master at spark://ec2-XX-XX-XXX-XXX.compute-1.amazonaws.com:7077
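For example, a minimal spark-env.sh along these lines (a sketch; 10.11.52.223 comes from the logs in the question and will differ in your setup):

# spark-env.sh -- bind the master by IP rather than hostname
export SPARK_MASTER_IP=10.11.52.223   # must match the "Starting Spark master at spark://..." line
export SPARK_MASTER_PORT=7077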

Klugscheißer
  • @FatGhosta this probably solved your problem too? I am trying to track down a similar issue and I have found two questions where the resolution is unknown. Perhaps we will never know! I posted an in-depth answer on this other question: http://stackoverflow.com/questions/28453835/apache-sparck-error-could-not-connect-to-akka-tcp-sparkmaster/34499020#34499020. I think it's safest to stick with all IPs if you can, but you were using a much older version. How did this work out? – JimLohse Dec 28 '15 at 21:18
6

If you are getting a "Connection refused" exception, you can resolve it by checking:

=> Master is running on the specific host

netstat -at | grep 7077

You will get something similar to:

tcp        0      0 akhldz.master.io:7077 *:*             LISTEN  

If that is the case, then from your worker machine run host akhldz.master.io (replace akhldz.master.io with your master host; if something goes wrong, add a host entry in your /etc/hosts file), then
telnet akhldz.master.io 7077 (if this does not connect, your worker won't connect either).
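Put together, the connectivity check from the worker machine might look like this (akhldz.master.io stands in for your master's host name):

# On the worker: confirm the master's name resolves, then that the port is reachable
host akhldz.master.io          # should print the master's routable IP
telnet akhldz.master.io 7077   # "Connection refused" here means the worker will fail the same way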

=> Adding Host entry in /etc/hosts

Open /etc/hosts on your worker machine and add the following entry (example):

192.168.100.20   akhldz.master.io

PS: In the above case, Pillis had two IP addresses with the same host name, e.g.:

192.168.100.40  s1.machine.org
192.168.100.41  s1.machine.org

Hope that helps.

Yves M.
AkhlD
  • This looks like it's copied from the Apache Spark user mailing list, http://mail-archives.apache.org/mod_mbox/spark-user/201402.mbox/%3CCAKXOu_BkoUxwLZ9cvD5m_di3wLLk2AShjkh1CCAxdM3cNYn8uw@mail.gmail.com%3E. It might be better to add a link to that under the question and delete this answer? – TooTone Jan 05 '15 at 11:25
  • I answered there in the mailing list =P – AkhlD Jan 06 '15 at 11:52
  • :), but this looks like an answer to the question on the mailing list, not to the question here (it even mentions the OP on the mailing list rather than the Stack Overflow OP). – TooTone Jan 06 '15 at 12:41
2

There are a lot of answers and possible solutions, and this question is a bit old, but in the interest of completeness, there is a known Spark bug about hostnames resolving to IP addresses. I am not presenting this as the complete answer in all cases, but I suggest starting from a baseline of using all IPs and only the single config SPARK_MASTER_IP. With just those two practices I get my clusters to work; all the other configs, or using hostnames, just seem to muck things up.

So in your spark-env.sh, get rid of SPARK_WORKER_IP and change SPARK_MASTER_IP to an IP address, not a hostname.
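Applied to the spark-env.sh in the question, that would look something like the following (a sketch using the IP from the logs; adjust for your machines):

# spark-env.sh -- IPs only, no hostnames
export SPARK_MASTER_IP=10.11.52.223      # was tc-52-223
# and drop SPARK_LOCAL_IP (or set it per machine to that machine's own IP)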

I have treated this at more length in this answer.

For completeness, here's part of that answer:

Can you ping the box where the Spark master is running? Can you ping the worker from the master? More importantly, can you password-less ssh to the worker from the master box? Per the 1.5.2 docs you need to be able to do that with a private key AND have the worker entered in the conf/slaves file. I copied the relevant paragraph at the end.
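As a quick sanity check of both requirements (IPs taken from the question; this sketch is mine, not part of the quoted answer):

# From the master box, this should run without asking for a password:
ssh 10.11.52.224 hostname

# And conf/slaves on the master should list every worker, one per line:
#   10.11.52.223
#   10.11.52.224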

You can get a situation where the worker can contact the master but the master can't get back to the worker so it looks like no connection is being made. Check both directions.

I think issues with the slaves file on the master node, and with password-less ssh, can lead to errors similar to what you are seeing.

Per the answer I crosslinked, there's also an old bug but it's not clear how that bug was resolved.

JimLohse
0

Set the port for the Spark worker as well, e.g.: SPARK_WORKER_PORT=5078 ... check the spark-installation link for correct installation.
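For example, in spark-env.sh (5078 is just an illustration; any free port that is open in the firewall works):

# spark-env.sh -- give the worker a fixed port instead of a random one
export SPARK_WORKER_PORT=5078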

Arnav
0

Basically, your ports are blocked, so communication from master to worker is cut off. Check here: https://spark.apache.org/docs/latest/configuration.html#networking

In the "Networking" section, you can see some of the ports are by default random. You can set them to your choice like below:

import org.apache.spark.SparkConf

// Pin the normally-random ports to fixed values so they can be opened in the firewall
val conf = new SparkConf()
    .setMaster(master)
    .setAppName("namexxx")
    .set("spark.driver.port", "51810")
    .set("spark.fileserver.port", "51811")
    .set("spark.broadcast.port", "51812")
    .set("spark.replClassServer.port", "51813")
    .set("spark.blockManager.port", "51814")
    .set("spark.executor.port", "51815")
keypoint
0

In my case, I overcame the problem by adding entries for the hostname and IP address of each host to the /etc/hosts file, as follows:

For a cluster, the master has the following /etc/hosts content:

127.0.0.1       master.yourhost.com localhost localhost4 localhost.localdomain
192.168.1.10    slave1.yourhost.com
192.168.1.9     master.yourhost.com **# this line solved the problem**

Then I also did the SAME THING on the slave1.yourhost.com machine.
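To verify the fix, check what each name resolves to on both machines; each should now return the routable 192.168.1.x address rather than 127.0.0.1:

getent hosts master.yourhost.com
getent hosts slave1.yourhost.com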

Hope this helps.

0

I faced the same issue. You can resolve it with the procedure below: first, go to the /etc/hosts file and comment out the 127.0.1.1 address. Then go to the spark/sbin directory and start the Spark standalone daemons with this command:

./start-all.sh 

Alternatively, you can use ./start-master.sh and ./start-slave.sh to do the same. Now if you run spark-shell, pyspark, or any other Spark component, it will automatically create the Spark context object sc for you.
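For instance, starting the two daemons individually might look like this (master URL from the question's setup; in recent Spark versions start-slave.sh takes the master URL as an argument):

./start-master.sh
./start-slave.sh spark://10.11.52.223:7077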

Shubham Sharma