
I have been trying to set up a Hadoop cluster; I managed to get it running in pseudo-distributed mode, and my one machine wordcounted Tolstoy's War and Peace in about thirty seconds.

I am now trying to add a second machine to my cluster. To help set it up, I created a user group hadoop with permissions to start, stop, and run jobs on the Hadoop server (though I left editing the configuration files to root only). I made sure that all members of the group hadoop can ssh from the master node to the slave node using their public keys. I installed Hadoop 1.0.0.3 using dpkg, edited the masters and slaves files correctly on both the master node and the slave node, and changed the configuration to point to the correct NameNode and JobTracker:

In core-site.xml:
fs.default.name=hdfs://$MASTER:9000

In mapred-site.xml:
mapred.job.tracker=$MASTER:9001

where $MASTER is the hostname of my master machine.
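
Spelled out as XML, those two settings correspond to something like the following (a sketch only; the enclosing <configuration> element and the rest of each file are omitted, and $MASTER stands for my master's hostname):

In core-site.xml:

<property>
  <name>fs.default.name</name>
  <value>hdfs://$MASTER:9000</value>
</property>

In mapred-site.xml:

<property>
  <name>mapred.job.tracker</name>
  <value>$MASTER:9001</value>
</property>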

My NN, SNN, and JobTracker are starting correctly; however, my slave node is not able to connect to my master node! This is the behavior I see in my DataNode log:

2012-05-25 09:36:23,390 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: $MASTER/10.23.95.197:9000. Already tried 0 time(s).
2012-05-25 09:36:23,390 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: $MASTER/10.23.95.197:9000. Already tried 1 time(s).
...
...
connect to server: $MASTER/10.23.95.197:9000. Already tried 9 time(s).
2012-05-25 09:36:31,394 INFO org.apache.hadoop.ipc.RPC: Server at $MASTER/10.23.95.197:9000 not available yet, Zzzzz...

over and over again. I see the same thing in the TaskTracker log, except the port number listed there is 9001. lsof tells me that the correct processes are listening on both ports. What is going wrong?

All logs from $MASTER can be found at http://pastebin.com/ZzyKBQVJ

Thanks; please let me know if you have any questions.

ILikeFood
  • What does `lsof -i:9000` give? – Yohann May 25 '12 at 16:58
    java PID 12324 TCP localhost:9000 LISTEN
    java PID 12324 TCP localhost:9000->localhost:52373 ESTABLISHED
    java PID 12598 TCP localhost:52373->localhost:9000 ESTABLISHED
    12324 is my NameNode process, 12598 is a DataNode running on $MASTER.
    – ILikeFood May 25 '12 at 17:03
  • Can you paste all the logs from the `start-all.sh` command? (edit or pastebin); check [this post](http://stackoverflow.com/questions/8076439/namenode-not-getting-started). – Yohann May 25 '12 at 17:52
  • I've posted all relevant logs from $MASTER's /var/log/hadoop directory. Note that the DataNode and TaskTracker logs I have posted are from the instances running on $MASTER, which have no problems connecting; the issue is with the ones running on $SLAVE, whose logs look exactly as I described in the OP. $MASTER = ncoiasi1, $SLAVE = ncoiasi2 – ILikeFood May 25 '12 at 18:24
  • Can you telnet from the machine which isn't working to 10.23.95.197:9000? – mgorven May 26 '12 at 04:40
  • No, I cannot. (connection refused) – ILikeFood May 29 '12 at 20:38

4 Answers

0

This issue is usually caused not by the Hadoop configuration but by the network configuration of the cluster, and that turned out to be the case for me. If you are seeing this behavior, check your routing, /etc/hosts, and so on for problems before digging into the Hadoop files.
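
A quick way to sanity-check the network side (a sketch only; ncoiasi1/ncoiasi2 are the hostnames from my setup, and nc is assumed to be installed):

# On the slave: the master's hostname should resolve to its real LAN IP, not 127.0.0.1
getent hosts ncoiasi1

# On the master: the NameNode and JobTracker should be listening on an address the
# slave can reach; lines showing 127.0.0.1 or localhost mean only local clients can connect
netstat -tlnp | grep -E ':(9000|9001)'

# From the slave: can a TCP connection to the NameNode port be opened at all?
nc -zv ncoiasi1 9000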

ILikeFood
0

I faced a similar problem while setting up a 5-node cluster on Rackspace. I had double-checked my /etc/hosts file; the issue was actually the firewall. The data nodes communicate with the master on port 9000, so you will need to open that port. You will also need to open port 50010 on the data nodes so that the master can communicate with them for managing the task trackers.

In addition, the master node should have port 9001 open for JobTracker communication.

Update iptables for all of these.

On the master node:

iptables -I INPUT -p tcp --dport 9000  -j ACCEPT
iptables -I INPUT -p tcp --dport 9001  -j ACCEPT
service iptables save
service iptables reload

On each of the datanodes/tasktrackers:

iptables -I INPUT -p tcp --dport 50010  -j ACCEPT
service iptables save
service iptables reload
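
To confirm the rules took effect, you can list them and probe the ports from another node (a sketch; nc is assumed to be available, and $MASTER/$SLAVE are the hostnames from the question):

# List the current INPUT rules numerically
iptables -L INPUT -n --line-numbers

# From a slave, check that the NameNode and JobTracker ports are reachable
nc -zv $MASTER 9000
nc -zv $MASTER 9001

# From the master, check the DataNode port on a slave
nc -zv $SLAVE 50010
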
fuero
0

I was also getting the same error while running a MapReduce program on the cluster; sometimes the job succeeded and sometimes it failed.

All the systems in my cluster are connected on the local network. I resolved this problem by disabling the firewall on all machines of the cluster with this command:

$ systemctl disable firewalld
or
$ systemctl stop firewalld

Use sudo before the command if you don't have root access. I am using Fedora 20; if you are using an older version of Linux, check how to disable the firewall on those systems.

I hope this will help you.

Regards, Sanjay Thakre

  • Depending on your security requirements, this might be too drastic an approach. It would surely be better to devise a way to allow the necessary traffic through the firewall, yes? – Felix Frank Aug 08 '14 at 09:37
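
As the comment above suggests, instead of disabling firewalld entirely you can open just the ports Hadoop needs (a sketch; the port list mirrors the iptables answer above and may need adjusting for your cluster):

# On the master: open the NameNode and JobTracker ports
$ sudo firewall-cmd --permanent --add-port=9000/tcp
$ sudo firewall-cmd --permanent --add-port=9001/tcp

# On each data node: open the DataNode port
$ sudo firewall-cmd --permanent --add-port=50010/tcp

# Apply the permanent rules
$ sudo firewall-cmd --reload
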
0

I also faced a similar issue. (I am using Ubuntu 17.0.) First, edit /etc/hosts on both the master and slave machines:

> sudo gedit /etc/hosts

127.0.0.1  localhost
192.168.201.101 master
192.168.201.102 slave1
192.168.201.103 slave2

Secondly, open /etc/hosts.allow on each machine:

> sudo gedit /etc/hosts.allow

and add the entry ALL:192.168.201. (the trailing dot makes it match the whole 192.168.201.x subnet).

Thirdly, disable the firewall:

> sudo ufw disable

That got it working.
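
To check that the name resolution part is right, something like this from each machine can help (a sketch; the hostnames and IPs are the ones from the /etc/hosts example above, and port 9000 is taken from the question's fs.default.name):

# Each name should resolve to its LAN address from the table above, not to 127.0.0.1
getent hosts master slave1 slave2

# And the NameNode port on the master should be reachable from the slaves
nc -zv master 9000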