
From time to time I get the following errors in Cloudera Manager:

This DataNode is not connected to one or more of its NameNode(s). 

and

The Cloudera Manager agent got an unexpected response from this role's web server.

(usually together, sometimes only one of them)

In most references to these errors on SO and Google, the issue is a configuration problem (and the data node never connects to the name node).

In my case the data nodes usually connect at startup, but lose the connection after some time, so it doesn't appear to be a bad configuration.

  • Any other options?
  • Is it possible to force the data node to reconnect to the name node?
  • Is it possible to "ping" the name node from the data node (simulate the data node's connection attempt)? See the sketch after this list.
  • Could it be some kind of resource problem (too many open files / connections)?
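To make the last two bullets concrete, here is a rough sketch of what I mean by "pinging" the name node and checking for file-descriptor exhaustion: from a data node host it does a plain TCP connect to the NameNode IPC port and counts the DataNode JVM's open file descriptors. The host address, the port 8020, and the pgrep pattern are assumptions for my cluster, not something Cloudera ships.

    import socket
    import subprocess

    NAMENODE_HOST = "10.56.144.16"   # placeholder: my NameNode's address
    NAMENODE_IPC_PORT = 8020         # common default NameNode IPC port; may differ per cluster

    def can_reach(host, port, timeout=5):
        """Return True if a plain TCP connect to host:port succeeds within timeout."""
        try:
            socket.create_connection((host, port), timeout=timeout).close()
            return True
        except OSError as exc:
            print("connect to %s:%d failed: %s" % (host, port, exc))
            return False

    def datanode_open_fds(pattern="datanode.DataNode"):
        """Count open file descriptors of the DataNode JVM (the pgrep pattern is an assumption)."""
        pid = subprocess.check_output(["pgrep", "-f", pattern], text=True).split()[0]
        lines = subprocess.check_output(["lsof", "-p", pid], text=True).splitlines()
        return max(len(lines) - 1, 0)   # drop the lsof header line

    if __name__ == "__main__":
        print("NameNode reachable:", can_reach(NAMENODE_HOST, NAMENODE_IPC_PORT))
        try:
            print("DataNode open file descriptors:", datanode_open_fds())
        except Exception as exc:
            print("could not count file descriptors:", exc)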

Sample logs (the errors vary from time to time):

2014-02-25 06:39:49,179 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: exception:
java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/10.56.144.18:50010 remote=/10.56.144.28:48089]
        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:165)
        at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:153)
        at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:114)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendPacket(BlockSender.java:504)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:673)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:338)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:92)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:64)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:221)
        at java.lang.Thread.run(Thread.java:662)
2014-02-25 06:39:49,180 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.56.144.18:50010, dest: /10.56.144.28:48089, bytes: 132096, op: HDFS_READ, cliID: DFSClient_NONMAPREDUCE_1315770947_27, offset: 0, srvID: DS-990970275-10.56.144.18-50010-1384349167420, blockid: BP-1381780028-10.56.144.16-1384349161741:blk_-8718668700255896235_5121440, duration: 480291679056
2014-02-25 06:39:49,180 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.56.144.18, storageID=DS-990970275-10.56.144.18-50010-1384349167420, infoPort=50075, ipcPort=50020, storageInfo=lv=-40;cid=cluster16;nsid=7043943;c=0):Got exception while serving BP-1381780028-10.56.144.16-1384349161741:blk_-8718668700255896235_5121440 to /10.56.144.28:48089
java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/10.56.144.18:50010 remote=/10.56.144.28:48089]
        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:165)
        at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:153)
        at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:114)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendPacket(BlockSender.java:504)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:673)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:338)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:92)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:64)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:221)
        at java.lang.Thread.run(Thread.java:662)
2014-02-25 06:39:49,181 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: host.com:50010:DataXceiver error processing READ_BLOCK operation  src: /10.56.144.28:48089 dest: /10.56.144.18:50010
java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/10.56.144.18:50010 remote=/10.56.144.28:48089]
        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:165)
        at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:153)
        at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:114)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendPacket(BlockSender.java:504)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:673)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:338)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:92)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:64)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:221)
        at java.lang.Thread.run(Thread.java:662)
Ophir Yoktan
  • Check whether Hadoop is in safe mode or not; sometimes this happens because of that. – Vikas Hardia Feb 25 '14 at 09:49
  • It's not correlated to safe mode (usually it happens when the servers aren't in safe mode). Some of the data nodes are connected, some not. There is no specific data node that tends to lose the connection (every time it's a different data node). – Ophir Yoktan Feb 25 '14 at 10:21
  • any connectivity issue? – Vikas Hardia Feb 25 '14 at 10:23
  • Not that I'm aware of (I tested ping and SSH between the servers). It could be a momentary loss of connection, but the data node doesn't reconnect to the name node afterwards. – Ophir Yoktan Feb 25 '14 at 10:30
  • I am seeing the same errors, however, in my case the datanode reconnects to the namenode. At any given time 5 to 10 datanodes report 'NameNode Connectivity' error on the CM and they keep changing. Were you able to find the root cause of this issue? – scott Dec 29 '15 at 23:47

4 Answers


Hadoop uses specific ports for communication between the DataNode and the NameNode. It could be that a firewall is blocking those specific ports. Check the default ports on the Cloudera website and test the connectivity to the NameNode on those specific ports.
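For example, a minimal probe along those lines (a sketch, not a Cloudera tool) loops over the ports from the documentation and attempts a TCP connect from each DataNode host. The NameNode hostname is a placeholder and the port numbers are the common CDH defaults, which may differ on your cluster:

    import socket

    NAMENODE_HOST = "namenode.example.com"            # replace with your NameNode host
    DEFAULT_PORTS = {8020: "NameNode IPC (fs.defaultFS)",
                     50070: "NameNode HTTP UI"}

    for port, desc in DEFAULT_PORTS.items():
        try:
            socket.create_connection((NAMENODE_HOST, port), timeout=5).close()
            print("%-28s %5d  open" % (desc, port))
        except OSError as exc:
            print("%-28s %5d  unreachable (%s)" % (desc, port, exc))

If a port only shows up as unreachable while the problem is occurring, that points at an intermittent firewall or network issue rather than the Hadoop configuration.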

ceedee

If you're using Linux, then please make sure that you have configured these properties correctly:

  1. Disable SELinux. Type the command getenforce on the CLI; if it shows enforcing, SELinux is enabled. Change it in the /etc/selinux/config file.

  2. Disable the firewall.

  3. Make sure you have the NTP service installed.

  4. Make sure your server can SSH to all client nodes.

  5. Make sure all the nodes have an FQDN (Fully Qualified Domain Name) and an entry in /etc/hosts with name and IP.

If these settings are in place, then please attach the log of any of your datanodes that got disconnected.
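As a rough illustration only, a script like the following walks through most of that checklist on a RHEL/CentOS-style node; the command names (getenforce, service iptables status, ntpstat) are assumptions and vary by distribution:

    import socket
    import subprocess

    def run(cmd):
        """Run a shell command, returning (exit code, first line of output)."""
        proc = subprocess.run(cmd, shell=True, capture_output=True, text=True)
        out = (proc.stdout + proc.stderr).strip()
        return proc.returncode, (out.splitlines()[0] if out else "")

    CHECKS = {
        "SELinux":  "getenforce",               # want "Disabled" or "Permissive"
        "Firewall": "service iptables status",  # want "not running" (RHEL/CentOS 6 style)
        "NTP":      "ntpstat",                  # want exit code 0 (clock synchronised)
    }

    for name, cmd in CHECKS.items():
        rc, line = run(cmd)
        print("%-10s rc=%d  %s" % (name, rc, line))

    # /etc/hosts sanity check: the FQDN should resolve to a non-loopback address,
    # otherwise DataNode registration with the NameNode can misbehave.
    fqdn = socket.getfqdn()
    addr = socket.gethostbyname(fqdn)
    flag = "  <-- suspicious, check /etc/hosts" if addr.startswith("127.") else ""
    print("FQDN %s -> %s%s" % (fqdn, addr, flag))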

Abhinav

I ran into this error

"This DataNode is not connected to one or more of its NameNode(s). "

and I solved it by turning off safe mode and restarting the HDFS service.
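For reference, this is roughly what that looks like scripted (a small sketch wrapping the standard hdfs dfsadmin -safemode commands); the HDFS service restart itself is still done from Cloudera Manager:

    import subprocess

    def safemode(action):
        """action is one of 'get', 'enter', 'leave'."""
        out = subprocess.check_output(["hdfs", "dfsadmin", "-safemode", action], text=True)
        return out.strip()

    if __name__ == "__main__":
        status = safemode("get")
        print(status)                    # e.g. "Safe mode is ON"
        if "ON" in status:
            print(safemode("leave"))     # e.g. "Safe mode is OFF"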

Hadi GhahremanNezhad

I realize you took some steps to test this, but intermittent disconnects still make it sound like a connectivity issue.

If nodes really don't come back after a disconnect, that may be a configuration issue, which could well be completely independent from the reason why they disconnect in the first place.

Dennis Jaheruddin