
I have a small cluster with a Cloudera Hadoop installation. After a few days, I noticed that the errors/dropped/frame counters keep growing whenever I run ifconfig -a. From a high-level perspective, MapReduce jobs run smoothly and there are no errors from the end-user perspective, but I am wondering whether fixing this would noticeably improve performance.

All the nodes, including the namenode, were installed and configured from the same Red Hat kickstart server, following the same recipe, so I would say they are the "same". However, I did not notice any network errors on the namenode, while the errors show up consistently on all the datanodes.

For example, my namenode looks like:

namenode.datafireball.com | success | rc=0 >>
eth4      Link encap:Ethernet  HWaddr ...  
          inet addr:10.0.188.84  Bcast:10.0.191.255  Mask:...
          inet6 addr: xxxfe56:5632/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:9000  Metric:1
          RX packets:11711470 errors:0 dropped:0 overruns:0 frame:0
          TX packets:6195067 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:6548704769 (6.0 GiB)  TX bytes:12093046450 (11.2 GiB)

Data node:

datanode1.datafireball.com | success | rc=0 >>
eth4      Link encap:Ethernet  HWaddr ...  
          inet addr:10.0.188.87  Bcast:10.0.191.255  Mask:...
          inet6 addr: xxxff24/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:9000  Metric:1
          RX packets:27474152 errors:0 dropped:36072 overruns:36072 frame:36072
          TX packets:28905940 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:158509736560 (147.6 GiB)  TX bytes:180857576718 (168.4 GiB)  

I also did some stress testing following Michael's tutorial, and I can see the errors increasing as the job runs, so the errors are not just something left over from when I first set up the cluster.
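If it helps to see how fast the counters climb while a job is running, a quick way (assuming eth4 is the interface carrying the traffic, as in the outputs below) is to refresh the error lines every few seconds:

$ watch -n 5 -d "ifconfig eth4 | grep -E 'errors|dropped|overruns'"

The -d flag highlights whatever changed between refreshes, so a counter that keeps incrementing stands out immediately.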

FYI, each box has two NICs: the first four ports are the embedded NIC (03:00.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet (rev 20)), which we are not using at all; we are using the 10Gb NIC (0e:00.0 Ethernet controller: Mellanox Technologies MT26448 [ConnectX EN 10GigE, PCIe 2.0 5GT/s] (rev b0)).

This is the driver/firmware version and some general info for the NIC:

$ ethtool -i eth4
driver: mlx4_en
version: 2.0 (Dec 2011)
firmware-version: 2.8.600
bus-info: 0000:0e:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: no
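For anyone who wants to dig deeper, the drops are reported as RX overruns, which usually points at the receive ring filling up faster than the kernel drains it. The per-queue hardware counters and ring sizes can be dumped with standard ethtool options; the exact counter names are driver-specific, and the ring size below is only an illustrative value, so check the maximum reported by -g first:

$ ethtool -S eth4 | grep -Ei 'drop|discard|buf'   # hardware/per-queue counters
$ ethtool -g eth4                                 # current and maximum ring sizes
$ sudo ethtool -G eth4 rx 8192                    # example: grow the RX ring toward the reported maximum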

I am surprised to find that the datanodes have network errors while the namenode does not, since they have the same setup and configuration. Can anyone give me some guidance?

B.Mr.W.

1 Answer


B.Mr.W.!

To answer your question: my hypothesis, based on what each component does, is that the namenode handles only metadata, managing the location of blocks and servers, so its requests and responses use very little network bandwidth. The datanode is responsible for the data itself and can use the network bandwidth in its entirety, since it transfers the 'big' data, hence the dropped packets.

I suggest you check the configuration of the switch port connected to this server's network interface, specifically whether jumbo frames are enabled (MTU = 9000).

The same setting must be verified in the network interface configuration on the server itself.
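On a Red Hat style install like the one described in the question, the interface MTU can be checked and changed on the fly with iproute2, and persisted in the ifcfg file (the path below is the standard RHEL network-scripts location; adjust the interface name as needed):

$ ip link show eth4 | grep mtu
$ sudo ip link set dev eth4 mtu 9000
$ echo 'MTU=9000' | sudo tee -a /etc/sysconfig/network-scripts/ifcfg-eth4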

A good way to check whether the configuration is missing at one of the two ends is to look for dropped packets with 'ifconfig -a', executed on the server's OS console:

[root@<hostname> ~]# ifconfig -a
bond0: flags=5187<UP,BROADCAST,RUNNING,MASTER,MULTICAST> mtu 9000
inet Ip.Ad.re.ss netmask net.m.as.k broadcast bro.d.ca.st
ether XX:XX:XX:XX:XX:XX txqueuelen 1000 (Ethernet)
RX packets 522849928 bytes 80049415915 (74.5 GiB)
RX errors 274721 dropped 276064 overruns 0 frame 274721
TX packets 520714273 bytes 72697966414 (67.7 GiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

In this case, jumbo frames are configured only on the server's network interface and not on the switch port, which is why the RX errors/dropped/frame counters keep growing.
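A simple end-to-end test, assuming a Linux ping, is to send a jumbo-sized packet with the don't-fragment bit set from one node to another; 8972 bytes of payload plus 28 bytes of IP/ICMP headers adds up to the 9000-byte MTU, so the ping only succeeds if the interfaces and every switch port in the path accept jumbo frames:

$ ping -M do -s 8972 -c 3 datanode1.datafireball.com

If the switch port is still at the default 1500 MTU, this fails with a 'message too long' error or simply times out, while a normal-sized ping works fine.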

Regards, Caseiro.