I have a small cluster with a Cloudera Hadoop installation. After a few days, I noticed that the errors/dropped/frame counters keep increasing when I run the ifconfig -a
command. (From a high-level perspective, MapReduce jobs run smoothly and there are no errors from the end-user perspective; I am wondering whether performance would be much better if I fixed this.)
All the nodes, including the namenode, are installed and configured from the same Red Hat kickstart server, following the same recipe, so I would say they are the "same". However, I did not notice any network errors on the namenode, while the errors show up on all the datanodes consistently.
For example, my namenode looks like:
namenode.datafireball.com | success | rc=0 >>
eth4 Link encap:Ethernet HWaddr ...
inet addr:10.0.188.84 Bcast:10.0.191.255 Mask:...
inet6 addr: xxxfe56:5632/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:9000 Metric:1
RX packets:11711470 errors:0 dropped:0 overruns:0 frame:0
TX packets:6195067 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:6548704769 (6.0 GiB) TX bytes:12093046450 (11.2 GiB)
Data node:
datanode1.datafireball.com | success | rc=0 >>
eth4 Link encap:Ethernet HWaddr ...
inet addr:10.0.188.87 Bcast:10.0.191.255 Mask:...
inet6 addr: xxxff24/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:9000 Metric:1
RX packets:27474152 errors:0 dropped:36072 overruns:36072 frame:36072
TX packets:28905940 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:158509736560 (147.6 GiB) TX bytes:180857576718 (168.4 GiB)
I also did some stress testing following Michael's tutorial, and I can see the errors increasing as the job runs. So the problem is ongoing under load, not just something left over from when I first set the cluster up.
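To correlate the drops with job activity, I sample the kernel's per-interface counters while the stress test runs (a minimal sketch; eth4 is my 10Gb interface, pass a different name as the first argument):

```shell
#!/bin/sh
# Print the RX drop/error counters a few times, one second apart,
# so increases can be matched against MapReduce job activity.
# eth4 is an assumption based on my setup; substitute your interface.
IFACE=${1:-eth4}
STATS=/sys/class/net/$IFACE/statistics
if [ ! -d "$STATS" ]; then
    echo "no such interface: $IFACE" >&2
    exit 0   # exit gracefully if the interface is absent
fi
for i in 1 2 3; do
    printf '%s rx_dropped=%s rx_fifo_errors=%s rx_frame_errors=%s\n' \
        "$(date +%T)" \
        "$(cat "$STATS/rx_dropped")" \
        "$(cat "$STATS/rx_fifo_errors")" \
        "$(cat "$STATS/rx_frame_errors")"
    sleep 1
done
```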
FYI, each box has two NICs. The first four ports belong to the embedded NIC:
03:00.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet (rev 20)
which we are not using at all. We are using the 10Gb NIC:
0e:00.0 Ethernet controller: Mellanox Technologies MT26448 [ConnectX EN 10GigE, PCIe 2.0 5GT/s] (rev b0)
Here is the driver and firmware info for that NIC:
$ ethtool -i eth4
driver: mlx4_en
version: 2.0 (Dec 2011)
firmware-version: 2.8.600
bus-info: 0000:0e:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: no
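Since dropped, overruns and frame all increase together on the RX side, my reading is that the receive ring may be overflowing under load. One thing I am considering trying (a common mitigation, assuming the mlx4_en driver exposes ring tuning; the 8192 value is purely illustrative, check the "Pre-set maximums" that ethtool -g reports on your own hardware) is enlarging the RX ring:

```shell
#!/bin/sh
# Inspect and (optionally) enlarge the RX ring on the 10GbE interface.
# eth4 and the 8192 value are examples, not measured values from my box;
# read the "Pre-set maximums" section of `ethtool -g` output first.
IFACE=${1:-eth4}
if ! command -v ethtool >/dev/null 2>&1 || [ ! -e "/sys/class/net/$IFACE" ]; then
    echo "skipping: ethtool or $IFACE not available" >&2
    exit 0
fi
ethtool -g "$IFACE"              # current vs. maximum ring sizes
ethtool -G "$IFACE" rx 8192      # raise the RX ring (needs root)
ethtool -S "$IFACE" | grep -iE 'drop|err'   # driver counters, to see if it helped
```

Note that ethtool -G does not persist across reboots, so the setting would also need to go into the interface's ifcfg script or a boot-time hook.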
I am surprised to find that the datanodes have network errors while the namenode doesn't, since they have the same setup and configuration. Can anyone give me some guidance?