I am building a cluster from a few machines. To improve performance we chose to bond the Ethernet interfaces (each link is 1 Gbit/s). I have installed the ifenslave-2.6 package on Ubuntu 10.04 and configured the interfaces. The following is my configuration:
# The loopback network interface
auto lo
iface lo inet loopback

# The primary network interface
auto eth0
iface eth0 inet manual
    bond-master bond0

auto eth1
iface eth1 inet manual
    bond-master bond0

auto bond0
iface bond0 inet static
    address x.x.x.x
    gateway x.x.x.1
    netmask 255.255.255.0
    bond-mode 6
    bond-miimon 100
    bond-slaves none
I also tried bond-slaves eth0 eth1 instead of bond-master bond0, but it made no difference.
So, as you can see, I am running the bond in balance-alb mode (mode 6) so that traffic is balanced in both directions, upstream and downstream. Every so often (roughly every four days) the machines can no longer talk to each other: pings fail, and the hosts do not show up in an nmap ping scan (nmap -sP x.x.x.x). Sometimes some machines are visible while others are not. They are all clones, so this behavior is strange. I first checked arp -a to see whether the problem was there, and found a lot of incomplete entries (these usually appear after an nmap scan). But even after the entries timed out and the table settled, I still had trouble pinging the machines.
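To capture more detail the next time this happens, a minimal sketch like the one below could record the per-slave link state on each machine. It assumes the bonding driver exposes its status at /proc/net/bonding/bond0 in the usual "Key: value" layout, with one "Slave Interface:" section per slave. parse_bonding_status is a hypothetical helper name of my own, not part of ifenslave.

```python
import os

BOND_STATUS_PATH = "/proc/net/bonding/bond0"  # assumed location of the bond status file


def parse_bonding_status(text):
    """Split bonding status text into bond-level fields and per-slave fields."""
    bond = {}
    slaves = {}
    current = bond  # fields before the first "Slave Interface" belong to the bond
    for line in text.splitlines():
        if ":" not in line:
            continue
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "Slave Interface":
            # Start a new section; later fields belong to this slave.
            current = slaves.setdefault(value, {})
        else:
            current[key] = value
    return bond, slaves


if __name__ == "__main__":
    if os.path.exists(BOND_STATUS_PATH):
        with open(BOND_STATUS_PATH) as handle:
            bond, slaves = parse_bonding_status(handle.read())
        print("mode:", bond.get("Bonding Mode"))
        for name, fields in sorted(slaves.items()):
            print(name, fields.get("MII Status"), fields.get("Link Failure Count"))
    else:
        print("no bonding status file at", BOND_STATUS_PATH)
```

Running this on a machine that has gone quiet and on one that is still reachable would show whether a slave is reported down or has a rising link failure count.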
The machines are all on the same subnet, with no firewall, and they all connect to the same switch. My switch config is simple and is as follows:
interface GigabitEthernet1/1
!
interface GigabitEthernet1/2
!
interface GigabitEthernet1/3
 switchport mode access
 spanning-tree portfast
!
interface GigabitEthernet1/4
 switchport mode access
 spanning-tree portfast
!
interface GigabitEthernet1/5
 switchport mode access
 spanning-tree portfast
.
.
.
!
interface GigabitEthernet1/17
 switchport mode access
 spanning-tree portfast
!
interface GigabitEthernet1/18
 switchport mode access
 spanning-tree portfast
All of the ports are on VLAN 1. Port 1 goes to our router, and ports 3-18 are all configured the same way: mode set to access and spanning-tree set to portfast. Each machine takes up two links on this switch. The switch is a Cisco 4948. I can talk to the machines without any problem from our gateway or from machines outside our gateway. But getting them to talk to each other is a problem, particularly because we plan to run Hadoop. Any help, nudge, or opinion would be much appreciated. Thank you.
I am also adding the output of ifenslave-2.6 -a:
ifenslave.c:v1.1.0 (December 1, 2003)
o Donald Becker (becker@cesdis.gsfc.nasa.gov).
o Detach support added on 2000/10/02 by Willy Tarreau (willy at meta-x.org).
o 2.4 kernel support added on 2001/02/16 by Chad N. Tindel
(ctindel at ieee dot org).
The result of SIOCGIFFLAGS on lo is 49.
The result of SIOCGIFADDR is 00.00.7f.00.
The result of SIOCGIFHWADDR is type 772 00:00:00:00:00:00.
The result of SIOCGIFFLAGS on bond0 is 1443.
The result of SIOCGIFADDR is 00.00.ffffff80.0a.
The result of SIOCGIFHWADDR is type 1 00:1b:21:47:a0:c1.
If this output looks fine, could you let me know? In that case the problem is probably somewhere else.
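For what it is worth, here is my attempt at reading the SIOCGIFFLAGS values above. This is a rough sketch that assumes ifenslave prints the flags in hexadecimal and that the bits are the IFF_* constants from <linux/if.h>. decode_if_flags is just a helper name I made up.

```python
# IFF_* interface flag bits as defined in <linux/if.h>, in ascending bit order.
IFF_FLAGS = [
    (0x1, "UP"),
    (0x2, "BROADCAST"),
    (0x4, "DEBUG"),
    (0x8, "LOOPBACK"),
    (0x10, "POINTOPOINT"),
    (0x20, "NOTRAILERS"),
    (0x40, "RUNNING"),
    (0x80, "NOARP"),
    (0x100, "PROMISC"),
    (0x200, "ALLMULTI"),
    (0x400, "MASTER"),
    (0x800, "SLAVE"),
    (0x1000, "MULTICAST"),
]


def decode_if_flags(hex_value):
    """Return the names of the flag bits set in a hexadecimal flags string."""
    value = int(hex_value, 16)  # assumption: the printed value is hexadecimal
    return [name for bit, name in IFF_FLAGS if value & bit]


if __name__ == "__main__":
    # Values copied from the ifenslave-2.6 -a output above.
    for ifname, flags in (("lo", "49"), ("bond0", "1443")):
        print(ifname, " ".join(decode_if_flags(flags)))
```

If that reading is right, lo decodes to UP LOOPBACK RUNNING and bond0 to UP BROADCAST RUNNING MASTER MULTICAST, so bond0 is at least up and flagged as a bonding master.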