
I am building a cluster out of a few machines. To improve performance, we chose to use bonding on the Ethernet interfaces (each link is 1 Gbit). I have installed the ifenslave-2.6 package on Ubuntu 10.04 and configured the interfaces. The following is my configuration.

    # The loopback network interface
    auto lo
    iface lo inet loopback

    # The primary network interface
    auto eth0
    iface eth0 inet manual
    bond-master bond0

    auto eth1
    iface eth1 inet manual
    bond-master bond0

    auto bond0
    iface bond0 inet static
    address x.x.x.x
    gateway x.x.x.1
    netmask 255.255.255.0
    bond-mode 6
    bond-miimon 100
    bond-slaves none

I also tried `bond-slaves eth0 eth1` on the bond0 stanza instead of `bond-master bond0` on each slave, but it made no difference.
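
For completeness, that variant looked roughly like this (same placeholder addresses as above, so treat it as a sketch rather than a verified config):

    # alternative: declare the slaves on the bond itself
    auto eth0
    iface eth0 inet manual

    auto eth1
    iface eth1 inet manual

    auto bond0
    iface bond0 inet static
    address x.x.x.x
    gateway x.x.x.1
    netmask 255.255.255.0
    bond-mode 6
    bond-miimon 100
    bond-slaves eth0 eth1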

So, as you can see, I am running the bond in balance-alb (mode 6) to get load balancing both upstream and downstream. Roughly every four days we find that the machines cannot talk to each other: no pings, and they are not visible to nmap (`nmap -sP x.x.x.x`). Sometimes some machines are visible while others are not. They are all clones, so this behaviour is strange. I first checked `arp -a` to see if I was having trouble there, and there were a lot of incomplete entries (usually after an nmap scan), but even after those entries timed out and the table settled I still could not ping the machines.
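
For reference, the checks described above amount to roughly the following (addresses are placeholders; the grep is just a convenience filter):

    # ping a peer node on the same subnet
    ping -c 3 x.x.x.x

    # sweep for hosts that answer
    nmap -sP x.x.x.x

    # look for incomplete ARP entries after the scan
    arp -a | grep -i incomplete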

They are all on the same subnet, with no firewall, and all go to the same switch. My switch config is simple and is as follows:

    interface GigabitEthernet1/1
    !
    interface GigabitEthernet1/2
    !
    interface GigabitEthernet1/3
    switchport mode access
    spanning-tree portfast
    !
    interface GigabitEthernet1/4
    switchport mode access
    spanning-tree portfast
    ! 
    interface GigabitEthernet1/5
    switchport mode access
    spanning-tree portfast
    .
    .
    .
    !
    interface GigabitEthernet1/17
    switchport mode access
    spanning-tree portfast
    !
    interface GigabitEthernet1/18
    switchport mode access
    spanning-tree portfast

All of the ports are on VLAN 1. Port 1 goes to our router, and ports 3-18 are all configured the same way: mode set to access and spanning-tree set to portfast. Each machine takes up two ports on this switch, which is a Cisco 4948. I can talk to the machines perfectly well from our gateway, or from machines outside our gateway, but getting them to talk to each other internally is becoming a problem, specifically because we plan to run Hadoop. Any help, nudge, or opinion would really be appreciated. Thank you.

Also, here is the output of `ifenslave-2.6 -a`:

    ifenslave.c:v1.1.0 (December 1, 2003)
    o Donald Becker (becker@cesdis.gsfc.nasa.gov).
    o Detach support added on 2000/10/02 by Willy Tarreau (willy at meta-x.org).
    o 2.4 kernel support added on 2001/02/16 by Chad N. Tindel
    (ctindel at ieee dot org).
    The result of SIOCGIFFLAGS on lo is 49.
    The result of SIOCGIFADDR is 00.00.7f.00.
    The result of SIOCGIFHWADDR is type 772  00:00:00:00:00:00.
    The result of SIOCGIFFLAGS on bond0 is 1443.
    The result of SIOCGIFADDR is 00.00.ffffff80.0a.
    The result of SIOCGIFHWADDR is type 1  00:1b:21:47:a0:c1.

Even if this looks fine, could you let me know? Then the problem might just be somewhere else.

Bartha
  • could you post the results of `ifenslave -a` as well to see what the network startup script has made out of your config lines? – the-wabbit Feb 21 '13 at 17:32
  • Thanks @syneticon-dj . I have added the output from ifenslave -a. – Bartha Feb 21 '13 at 18:01
  • `cat /proc/net/bonding/bond0`? Preferably while the link is broken? You could be seeing a situation where one of the enslaved interfaces is unable to send or receive data but is not being recognized as such and remains an active bonding member, effectively swallowing half of your data transmission attempts. Try reproducing the issue and running `tcpdump` on the respective physical interfaces to verify that. Looking at the state of the FDB on the switch might be insightful too - issue `show mac-addr` and check the MAC addresses mapped to your two server connection ports. – the-wabbit Feb 21 '13 at 21:06

1 Answer


We use LACP/802.3ad for our bonded connections throughout our network, from our SAN (4xGigE + 2xGigE) <-> server (2xGigE) links to our inter-switch links (a mixture of 2x and 4xGigE).

You get both bandwidth aggregation and redundancy benefits, and (the main benefit for me) it is a damn sight easier to manage than static link aggregation.

While I know it doesn't directly answer your question, you might find it makes link aggregation a lot more manageable (or even work in the first place!).
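
As a rough sketch only (reusing the interface names and placeholder addresses from your question), an 802.3ad bond on the Ubuntu side would look something like this:

    auto eth0
    iface eth0 inet manual
    bond-master bond0

    auto eth1
    iface eth1 inet manual
    bond-master bond0

    auto bond0
    iface bond0 inet static
    address x.x.x.x
    gateway x.x.x.1
    netmask 255.255.255.0
    # 802.3ad/LACP instead of balance-alb
    bond-mode 802.3ad
    bond-miimon 100
    # ask the partner for LACPDUs every second rather than every 30s
    bond-lacp-rate 1
    bond-slaves none

On the 4948 side, the two ports for each server would then be bundled into a port-channel in LACP active mode. The port and channel-group numbers below are just examples:

    interface GigabitEthernet1/3
     switchport mode access
     channel-group 10 mode active
    !
    interface GigabitEthernet1/4
     switchport mode access
     channel-group 10 mode active

Note that, unlike balance-alb, 802.3ad needs both links from a host to land on ports of the same port-channel on the same switch (or chassis).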


My only other suggestion: hook Wireshark up and see what is going across the wire.
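
If the boxes are headless, tcpdump on the individual slave interfaces is the quickest way to do that while the problem is occurring; something along these lines (interface names taken from your question):

    # watch ARP and ICMP on each physical slave while reproducing the problem
    tcpdump -ni eth0 arp or icmp
    tcpdump -ni eth1 arp or icmp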

syserr0r