I was hoping someone out there would be able to look at this and let me know what I have missed. I have 4 machines and for some reason, only 1 of them can talk to the other 3 via their private IP address (on eth1).

The 4 machines are:

    mach01    10.176.193.17
    mach02    10.176.193.92
    mach03    10.176.193.27
    mach04    10.176.195.9

All of the machines are Debian lenny. From mach02, I can ping the other 3 machines no problem, and from the other machines, I can ping mach02. However, from mach01, mach03 and mach04 I can only ping mach02.

The output of "iptables --list" on all machines is:

    Chain INPUT (policy ACCEPT)
    target     prot opt source               destination

    Chain FORWARD (policy ACCEPT)
    target     prot opt source               destination

    Chain OUTPUT (policy ACCEPT)
    target     prot opt source               destination

So I do not believe there is a firewall issue. The routing table entries for eth1 on all machines are:

    10.176.192.0    *               255.255.224.0   U     0      0        0 eth1
    10.191.192.0    10.176.192.1    255.255.192.0   UG    0      0        0 eth1
    10.176.0.0      10.176.192.1    255.248.0.0     UG    0      0        0 eth1

So that looks fine as well. For some reason, ARP requests are failing from mach03 to anywhere other than mach02, and similarly for other machines.

    mach03$ arping -c 1 -I eth1 10.176.193.17
    ARPING 10.176.193.17

    --- 10.176.193.17 statistics ---
    1 packets transmitted, 0 packets received, 100% unanswered

I do not see any reason why ARP would fail like this, and have run out of ideas and places to look. Does anyone else with more experience in troubleshooting networking have any ideas?

Thanks

EDIT

After trying to ping mach01 from mach03, the following is in the ARP cache:

    $ arp -a
    ? (10.176.193.17) at <incomplete> on eth1
    ? (67.23.45.1) at 00:00:0C:07:AC:01 [ether] on eth0

And the other way around (so from mach03 to mach01):

    ? (10.176.193.92) at 40:40:FA:77:D7:94 [ether] on eth1
    ? (10.176.193.27) at <incomplete> on eth1
    ? (67.23.45.1) at 00:00:0C:07:AC:01 [ether] on eth0

And more details on eth1:

    $ ip addr show dev eth1
    3: eth1:  mtu 1500 qdisc pfifo_fast state UNKNOWN qlen 1000
        link/ether 40:40:16:e0:f3:dd brd ff:ff:ff:ff:ff:ff
        inet 10.176.193.17/19 brd 10.176.223.255 scope global eth1
        inet6 fe80::4240:16ff:fee0:f3dd/64 scope link
           valid_lft forever preferred_lft forever
Peter Sankauskas
  • I'm confused by your edit - you say the first one is "mach01 from mach03" and the second one is "from mach03 to mach01" which is the same thing ... could you fix your edit? :) – Neobyte Aug 20 '09 at 05:09
  • I suggest you choose two of the problematic hosts (say mach03 and 04), kill all routes on eth1, and then add back in just the 10.176.192.0 route. This simplifies troubleshooting. Once that is working you can add the others back in and see where things break. – Neobyte Aug 20 '09 at 05:15

6 Answers

3

Well you've discounted firewalling so...

The only things I can think of with my extremely limited networking knowledge are:

  1. Broadcast address is wrong on mach01/03/04.
  2. Routing order is messed up - in the example above, the 3rd entry overlaps the range of the 1st entry. Is the order of the routing entries identical on all machines? Maybe some machines are arp-ing on the wrong network.
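
Both points in the list above can be checked offline. The following is a minimal sketch, assuming the address and routes are exactly as pasted in the question; the `lookup` helper is hypothetical and only mimics longest-prefix matching, which is how the kernel chooses between overlapping routes regardless of their order in the table.

```python
import ipaddress

# eth1 address and prefix length copied from the `ip addr` output in the question.
iface = ipaddress.ip_interface("10.176.193.17/19")
print(iface.network)                    # network the address belongs to
print(iface.network.broadcast_address)  # expected broadcast address for it

# The three eth1 routes from the question: (destination, gateway or None).
ROUTES = [
    (ipaddress.ip_network("10.176.192.0/255.255.224.0"), None),
    (ipaddress.ip_network("10.191.192.0/255.255.192.0"), "10.176.192.1"),
    (ipaddress.ip_network("10.176.0.0/255.248.0.0"), "10.176.192.1"),
]

def lookup(dest, routes=ROUTES):
    """Return the matching route with the longest prefix, or None."""
    addr = ipaddress.ip_address(dest)
    matches = [route for route in routes if addr in route[0]]
    return max(matches, key=lambda route: route[0].prefixlen, default=None)

# mach03 is covered by both the /19 and the /13 route; the /19 should win.
print(lookup("10.176.193.27"))
```

If the computed broadcast address matches what `ip addr` reports and the lookup for each peer returns the route without a gateway, then neither the broadcast address nor the overlap explains the failure on its own.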

Does 'arping' work from 01/03/04 to 02 or are they updating their arp cache courtesy of incoming broadcast packets from 02?

Neobyte
  • Both. mach01 is updating its ARP cache with mach02's details from incoming broadcasts, but mach03/04 do not seem to be receiving the same broadcasts, and their ARP entries end up incomplete. – Peter Sankauskas Aug 19 '09 at 16:37
1

It's a bit strange. To start, I would run tcpdump on mach01, mach02 and mach03 to see whether mach01 and mach02 are getting the ARP requests from mach03 when you try to ping mach01, whether a reply is sent back to mach03, etc.
Do you know whether there could be a transparent firewall between the hosts? That could explain what you're seeing.
What is the network topology? Are there several switches between the hosts or just one? What kind of switch?

radius
  • These machines are in Rackspace Cloud, so the network topology is unknown. tcpdump reveals that mach01 does not receive any packets when arpinging from mach03. This leads me to believe the routing is screwed up. – Peter Sankauskas Aug 19 '09 at 16:40
1

Did you copy/paste this info, or try to type it? You have "193" in your network, except one machine shows 195. Then you show 192 in your routing tables.
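
For what it's worth, the differing third octets are not a problem by themselves if the /19 mask from the question is right. A quick sketch to confirm it, assuming the four addresses and the 255.255.224.0 mask are as posted (the `on_link` helper is made up for illustration):

```python
import ipaddress

# LAN subnet from the first eth1 route in the question (255.255.224.0 = /19).
LAN = ipaddress.ip_network("10.176.192.0/19")

# Private addresses as listed in the question.
HOSTS = {
    "mach01": "10.176.193.17",
    "mach02": "10.176.193.92",
    "mach03": "10.176.193.27",
    "mach04": "10.176.195.9",
}

def on_link(hosts, network):
    """Map each host name to True if its address lies inside the network."""
    return {name: ipaddress.ip_address(ip) in network for name, ip in hosts.items()}

print(LAN[0], "-", LAN[-1])   # first and last address covered by the /19
print(on_link(HOSTS, LAN))
```

A /19 starting at 10.176.192.0 spans third octets 192 through 223, so 193 and 195 both sit on the same link as far as the mask is concerned.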

gbarry
1

First of all, pick two machines that can't talk to each other and troubleshoot them first. Pick one of those two that can't talk to the other one and we'll use that one.

Your routing table looks strange: you've got a gateway flag set for two routes, the second of which overlaps with your original network route. Have you set static routes for some reason?

First of all, flush your routing table:

    # ip route flush table all

Secondly, add back in the route for the LAN subnet only:

    # ip route add 10.176.192.0/19 dev eth0

Are those machines still uncontactable?

If that doesn't work, please paste the output of:

    # ip addr
    # brctl show

My guess is that some VPN software / virtualization software / you or a colleague has modified your routes incorrectly.

Philip Reynolds
  • You use the new `ip route` command to flush routing tables, then you use the old and obsolete `route` command to add the route back? Just use the new command for everything: `ip route add 10.176.192.0/19 dev eth0` – Juliano Aug 19 '09 at 14:05
  • Old habits die hard. Edited post :) – Philip Reynolds Aug 19 '09 at 15:36
  • `ip route flush table all` is not an option for a machine I only have SSH access to. But I will try `ip route flush dev eth1` – Peter Sankauskas Aug 19 '09 at 17:24
  • Ok. Tried the flush and adding the route (on eth1 not eth0) and still not contactable. Even tried doing this on both machines without luck. – Peter Sankauskas Aug 19 '09 at 17:31
0

Please can you paste the full host routing table from one of the hosts? It's possible that there is a more specific route for another interface.

Also, please can you post the output of 'arp -a' immediately after one of the failed 'arping' attempts? This should show an incomplete entry for the IP address you tried to arping on [eth1], and will confirm that your host routing is configured correctly.
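
If you want to compare several hosts quickly, the incomplete entries can be pulled out of the `arp -a` text with a few lines of Python. This is only a sketch: the `unresolved` helper is hypothetical, and the regular expression assumes the Linux net-tools output format shown in the question, so other `arp` implementations may need a different pattern.

```python
import re

# Matches lines such as:
#   ? (10.176.193.27) at <incomplete> on eth1
#   ? (67.23.45.1) at 00:00:0C:07:AC:01 [ether] on eth0
ARP_LINE = re.compile(
    r"\((?P<ip>[\d.]+)\) at (?P<hw>\S+) (?:\[\w+\] )?on (?P<dev>\S+)"
)

def unresolved(arp_output, dev="eth1"):
    """Return the IPs on `dev` whose ARP entry is still <incomplete>."""
    found = []
    for line in arp_output.splitlines():
        match = ARP_LINE.search(line.strip())
        if match and match["dev"] == dev and match["hw"] == "<incomplete>":
            found.append(match["ip"])
    return found

# Sample text copied from the second `arp -a` listing in the question.
SAMPLE = """\
? (10.176.193.92) at 40:40:FA:77:D7:94 [ether] on eth1
? (10.176.193.27) at <incomplete> on eth1
? (67.23.45.1) at 00:00:0C:07:AC:01 [ether] on eth0
"""

print(unresolved(SAMPLE))
```

An incomplete entry on eth1 means the request was sent out of the right interface and simply never got a reply, which points away from host routing.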

Murali Suriar
0

It turns out I discovered an issue with Rackspace Cloud Server's networking. The issue was escalated and has been resolved.

I would like to thank everyone who responded.

Peter Sankauskas