We recently set up a new Ubuntu 12.04 LTS server on our network. It's not fully configured, so it's not doing much beyond running sshd and a default apache2 install. This evening it appears to have crashed: it wasn't responding to the network or the keyboard. Worst of all, it took down the entire network.

My knowledge of the network stack below OSI layer 3 is very limited, so the rest confuses me. When this machine was physically connected to the network, no other machine could connect to the outside internet. When things were broken, running arp showed that our gateway's IP address (10.0.1.1) was listed as "invalid." Unplugging the server from the network fixed the problem, and plugging it back in broke it again. So the crashed server was advertising itself as owning the gateway's IP address?

There's nothing at all in syslog during the time when it was causing problems. Any ideas about how to figure out what went wrong or what we can do to prevent it from happening again? I'm hesitant to even put the machine back on the network right now.

**** Update ****

It crashed again, and I ran tcpdump -penn arp (thanks bahamat!) for several minutes and got this... (timestamps and duplicate lines removed)

00:1e:65:f8:dc:24 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 60: Request who-has 10.0.1.1 tell 10.0.2.191, length 46
00:1e:65:f8:dc:24 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 60: Request who-has 10.0.1.44 tell 10.0.2.191, length 46
60:d8:19:d4:71:d6 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 60: Request who-has 10.0.1.1 tell 10.0.2.125, length 46
d4:9a:20:04:e9:78 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Request who-has 192.168.1.1 tell 192.168.1.100, length 28
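
A capture like this is easier to read once it is tallied per sender. The following is a minimal sketch, assuming the `tcpdump -penn arp` line format shown above with timestamps already stripped; the names `ARP_LINE` and `summarize_arp` are made up for illustration. It counts requests and replies per source MAC, which makes it obvious that this dump contains requests only and nothing answering for 10.0.1.1.

```python
import re
from collections import Counter

# Assumed line shape, copied from the capture above (timestamps removed).
# "Reply" lines are matched too, although none appear in this dump.
ARP_LINE = re.compile(
    r"^(?P<src>[0-9a-f:]{17}) > (?P<dst>[0-9a-f:]{17}), ethertype ARP \(0x0806\), "
    r"length \d+: (?P<op>Request|Reply) (?P<body>.*)$"
)

def summarize_arp(lines):
    """Count ARP requests and replies per source MAC address."""
    counts = Counter()
    for line in lines:
        match = ARP_LINE.match(line.strip())
        if match:
            counts[(match.group("src"), match.group("op"))] += 1
    return counts

capture = """\
00:1e:65:f8:dc:24 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 60: Request who-has 10.0.1.1 tell 10.0.2.191, length 46
00:1e:65:f8:dc:24 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 60: Request who-has 10.0.1.44 tell 10.0.2.191, length 46
60:d8:19:d4:71:d6 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 60: Request who-has 10.0.1.1 tell 10.0.2.125, length 46
d4:9a:20:04:e9:78 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Request who-has 192.168.1.1 tell 192.168.1.100, length 28
"""

summary = summarize_arp(capture.splitlines())
for (mac, op), n in sorted(summary.items()):
    print(mac, op, n)

# Total number of ARP replies seen in the capture.
replies = sum(n for (_, op), n in summary.items() if op == "Reply")
print("replies seen:", replies)
```

Feeding it the full, unfiltered capture (duplicates included) would also show whether any single MAC is flooding the segment.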

**** Update 2 ****

When the network is functioning properly, arping -c4 10.0.1.1 returns this:

ARPING 10.0.1.1
60 bytes from c0:c1:c0:77:25:8e (10.0.1.1): index=0 time=267.982 usec
60 bytes from c0:c1:c0:77:25:8e (10.0.1.1): index=1 time=422.955 usec
60 bytes from c0:c1:c0:77:25:8e (10.0.1.1): index=2 time=299.215 usec
60 bytes from c0:c1:c0:77:25:8e (10.0.1.1): index=3 time=366.926 usec

--- 10.0.1.1 statistics ---
4 packets transmitted, 4 packets received,   0% unanswered (0 extra)

When the bad server is plugged in, arping -c4 10.0.1.1 returns:

ARPING 10.0.1.1

--- 10.0.1.1 statistics ---
4 packets transmitted, 0 packets received, 100% unanswered (0 extra)

**** Context ****

  • 10.0.x.x is the main subnet.
  • 10.0.1.1 is the main internet gateway
  • 10.0.1.44 is a printer
  • 10.0.2.* devices are all laptops / workstations
  • I have no idea what's using the 192.168.x.x subnet -- your guesses are at least as good as mine. A VM on a workstation? A misconfigured WAP? Somebody re-sharing wifi? A machine that failed to DHCP?
  • The offending Ubuntu server's MAC address ends in cd:80, so it isn't listed in the dump. It should DHCP to 10.0.3.3
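
For readers equally new to ARP: a host resolves on-link addresses directly with ARP and sends everything else to the gateway's MAC, which it also learns via ARP. That is why unanswered ARP requests for 10.0.1.1 cut off the outside internet for everyone. The sketch below only illustrates the on-link test. It assumes that "10.0.x.x" means 10.0.0.0/16 (the netmask is not stated in the post), and the `hosts` labels and `on_main_subnet` helper are hypothetical.

```python
import ipaddress

# Assumption: "10.0.x.x" is 10.0.0.0/16; the actual netmask is not given.
MAIN_SUBNET = ipaddress.ip_network("10.0.0.0/16")

# Addresses taken from the capture and the context list above.
hosts = {
    "gateway": "10.0.1.1",
    "printer": "10.0.1.44",
    "laptop": "10.0.2.191",
    "ubuntu server (expected lease)": "10.0.3.3",
    "unknown device": "192.168.1.100",
}

def on_main_subnet(ip):
    """Return True if the address is on-link for the assumed main subnet."""
    return ipaddress.ip_address(ip) in MAIN_SUBNET

for name, ip in hosts.items():
    status = "on-link" if on_main_subnet(ip) else "foreign"
    print(f"{name:32} {ip:15} {status}")
```

Under that assumption, only the 192.168.1.100 device falls outside the main subnet, so its ARP traffic is noise on the segment rather than a competitor for the gateway's address.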

Thanks for any help. This ARP stuff is all voodoo to me. Packets just go to IP addresses, right? ;)

Leopd
  • On another host run `tcpdump -penn arp` then plug the machine in and see what's going on. If you describe what's happening someone might be able to provide more help. – bahamat Jul 09 '12 at 04:50
  • `tcpdump` output added. Thanks bahamat! – Leopd Jul 09 '12 at 15:44
  • Could you add the output of `arping -c 4` to the question? The tcpdump log shows that there are requests but not who is answering. The output of `arp -a` might be helpful too. – Christopher Perrin Jul 09 '12 at 17:46
  • I'm sorry, I meant `arping -c4 `. – Christopher Perrin Jul 10 '12 at 17:55
  • Okay, I tried that... – Leopd Jul 11 '12 at 21:37
  • That is strange. It seems like the gateway is somehow being kept from responding to the ARP requests. I can only speculate. Possibly the network device on the failed server is stuck in a loop and is spamming the gateway with (ARP) requests. The next logical step would be to plug another computer into the failed device and monitor every bit that gets sent. Maybe you'll get some information. – Christopher Perrin Aug 09 '12 at 14:35

2 Answers

Just had the exact same issue. All of a sudden most of my network went down. The only part still working was WiFi, and even there I could only reach the router: I could not reach the WAN, and none of the wired LAN computers answered my pings. After rebooting the router several times to no avail, I resorted to unplugging all the Ethernet cables. All of a sudden it worked again; I reconnected the cables and everything went down. After a bit of trial and error I found the culprit: my headless Ubuntu 12.04 server. I could kill the network by plugging it in and revive it by unplugging it. Eventually I resorted to pulling the power. When it came back up it played nicely. I checked syslog and, to my great surprise, there was absolutely nothing there:

Sep 17 21:21:44 *** Normal event occuring
Sep 17 21:22:16 *** Normal event occuring
Sep 17 21:22:48 *** Normal event occuring
Sep 17 21:23:20 *** Normal event occuring
Sep 17 22:45:36 Atlas kernel: imklog 5.8.6, log source = /proc/kmsg started.
Sep 17 22:45:36 Atlas rsyslogd: [origin software="rsyslogd" swVersion="5.8.6" x-pid="1048" x-info="http://www.rsyslog.com"] start
Sep 17 22:45:36 Atlas rsyslogd: rsyslogd's groupid changed to 103
Sep 17 22:45:36 Atlas rsyslogd: rsyslogd's userid changed to 101

Strange indeed, and kind of worrying. Not only did my server, which has been stable since I first fired it up, go down, but it managed to bring the rest of the network down with it.

Petter E

Well, I can tell you that the machine on 192.168.1.x has a MAC address issued to Apple.

Are you receiving the ARP requests on the gateway itself? What about dumping traffic from the switch? It sounds like the Ubuntu machine might be receiving ARPs that it shouldn't, which could be confusing the switch.