It is a tower server located in a datacenter. I have complete access to the machine.

I'm experiencing short dropouts, usually under 10 minutes, a couple of times a week or so. Unfortunately, last Saturday it was off for about 5 minutes, on for 3, off for 10, and so on for about half an hour. I had to power-cycle the outlet since I simply couldn't afford to debug at that moment; there are websites I can't leave inaccessible for a long time. After about 40 minutes, once the system had booted up after a hard-drive check, the network was just fine again. It was stable for a day after that, then last night there was another dropout, 1-10 minutes long (I'm pinging from another machine every 10 minutes to get a status).

I have never found anything useful in the logs, assuming I'm looking in the right places. No load spikes either. I tried to get a KVM connected several times during a dropout, but the dropout always ends before support can set up the KVM. Only once did I manage to get access over KVM during a dropout. I can confirm that I couldn't reach the network but the machine was working just fine. Unfortunately, it was too short to find out anything else.

Every time, my housing provider isn't aware of any dropout on their side. I have several more machines there, and they all run just fine. But it could still be a misbehaving router or simply a bad Ethernet cable.

I need to find the cause of these dropouts because I can't afford many more website interruptions like that.

Is there any nice tool (a network monitor) I could use? I need something simple enough that I can actually understand the log and point at a specific cause.

Also, does this strike you as a software issue, a hardware issue with the machine, or a problem outside the machine, within the network? Is there even a way to tell which one, if the network goes offline just like that? For instance, I guess there won't be any preceding errors if it is a bad cable somewhere.

Saix

1 Answer

The first thing to check is whether any link state changes are logged in the kernel log. You can view the most recent kernel log messages using the dmesg command. Look for messages similar to this:

eth2: link down
eth2: link up, 100Mbps, full-duplex, lpa 0xC5E1
eth2: link down
eth2: link up, 100Mbps, full-duplex, lpa 0x45E1

If you see such messages, you need to check the cabling between the computer and the switch. If you do not see them, you should check a slightly higher layer of the stack.
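As a rough way to spot such messages in a long kernel log, here is a minimal Python sketch that counts link-down events per interface. It assumes the message format shown above (`ethX: link up` / `ethX: link down`); other drivers word these messages differently, so the regular expression is an assumption you would likely need to adjust. The function names `link_events` and `count_flaps` are made up for this example.

```python
import re
import sys
from collections import Counter

# Matches lines like "eth2: link down" or "eth2: link up, 100Mbps, ...".
# Assumption: the driver logs in this format; many drivers use other wording.
LINK_RE = re.compile(r"\b(?P<iface>[A-Za-z][\w.-]*): link (?P<state>up|down)\b",
                     re.IGNORECASE)


def link_events(lines):
    """Return (interface, state) tuples for every link state message found."""
    events = []
    for line in lines:
        match = LINK_RE.search(line)
        if match:
            events.append((match.group("iface"), match.group("state").lower()))
    return events


def count_flaps(events):
    """Count how many times each interface reported 'link down'."""
    return Counter(iface for iface, state in events if state == "down")


if __name__ == "__main__":
    # Reads kernel log text from standard input and prints a per-interface count.
    for iface, downs in count_flaps(link_events(sys.stdin)).items():
        print(f"{iface}: {downs} link-down event(s)")
```

One way to use it would be to save it under a name of your choosing (say `linkflaps.py`) and run `dmesg | python3 linkflaps.py`. Any interface with repeated link-down events is a candidate for the cabling check described above.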

Another problem could be that the MAC or IP address is duplicated. If that is the reason for your problem, running tcpdump on the server would likely show outgoing but no incoming packets. It may be the case, though, that the first outgoing packet clears the problem.
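To make the "outgoing but no incoming" check less of an eyeball exercise, here is a small Python sketch that tallies packet directions from tcpdump's text output. It assumes IPv4 traffic captured with numeric addresses (`tcpdump -n`) in the default one-line format, `IP src > dst: ...`. The function name `direction_counts` and the addresses in the usage note are placeholders, not anything from the original setup.

```python
import re

# Default tcpdump text output for IPv4 looks like:
#   12:00:00.000001 IP 192.0.2.10.443 > 198.51.100.7.51000: Flags [S.], ...
# Assumption: captured with -n so addresses are numeric, IPv4 only.
PKT_RE = re.compile(r"\bIP (?P<src>\S+) > (?P<dst>[^\s:]+):")


def _is_host(endpoint, ip):
    """True if endpoint is the given IP, with or without a '.port' suffix."""
    return endpoint == ip or endpoint.startswith(ip + ".")


def direction_counts(lines, server_ip):
    """Count packets sent from and received by server_ip."""
    counts = {"out": 0, "in": 0}
    for line in lines:
        match = PKT_RE.search(line)
        if not match:
            continue  # Not an IPv4 packet line (ARP, IPv6, summary, ...).
        if _is_host(match.group("src"), server_ip):
            counts["out"] += 1
        elif _is_host(match.group("dst"), server_ip):
            counts["in"] += 1
    return counts
```

For example, you could capture to a text file during a dropout, then call `direction_counts(open("capture.txt"), "192.0.2.10")` with the server's real address. A result with many `out` packets and zero `in` packets would be consistent with the duplicated-address scenario, although it does not prove it.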

kasperd
  • There were indeed up and down logs on the server and the switch. I've known about these for some time already. I wasn't sure whether they pointed to a hardware/cabling issue beyond doubt. Now I have managed to convince support, so they moved the server to another shelf within the datacenter. It's been almost 24 hours without any dropout. So far it looks good. Bad cabling, I guess. I might have saved myself a lot of trouble a long time ago, oh well. – Saix May 07 '14 at 18:49