Weird Cisco switch problem - minutes with ultimate packetloss

Question

We were forced to replace our Linksys/Cisco SR2016 in the datacenter because of a dead port.

So we looked through our supplier's stock, and the only interesting switch they had was the Cisco SLM2024 Smart Switch, so we got it.

I went to the datacenter, configured the switch (set the IP), and swapped it in on Saturday night. Ever since then we have had serious trouble with it. Most of the time it works fine, but from time to time it goes down for 1-20 minutes, with about 90% packet loss to all the connected servers. Outside those windows the switch works completely fine.

The other switch we have there is a Linksys/Cisco SRW2016, and if I move all the cables from the SLM2024 to this SRW2016, everything works fine. I'm pretty sure there were no loops.

The uplink cable goes to a Catalyst 37xx family switch.

I asked the telehouse's tech support whether they had seen similar problems in the past, but they say they haven't. I would like to be completely sure the problem is in the switch before I return it to the supplier (because I'm not sure how I would demonstrate it).

Thanks for your opinions!

score 1 · Accepted Answer · answered Dec 07 '10 at 14:18

Checking the spanning tree statistics on the switch should tell you more about the possibility of a loop; look for the topology age. When a loop occurs, the topology is constantly rebuilding itself to compensate, so a high topology age (5 hours or more) indicates a loop-free network.
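As a rough illustration of that check, here is a minimal Python sketch. It assumes you can poll the switch's topology-change counter and topology age by some means (web UI, SNMP, CLI); how you collect them is up to you. The function names and the 5-hour default are only illustrative, taken from the rule of thumb above.

```python
from typing import List, Tuple


def topology_change_events(
    samples: List[Tuple[float, int]]
) -> List[Tuple[float, float, int]]:
    """Return (start, end, delta) for each polling interval in which the
    topology-change counter increased.

    `samples` is a time-ordered list of (timestamp_seconds, counter_value).
    A counter that goes down (e.g. after a reboot) is ignored.
    """
    events = []
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        if c1 > c0:
            events.append((t0, t1, c1 - c0))
    return events


def looks_loop_free(age_seconds: float, min_age_hours: float = 5.0) -> bool:
    """Apply the rule of thumb: a topology age of several hours or more
    suggests the spanning tree has not been rebuilding itself."""
    return age_seconds >= min_age_hours * 3600


if __name__ == "__main__":
    polls = [(0, 10), (60, 10), (120, 14), (180, 14), (240, 15)]
    for start, end, delta in topology_change_events(polls):
        print(f"{delta} topology change(s) between t={start}s and t={end}s")
```

If the intervals reported here line up with the packet-loss windows, spanning tree is a likely suspect.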

I have seen issues with Broadcom Pro series NICs and my Alcatel switches that sound very similar to what you are experiencing. The problem was very intermittent and frustrating until I found out that it was actually the auto-negotiation on the switch.

I solved it by disabling auto-negotiation and hard-coding the speed and duplex on all the ports. This is really a best practice in a server environment anyway; I just got lazy and figured I'd let autoneg handle it.

The other thing you could do is run a packet capture on the segment and see whether you are somehow getting reset frames or sequencing errors.
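To verify what a server actually ended up with, something like the following sketch may help. It assumes the servers run Linux, where the kernel exposes the current link speed and duplex under /sys/class/net/<interface>/; the interface name `eth0` and the expected values are placeholders, not something from this thread.

```python
from pathlib import Path
from typing import Optional, Tuple


def read_link_settings(
    iface: str, sysfs_root: str = "/sys/class/net"
) -> Tuple[Optional[int], Optional[str]]:
    """Return (speed_mbps, duplex) for a Linux interface, or None for any
    value that cannot be read (e.g. the link is down)."""
    base = Path(sysfs_root) / iface

    def _read(name: str) -> Optional[str]:
        try:
            return (base / name).read_text().strip()
        except OSError:
            return None

    raw_speed = _read("speed")
    duplex = _read("duplex")
    try:
        speed = int(raw_speed) if raw_speed is not None else None
    except ValueError:
        speed = None
    if speed is not None and speed < 0:
        speed = None  # the kernel reports -1 when the speed is unknown
    return speed, duplex


def link_matches(
    iface: str,
    expected_speed: int,
    expected_duplex: str,
    sysfs_root: str = "/sys/class/net",
) -> bool:
    """True if the interface currently runs at the expected speed and duplex."""
    return read_link_settings(iface, sysfs_root) == (expected_speed, expected_duplex)


if __name__ == "__main__":
    # "eth0" is a placeholder; substitute the server's real interface name.
    print(read_link_settings("eth0"))
```

A server reporting half duplex, or a lower speed than the port is configured for, would point at a negotiation mismatch.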

Also look at your flow control settings on the switch.

Thank you all for your help! Yesterday evening I found wrong spanning tree settings on the older switch (SRW2016). I've turned spanning tree on for the problematic ports and since then everything seems fine (knock knock...). All this time I had been looking at the settings of the new switch without a thought that the problem could be in the old one (which it shouldn't have been, since I did a factory defaults reset about two days ago). Never mind. Thank you all, especially Nick; the question is answered! — smachat, Dec 08 '10 at 06:58

score 0 · Answer 2 · answered Dec 07 '10 at 10:29

Are you seeing packet loss "between servers on the switch" or packet loss "between servers on the switch on one side and servers outside on the other, but the local servers have no issue between each other"?

If it is the latter, I'd start by hard-configuring speed and duplex on both ends of the uplink. Or, at least, make sure both ends are set the same way (both nailed down, or both auto-negotiating).

score 0 · Answer 3 · answered Dec 07 '10 at 13:52

What you are describing sounds like a loop. I would double-check the cabling first. Aside from a loop or the switch itself being the problem, you might be able to take a look at which ports are pushing the most traffic through them as well and go from there to hunt down the devices on those ports. You can also try running Wireshark on one of the servers experiencing packet loss to see what the server is seeing on the network at the time of disruption.
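To know exactly when the disruptions happen (and to match them against a capture or the switch counters), it can help to log probe results and extract the outage windows afterwards. Below is a minimal Python sketch. It assumes you already collect time-ordered (timestamp, reply_received) samples, for example from a periodic ping; the function name, window size, and threshold are illustrative choices only.

```python
from typing import List, Tuple


def find_outages(
    samples: List[Tuple[float, bool]],
    window: int = 10,
    loss_threshold: float = 0.5,
) -> List[Tuple[float, float]]:
    """Return (start, end) timestamps of periods where the loss ratio over
    the last `window` probes is at or above `loss_threshold`.

    `samples` is a time-ordered list of (timestamp, reply_received).
    """
    if window < 1:
        raise ValueError("window must be at least 1")

    outages = []
    start = None
    end = None
    for i in range(window - 1, len(samples)):
        recent = samples[i - window + 1 : i + 1]
        lost = sum(1 for _, ok in recent if not ok)
        ts = samples[i][0]
        if lost / window >= loss_threshold:
            if start is None:
                start = ts  # first probe at which the window turned bad
            end = ts
        elif start is not None:
            outages.append((start, end))
            start = None
    if start is not None:
        outages.append((start, end))  # outage still ongoing at end of data
    return outages


if __name__ == "__main__":
    pattern = [True] * 4 + [False] * 4 + [True] * 4
    probes = [(float(t), ok) for t, ok in enumerate(pattern)]
    print(find_outages(probes, window=4, loss_threshold=0.75))
```

Running the same logger against a server on each switch gives a simple record of which device the loss follows, which is also useful evidence if the switch has to go back to the supplier.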
