I'm trying to diagnose a network-related problem. Please read these points before suggesting an answer (apologies if more information is required; I will add anything people ask for).
- We have a server-only network (5 app servers, 4 DB servers, and a few other servers) that appears to be suffering packet loss between servers
- I can see this happening in Wireshark: there are a lot of TCP Retransmission, TCP Out-Of-Order and TCP Dup ACK packets, and I think some TCP ZeroWindow packets too.
- There appear to be a lot of bad checksums at the IP layer
- I think the network adapters are under a constantly high (90-100%) load due to the extra retries caused by this packet loss
- As the external requests on this network increase (to the app servers) the network performance decreases
- The app servers generate their own traffic when serving external requests
- The external requests come through a core router and the network is on its own segment
- This high load "magically" disappeared after 1-2 days. I say magically because we were only monitoring the adapters at the time the load dropped. There is still packet loss showing in Wireshark, albeit a lesser amount.
- Nothing points to a compromised server.
- Unfortunately we don't have physical access to any of the hardware
- We can't disrupt the current service
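To make "empirical evidence" concrete, this is roughly the kind of number I would like to collect per server: the TCP retransmission rate over an interval, computed from two snapshots of the kernel's TCP counters. This is only a sketch and it assumes the servers run Linux and expose `/proc/net/snmp` (not something stated above); the function names `parse_tcp_counters` and `retransmission_rate` are made up for illustration.

```python
def parse_tcp_counters(snmp_text):
    """Parse the two 'Tcp:' lines (names, then values) of /proc/net/snmp-style text.

    Assumption: Linux layout, where the first 'Tcp:' line lists counter
    names and the second lists the matching values.
    """
    tcp_lines = [
        line.split()[1:]
        for line in snmp_text.splitlines()
        if line.startswith("Tcp:")
    ]
    if len(tcp_lines) != 2:
        raise ValueError("expected exactly two 'Tcp:' lines")
    names, values = tcp_lines
    return dict(zip(names, (int(v) for v in values)))


def retransmission_rate(before, after):
    """Retransmitted segments as a percentage of segments sent between two snapshots."""
    sent = after["OutSegs"] - before["OutSegs"]
    retrans = after["RetransSegs"] - before["RetransSegs"]
    if sent <= 0:
        return 0.0  # no traffic in the interval (or a counter reset)
    return 100.0 * retrans / sent
```

The idea would be to read `/proc/net/snmp` on each server, wait a fixed interval, read it again, and compare the resulting rate across servers. It needs no physical access and does not disrupt the service, but it only shows which hosts are affected, not what is causing the loss.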
Given the above, what is the best way to determine what is causing the packet loss? (We suspect a managed switch.)
Is there any software that can provide us with empirical evidence of what is causing the issues?
Thanks in advance