Trying to identify bandwidth usage/spike on Linux Debian VM

Question

I am trying to identify an issue with a Linux (Debian) VM running under Hyper-V on Windows Server 2016.

The issue is that, at random intervals, I'm seeing massive bandwidth spikes that max out the physical server's network port, causing loss of connectivity to the physical server.

I have tried restricting the bandwidth to the VM within the 'Bandwidth Management' part of Hyper-V Server but it is having no effect.

I have also tried using Wondershaper (https://github.com/magnific0/wondershaper) which, while it seems to limit 'some' traffic, does not stop the huge spikes, so it is not picking up whatever is causing this.

I have also tried using ethtool to change the interface speed, but the issue persists.

At this stage I'm at a loss to try and figure out what is causing this and how to prevent it.

Could anyone suggest anything else I can try to identify what could be causing this?

Thanks.

UPDATE : I installed netatop on the VM and caught the issue happening (below), but again, it doesn't show what is happening or where the bandwidth is being consumed (unless I'm missing something). You can see the issue, but how can it go over the network interface speed of 300Mbps that I have set? It's recording 965Mbps. How can that be?

UPDATE :

This is the traffic seen in the tcpdump capture when the issue happened, so it is definitely a malicious attack. There were thousands of these entries, from many different IP addresses, but all against the same website.com domain and all with pretty much the same payload.

0.000013 31.xxx.xxx.xxx 185.xxx.xxx.xxx DNS 1034 Standard query response 0x9764 ANY website.com RRSIG RRSIG RRSIG NSEC3PARAM website.com DNSKEY DNSKEY DNSKEY RRSIG RRSIG RRSIG RRSIG AAAA 2600:1f18:46d5:xxxx:xxxx:xxxx:91c8:a5b DNSKEY RRSIG RRSIG RRSIG RRSIG RRSIG SOA ns0.website.com TXT TXT TXT TXT TXT TXT TXT

If you can catch it when it's happening, use tcpdump on the Debian box. Also, have you looked to see if the time is predictable? Maybe it correlates with automatic updates? Lastly, look at your logs on the Debian box. — davidgo, Mar 28 '20 at 23:53
@davidgo It is unfortunately not predictable: sometimes not at all, then 3 times a day. I have analysed all the logs and there is nothing in them at the time it happens; I checked every single log possible. I just can't find it! Automatic updates should not be enabled, as this Debian doesn't have GNOME. Thank you. — omega1, Mar 29 '20 at 03:00

liverwust · Accepted Answer · 2020-03-29T22:22:36.837

Am I correct to assume that the blue line represents inbound traffic (downloaded to the VM from the outside) and that the purple line represents outbound (uploaded from the VM to the outside)? If so, then the Windows Quality of Service (QoS) features underpinning Hyper-V Bandwidth Management will not work to reduce the inbound spikes:

Note: You can use QoS to control outbound traffic, but not the inbound traffic. For example, with Hyper-V Replica, you can use QoS to control outbound traffic (from the primary server), but not the inbound traffic (from the Replica server).

Also see this TechNet discussion, which reinforces the relevance to Hyper-V:

I can confirm, that [Hyper-V maximum bandwidth] is applied for VM's outbound traffic only. But this fact is not mentioned in the documentation. Is this bug or feature?

Try to identify the specific application or service which is consuming the bandwidth. One way to do this is using atop, which is available in the Debian repositories. However, you will need to manually install the netatop kernel module, which enables per-process network accounting but is not included in the Debian package. Full instructions are on the website and are summarized here:

Download the latest netatop-x.x.tar.gz
Install the packages zlib1g-dev, build-essential, and linux-headers-amd64 (assuming 64-bit architecture)
Build and install the module and daemon. From the top directory of the extracted archive, run the following commands:
```
make
sudo make install
```
To load the module and start the daemon:
```
systemctl start netatop
```
To load the module and start the daemon automatically after boot:
```
systemctl enable netatop
```
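Before waiting for the next spike, it may be worth confirming that the kernel module actually loaded, since the per-process network figures depend on it. A minimal sketch, assuming the module registers under the name netatop in /proc/modules; the helper name netatop_loaded and its optional file argument are my own invention, not part of netatop:

```shell
#!/bin/bash
# Return success if the netatop kernel module appears in a module list.
# The list defaults to /proc/modules (the file lsmod reads); the optional
# argument exists only so the check can be tried against a saved copy.
# netatop_loaded is a hypothetical helper name, not a netatop command.
netatop_loaded() {
    local modules_file="${1:-/proc/modules}"
    grep -q '^netatop ' "$modules_file" 2>/dev/null
}

if netatop_loaded; then
    echo "netatop module is loaded"
else
    echo "netatop module is NOT loaded; check 'systemctl status netatop'"
fi
```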

Run sudo atop -n on the virtual machine and wait for a network spike. You will probably be able to spot the offending service by its high BANDWI and NET values, like sshd in this example:

By the way, I am assuming that your network graph is specifically measuring the virtual machine's network adapter. If not — for example, if it is measuring the physical adapter on the Hyper-V server — then it may actually be a Windows process which is causing the spikes. The approach to solving this would be similar, and you would start by finding an atop analogue for Windows.

UPDATE:

Your screenshot indicates that the number of Layer 3 IP packets during this time period (ipi = 866802) grossly exceeds the combined total of ICMP packets (icmpi = 199) plus Layer 4 TCP/UDP packets (tcpi=4316, udpi=47). This, plus the lack of participation by any running process, suggests that the VM is being flooded with malformed (malicious?) traffic by an outside source.
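To make the gap concrete, the counters from the screenshot can be subtracted directly. This is only a rough sketch of the arithmetic: the function name unaccounted_packets is made up for illustration, and the counters come from different layers of the stack, so treat the result as an indicator rather than an exact count:

```shell
#!/bin/bash
# Counters copied from the atop screenshot discussed above.
ipi=866802; tcpi=4316; udpi=47; icmpi=199

# unaccounted_packets IPI TCPI UDPI ICMPI   (hypothetical helper)
# Prints how many received IP packets were not counted as TCP, UDP or ICMP.
unaccounted_packets() {
    echo $(( $1 - ($2 + $3 + $4) ))
}

unaccounted=$(unaccounted_packets "$ipi" "$tcpi" "$udpi" "$icmpi")
# Integer percentage of received IP packets that are unaccounted for.
echo "unaccounted: $unaccounted of $ipi packets ($(( unaccounted * 100 / ipi ))%)"
```

With these numbers, 862240 of the 866802 received IP packets (about 99%) do not appear in the TCP, UDP, or ICMP counters.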

You'll want to apply davidgo's suggestion to use tcpdump. One way you might use it is by running a bash loop that waits until the number of incoming packets per second exceeds a threshold, and only then starts the capture:

```
#!/bin/bash
threshold=10000   # packets/sec; note that atop(1) reports packets per 10sec by default
waiting=1
while [[ $waiting -eq 1 ]]
do
    atopsar -w 10 1 | tail -n1 | awk "\$2 < $threshold {exit 1}"
    waiting=$?
done
tcpdump -ieth0 -w out.pcap
```

After the problem occurs, you can copy the resulting out.pcap file to another computer and then open it with Wireshark. From there, apply Statistics -> Endpoints to see where the excess traffic is coming from. If a device in your local network — maybe even the Hyper-V server — is generating the traffic, then you can reconfigure it to stop. If a single IP on the Internet is generating the traffic, then you can find a way to blacklist it using your firewall. If it is many IPs, then you may need to read about Distributed Denial-of-Service attacks (DDoS) and how to use your firewall and/or ISP to block the traffic. Many DDoS articles are available online, like this one from Amazon.
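If you would rather stay on the command line than copy the capture to a machine with Wireshark, a rough equivalent of the Endpoints view can be put together with tcpdump and standard text tools. This is a sketch, not a replacement for Wireshark: top_sources is a hypothetical helper name, it only looks at IPv4 packets, and it assumes the usual "tcpdump -nn" text layout in which the third field is the source address with an optional port suffix:

```shell
#!/bin/bash
# top_sources (hypothetical helper): read `tcpdump -nn` text output on stdin
# and print IPv4 source addresses ranked by packet count, one
# "count address" pair per line. IPv6 ("IP6") lines are ignored.
top_sources() {
    awk '$2 == "IP" {
             split($3, a, ".")    # a[5], if present, is the source port
             print a[1] "." a[2] "." a[3] "." a[4]
         }' | sort | uniq -c | sort -rn | awk '{print $1, $2}'
}

# Usage, assuming out.pcap is the capture written by tcpdump:
#   tcpdump -nn -r out.pcap | top_sources | head -n 20
# Restricting the count to DNS responses (UDP source port 53):
#   tcpdump -nn -r out.pcap 'udp src port 53' | top_sources | head -n 20
```

A long list of sources with similar counts would point toward the many-IPs case described above, while one dominant address would point toward a single host you can block.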

Thank you for your detailed answer, I thought as much regarding Hyper-V bandwidth management. I have installed netatop as per your suggestions and have it loaded and waiting for the next occurrence. — omega1, Mar 29 '20 at 02:56
I have installed netatop as per your suggestion and caught the issue, but still cannot make sense of how/where the issue is happening? Any ideas? Thanks! — omega1, Mar 29 '20 at 10:11
@omega1 I have added some follow-up steps to the answer, which may be of some use. — liverwust, Mar 29 '20 at 22:23
Thank you so much for helping out with this, I'm pretty stumped with it! I have done as you suggested and have the above script loaded and will await the next occurrence and will see what I can find in the pcap file. Thanks again. — omega1, Mar 30 '20 at 01:23
@omega1: you might want to add "-c 200000" to your tcpdump call. This will cause tcpdump to exit after collecting 200k packets, rather than running indefinitely after the problem starts. You might also want to check that out.pcap hasn't grown to a huge size in the interim. — liverwust, Apr 01 '20 at 04:35
Thanks, I had already added some max file and rotation to hopefully address this point. It hasn't happened since, so haven't captured anything yet to analyse. Not sure if any remedial work I'd done in the firewall may have captured it, or the issue has just 'gone away'. I'll report back if it does happen with any interesting content in the captured logs. Thanks again for your help. — omega1, Apr 01 '20 at 09:56
You were right, it is a malicious attack. I have updated the original post with a sample line of what was captured using the filter udp.srcport == 53, which showed multiple IP addresses trying to attack a website through my server. I now need to figure out how to block these types of attacks. Thank you for your assistance in helping identify the problem; at least I know what it is now, and just need to try to protect the server. Thank you. — omega1, Apr 03 '20 at 10:38
