Slow ESXi guest performance affecting entire host

Question

I'm running ESXi 5.5 (Build 2068190) on a Dell PowerEdge R220 server with the Intel E3-1220v3 CPU. It also has 16 GB of RAM installed and 2 x 1TB SATA disks running as RAID1 using a Dell PERC H310 controller.

Here's the problem. A few hours ago I noticed one of the guests causing major CPU spikes on the server. The spikes were so intense that the entire host froze, which also affected all other guests on the host. The guest in question only has 1 core assigned to it and is running Debian 7 x64.

Have a look at the attached image below.

[Image: ESXi CPU performance chart]

The lag spikes on the left side of the chart occurred about every other minute and lasted for about a full minute. The longer stop between 22:05 and 22:10 was when I shut the guest down to confirm that it was causing the CPU spikes. At 22:25 I limited the guest CPU to 2 GHz. This stopped the spikes from happening, but now the entire server runs very slowly. Clicking something in the vSphere client takes about 5 seconds to bring up a new window.

The only thing I did before this happened was to change the name of a vSwitch, though I don't know if that's what really caused it. I also made some changes on a different guest, which runs VyOS and acts as a gateway for the other guests, but I fail to see how that could cause it.

And no, I don't have access to the guest at hand because it belongs to a customer. However I know that it only runs apache2, mysql and mailman.

My questions are:

a) Anyone know what is causing this or what I can do to find out what is causing it?

b) I didn't think one guest would be able to affect the entire host and other guests in this way, is this how it is supposed to be?

Thanks in advance, let me know if you need more info.

EDIT: After digging we found out the guest VPS had been compromised and was used as an FTP dump by hackers, which explains the intense traffic (350 GB in a couple of hours). However, it doesn't explain why it affected the host or other guests. Do I need to limit CPU performance by clock rate rather than just number of cores in order to avoid having one guest affect others? Or could it be something different, like the vSwitches (and in turn ESXi) being somehow overloaded with work?

EDIT 2: Turns out it wasn't an FTP dump; rather, they made the server take part in a DDoS attack of some sort. Our ISP called us later saying the amount of traffic had affected their other services / customers, so I'm guessing it was quite a bit of traffic.
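
On the clock-rate question in the first edit, a rough sketch of what a MHz limit amounts to on this host may help. It assumes the Xeon E3-1220 v3 has 4 physical cores at a 3.1 GHz base clock (taken from Intel's published spec, not from the post), and it only does the arithmetic; it does not model how ESXi actually schedules or accounts for the guest. The function name is made up for illustration.

```python
# Assumed host CPU figures for the Xeon E3-1220 v3 (from Intel's published
# spec, not stated in the post): 4 physical cores, 3.1 GHz base clock.
CORES = 4
BASE_MHZ = 3100


def limit_as_core_fraction(limit_mhz: float, core_mhz: float = BASE_MHZ) -> float:
    """Fraction of one physical core that a CPU limit in MHz permits."""
    return limit_mhz / core_mhz


host_capacity_mhz = CORES * BASE_MHZ

# The two limit values mentioned in this thread.
for limit in (2000, 1000):
    frac = limit_as_core_fraction(limit)
    print(f"{limit} MHz limit = {frac:.0%} of one core, "
          f"{limit / host_capacity_mhz:.0%} of the host")
```

Under these assumptions a 2 GHz limit is roughly two thirds of a single core, so a limit of this kind caps the guest below what its one vCPU could otherwise consume.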

What are the full specifications of the GUEST virtual machine? — ewwhite, Mar 04 '15 at 22:04
Can you rule out that this is caused by e.g. a faulty hard disk and just triggered by some action at the guest? — frlan, Mar 04 '15 at 22:14
I did some more testing, apparently the troublesome guest OS (guest A) and the guest OS running the gateway software to the Internet (guest B) have overlapping performance. If I bring guest B down the load of guest A goes down and vice versa. This suggests there is something wrong with the internal networking? — , Mar 04 '15 at 22:27
The guest has 1 core CPU, 1 GB RAM and 20 GB of HDD running Debian 7 x64. As for the other settings I left them at their defaults (for Debian 6 x64 since Debian 7 does not exist in the list). — , Mar 04 '15 at 22:28
As for the HW I don't think anything is wrong with it as normal performance is restored when I bring down the guests (see comment above). I'm not physically near the server so I can't check it right now though. — , Mar 04 '15 at 22:29
I changed the CPU limit back to unlimited. It now shows similar behaviour as before the 2 GHz restriction (meaning 1-minute spikes with 2 minutes in-between spikes). — , Mar 04 '15 at 22:40
This must have something to do with the networking. Whenever the CPU spikes, so does the network. This really wouldn't bother me if it only affected the guest causing it, but this is affecting all guests and even the host so it cannot be used. — , Mar 04 '15 at 22:44
If you aren't using separate NICs for management and VM traffic then that's probably why you'd see latency in the VMware client. Depending on what kind of NICs you have in the host, it might have been spending its time doing IO down the wire instead of offloading it to the NIC. — GregL, Mar 05 '15 at 14:24
Yeah the management NIC is the same as the one for the guests. Also I'd like to change my statement above, it wasn't used as an FTP dump. Rather it was used to take part in a DDoS, sending massive portions of random data to other destinations. In fact, my ISP called me later that day saying we were under attack and that the amount of traffic had affected some of their other customers/services (their router couldn't handle the traffic). So I'm guessing it was quite a lot of traffic. — , Mar 06 '15 at 13:07
If you're sharing the NIC between management and the guests, it raises the question of how you determined there were `Spikes that were so intense the entire host freeze which also affected all other guests on the host.`. Presumably you were trying to talk to them over the network and the guests were unresponsive. A saturated NIC would easily cause the connections to time out. — GregL, Mar 06 '15 at 13:34
Because the spikes only lasted a minute and left a 2-minute gap without spikes in between. Those two minutes were used to monitor the graphs in the vSphere application. I also tried to use the console in the vSphere application during the spikes but it had a hard time even registering the keystrokes on my keyboard. But yes, I did connect to the admin interface over the network, not locally. However the guests were accessed through the console. So anyway, what you're saying is that if I had accessed the console locally (didn't think that was possible?) I wouldn't have noticed these issues? — , Mar 09 '15 at 11:23
My question is more about how you knew the other guests were unresponsive during the spike. I assume you were trying to RDP/SSH into them and weren't able to. This would have the same root issue as what caused you to experience a laggy vSphere client. The NIC was saturated and nothing could get through. — GregL, Mar 10 '15 at 17:09
Well, that is true. It could be the NIC being saturated. However, I did notice in the graphs that during spikes, ALL physical cores were used by the guest that is supposed to have 1 virtual core. Now I understand that virtual CPU cores are different from physical ones, but is this how it is supposed to be? If so, how would I make it so that a guest machine could never use more than 1 physical core at a time, or is it simply not possible? When I limited the guest to use only 1 GHz it was possible to use the vSphere client during the spikes. — , Mar 16 '15 at 16:50
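
As a back-of-the-envelope check on the NIC saturation theory discussed in the comments, the traffic figure from the question can be turned into a line rate. This is only a sketch under two assumptions that are not stated in the thread: that "a couple of hours" means two hours, and that the shared NIC is a 1 Gbit/s port. The function name is made up for illustration.

```python
def average_mbit_per_s(gigabytes: float, hours: float) -> float:
    """Average throughput in Mbit/s for `gigabytes` (decimal GB) sent over `hours`."""
    bits = gigabytes * 1e9 * 8
    seconds = hours * 3600
    return bits / seconds / 1e6


# Assumptions (not in the thread): "a couple of hours" = 2 h, NIC = 1000 Mbit/s.
TRANSFER_GB = 350
DURATION_H = 2
NIC_MBIT = 1000

avg = average_mbit_per_s(TRANSFER_GB, DURATION_H)

# The spikes ran for about 1 minute out of every 3. If all of the traffic fell
# inside the spikes, the rate during a spike was about three times the average.
burst = avg * 3

print(f"average:       {avg:.0f} Mbit/s ({avg / NIC_MBIT:.0%} of line rate)")
print(f"during spikes: {burst:.0f} Mbit/s ({burst / NIC_MBIT:.0%} of line rate)")
```

With these numbers the average comes out near 390 Mbit/s, and the implied rate during a spike exceeds what a 1 Gbit/s port can carry, which would be consistent with the management traffic being starved while a spike was in progress.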
