
I am noticing a strange issue: my Ubuntu web server randomly freezes for a few seconds and then recovers. The server has the following specifications:

- 2 vCores at 2.4 GHz
- 8GB of RAM
- 40GB SSD
- 100 MBit network

I am mainly running the following services on the server:

- NGINX (web server and proxy)
- MySQL
- Varnish

The issue doesn't occur every day, but on the days that it does, it usually happens very frequently (about every 20 seconds). I am running Netdata as a web monitoring tool and New Relic for critical issues.

This is a screenshot of the CPU graph taken from the Netdata dashboard. As you can see, the server stops reporting stats when the freeze occurs. I found that iowait sometimes spikes just before the server freezes, but after reading threads and Googling about high iowait I could not find anything useful, other than that the [jbd2/vda1-8] process is constantly writing to the disk.

When running monitoring tools like top, ps, iotop and htop I do not see any process using excessive amounts of resources, even when the freezing issue occurs.

When logging into the server using the hosting provider's (OVH's) KVM console, I see the following message: `NMI watchdog: BUG: soft lockup CPU#0/1 stuck for 21s! [process]`. Researching that error message didn't turn up much information or a solution either. I am running out of ideas on what could cause these issues, so any help is appreciated.

Rick
  • I have seen this bug when an HDD is close to failure and its internal controller is struggling. This is just a guess for your case, since we are missing details; I would post a SMART report of your disk to add some. – yagmoth555 Dec 13 '16 at 17:50
  • @yagmoth555 I'm running this on a VPS, so I don't think I can run tests on the underlying SSD, or can I? When I run `sudo smartctl -a /dev/vda1` I get the following: `Please specify device type with the -d option.` – Rick Dec 13 '16 at 18:31
  • In a screen session, I would use `vmstat 1` and leave it running. In the VPS provider, they should have graphs that show load of the host and your VM. – Aaron Dec 13 '16 at 18:57
  • @Aaron I see that while the server gets these freezes, the `r` value (processes awaiting CPU time) goes up from 0 to anywhere between 40 and 80, and that the line after that shows `bo` between 100 and 3000. Is there a way that I can further track down the issue with these details? Thanks – Rick Dec 13 '16 at 19:17
  • So 40 to 80 runnable processes waiting for IO. I would then look at `iostat -xhm 5` when this occurs and see which device you are trying to write to. The `await` time would be important to note. Once you have that, you might consider opening a ticket with your VPS provider if this is the only node with this behavior and each node gets the same traffic. There are many other things that could be going on, possibly related to your MySQL shared memory settings or to dirty page flushing. – Aaron Dec 14 '16 at 16:52
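
As a rough illustration of what the `await` figure mentioned above measures, here is a minimal Python sketch that derives an average I/O wait per request from two snapshots of `/proc/diskstats`. The function names are made up for this example, the device name `vda` is assumed from the `jbd2/vda1-8` process in the question, and the calculation is a simplified approximation rather than a reimplementation of `iostat`.

```python
def parse_diskstats(text, device):
    """Extract completed I/O count and time spent on I/O (ms) for one device.

    Expects the contents of /proc/diskstats. After splitting a line:
    [3] reads completed, [6] ms spent reading,
    [7] writes completed, [10] ms spent writing.
    """
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 11 and fields[2] == device:
            return {
                "ios": int(fields[3]) + int(fields[7]),
                "ms": int(fields[6]) + int(fields[10]),
            }
    raise KeyError(f"device {device!r} not found")


def average_await_ms(before, after):
    """Average time per completed request between two snapshots, in ms."""
    ios = after["ios"] - before["ios"]
    if ios == 0:
        return 0.0  # no request completed in the interval
    return (after["ms"] - before["ms"]) / ios


# Usage on a live system (assumption: the virtual disk is called "vda"):
#   import time
#   a = parse_diskstats(open("/proc/diskstats").read(), "vda")
#   time.sleep(5)
#   b = parse_diskstats(open("/proc/diskstats").read(), "vda")
#   print(f"await ~ {average_await_ms(a, b):.1f} ms")
```

A value that jumps from a few milliseconds to hundreds during a freeze would point at the storage layer rather than at a process on the VM.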

2 Answers


Your VPS provider is probably throttling your CPU and disk usage, causing an apparent freeze when the throttling is too severe. Check, via top, whether your CPU steal time (the `st` value) is high during (or just before/after) freezes.
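
If watching top at the right moment is difficult, steal time can also be logged over time. The following is a minimal Python sketch, not a polished tool: it computes the steal percentage from two readings of the aggregate `cpu` line in `/proc/stat`. The function names are hypothetical, and what counts as "high" is left to your judgment.

```python
def read_cpu_times(stat_text):
    """Return the counters of the aggregate 'cpu' line of /proc/stat as ints."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            return [int(value) for value in fields[1:]]
    raise ValueError("no aggregate 'cpu' line found")


def steal_percent(before, after):
    """Share of CPU time stolen by the hypervisor between two samples.

    Field order: user nice system idle iowait irq softirq steal guest guest_nice.
    Only the first eight are summed, because guest time is already
    included in user and nice.
    """
    if len(before) < 8 or len(after) < 8:
        return 0.0  # kernel does not report steal time
    deltas = [b - a for a, b in zip(before[:8], after[:8])]
    total = sum(deltas)
    if total <= 0:
        return 0.0
    return 100.0 * deltas[7] / total


# Usage on a live system, sampling once per second:
#   import time
#   prev = read_cpu_times(open("/proc/stat").read())
#   while True:
#       time.sleep(1)
#       cur = read_cpu_times(open("/proc/stat").read())
#       print(time.strftime("%H:%M:%S"), f"steal {steal_percent(prev, cur):.1f}%")
#       prev = cur
```

Redirecting the output to a file lets you compare the timestamps against the gaps in the Netdata graphs afterwards.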

shodanshok

I cannot write a comment like everyone else, so I have to make do with an "official" answer, although all I have to offer are guesses. :-) Since it is a VPS, i.e. a virtual machine (apparently KVM), I could imagine that your hoster has some ongoing behind-the-scenes infrastructure work or, worse, reliability issues. For your VM, that could mean that

  • the VM is migrated for some reason from one physical server to another (which requires the vCPUs to be stopped for a moment so that the VM state can be transferred over the network). In my experience, an indication of this is that the system clock is off by a few seconds and needs to be corrected by the NTP daemon.
  • the storage on which the virtual disk of your server resides is unreachable for a short time (this, by the way, is most probably not a local SSD but some disk space on a SAN or even an NFS server). Processes would then get stuck on I/O, which usually means that the system load value increases even though the CPU utilization is low.

As I said, these are just guesses, but it might be worth having a talk with your hoster.
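
One way to look for the "high load, low CPU" pattern from the second point without relying on Netdata is to read the kernel's own counters. The Python sketch below parses `/proc/loadavg` and the `procs_running` / `procs_blocked` lines of `/proc/stat`. The function `looks_io_bound` is a made-up heuristic for illustration only (load above the core count while at least one process is blocked on I/O), not an established rule.

```python
def parse_proc_counts(stat_text):
    """Return procs_running and procs_blocked from the contents of /proc/stat."""
    counts = {}
    for line in stat_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] in ("procs_running", "procs_blocked"):
            counts[parts[0]] = int(parts[1])
    return counts


def parse_loadavg(loadavg_text):
    """Return the 1-minute load average from the contents of /proc/loadavg."""
    return float(loadavg_text.split()[0])


def looks_io_bound(load1, blocked, cpu_count):
    """Hypothetical heuristic: load exceeds the number of cores while
    at least one process is blocked in uninterruptible I/O wait."""
    return load1 > cpu_count and blocked > 0


# Usage on a live system (cpu_count=2 matches the 2 vCores in the question):
#   counts = parse_proc_counts(open("/proc/stat").read())
#   load1 = parse_loadavg(open("/proc/loadavg").read())
#   print(load1, counts, looks_io_bound(load1, counts["procs_blocked"], 2))
```

On Linux the load average includes tasks in uninterruptible sleep, which is why stalled storage can push the load up while the CPUs sit mostly idle.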

user415594
  • Could you maybe edit your answer to offer some checks the asker could run to try and identify which of your proposals is more likely? You mention load (which I think is a good lead) - maybe you could share how that could be checked, other than via netdata? – iwaseatenbyagrue Mar 04 '17 at 12:42