8

For the last two weeks we have been having intermittent severe spikes in system CPU usage (shown as %sys), which last for maybe half a minute and lock up most processes, including ssh.

I've been trying to figure this out, but atop doesn't show anything relevant (the system usage it reports for individual processes is insignificant), the spikes are intermittent, and I could not reproduce a spike using any workload for the web application this webserver hosts.

If you have any ideas on how to debug high %sys and (sometimes) %si CPU usage, please share them.

System specs (I don't know if any of this is relevant): dedicated server, CentOS 6, Core i7 950, a consistent 4 to 8 GB of RAM free at any time, hard drives in RAID-1.

Additional info:

  • dmesg output doesn't change between spikes
  • /var/log/messages doesn't change between spikes
  • Here is `cat /proc/vmstat`
  • Here is the output of `mpstat 1` during a typical spike

Update 07.11.11: it looks like a simple reboot restored the system state, and we may never know what caused the disturbance in the first place.

Mark
  • You could put some files on a webpage from the time when you see the high load or locks: `screenshot of top`, `dmesg` and/or `/var/log/syslog`, `/proc/vmstat`. You could remove sensitive data first if needed. – ott-- Nov 03 '11 at 17:45
  • @ott-- added more info to the first post. – Mark Nov 03 '11 at 19:20
  • Apparently, I can't add any more links to the post without being considered a spammer; here is the output of [iostat -x 5](http://pastie.org/pastes/2806116/text?key=ce5xc0ll22uylbl1igwdw) during a typical spike. – Mark Nov 03 '11 at 19:31
  • Do you run a BTRFS filesystem on Linux 3.0? – mailq Nov 03 '11 at 22:01
  • @mailq: No, `Linux 2.6.32-71.29.1.el6.x86_64 #1 SMP Mon Jun 27 19:49:27 BST 2011 x86_64 x86_64 x86_64 GNU/Linux`, file systems are all ext3. – Mark Nov 04 '11 at 05:18
  • Is the server running any Java processes? Some reports of similar issues (including from me): http://forums.fedoraforum.org/showthread.php?t=285246 – Raman Dec 15 '12 at 16:15

5 Answers

4

I know this thread is really old, and I know you are already aware of this, but regarding %sys: if the cycles are spent in %system, then much of the execution is happening in lower-level code, i.e. it might be an issue on the kernel side. If this issue is reproducible, please collect the output of:

echo t > /proc/sysrq-trigger

Then check the system messages (/var/log/messages or /var/log/syslog) to see whether any thread is using a lot of system CPU time.
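A minimal sketch of capturing and mining that dump (assuming CentOS 6 defaults, where kernel messages land in /var/log/messages; the grep pattern and context size are guesses to adjust):

    # Make sure the sysrq interface is enabled (it may be off by default)
    echo 1 > /proc/sys/kernel/sysrq
    # Dump stack traces of all tasks into the kernel log
    echo t > /proc/sysrq-trigger
    # Runnable tasks are printed with a "running task" marker followed by
    # their stack trace; pull those out with some trailing context
    dmesg | grep -A 15 'running task'

Running this during a spike and again afterwards, then comparing which tasks show up as runnable, should narrow down the culprit.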

jjmontes
Prashant Lakhera
1

It sounds stupid, but a reboot helped, and we may never know what caused the spikes in the first place.

Thank you for the responses, though.

Mark
  • For the next one, `sar -I XALL 1 | grep -v 0.00` will tell you which interrupts are getting that soft system time. – theist Jun 26 '13 at 13:47
1

On CentOS 6.2 and 6.3, disable transparent huge page support:

echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled
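A quick way to verify the change took effect, plus one hedged option for persisting it across reboots (both assume the RHEL/CentOS 6 sysfs path used above):

    # The active value is shown in brackets
    cat /sys/kernel/mm/redhat_transparent_hugepage/enabled
    # One common approach to persistence: re-run the echo at boot via rc.local
    echo 'echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled' >> /etc/rc.local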
John Gardeniers
Andrija
0

High %si suggests a high interrupt rate (%si is the time spent in softirq handlers, AFAIK). Therefore my first guess would be that the server's network interface is being hammered.
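One hedged way to test that guess (this assumes the sysstat package for sar; /proc/softirqs exists on 2.6.31+ kernels, so it should be present on this 2.6.32 box):

    # Per-CPU softirq counters; NET_RX/NET_TX climbing fast points at the NIC
    watch -n 1 'grep -E "NET_RX|NET_TX" /proc/softirqs'
    # Per-interface packets/s and bytes/s, sampled every second
    sar -n DEV 1
    # Hardware interrupt counts per device, for comparison
    watch -n 1 'cat /proc/interrupts'

If a spike coincides with a burst in packet rate, the hypothesis holds; if the counters stay flat, look elsewhere.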

janneb
  • How can I prove or disprove this hypothesis? It does not seem so, but it might be. – Mark Nov 03 '11 at 19:25
  • That %si corresponds to time spent in softirq handlers? Well, if you don't believe me or some other documentation you might find, you can read the kernel source code. – janneb Nov 04 '11 at 07:04
  • Err, no, the hypothesis about the network interface being hammered. – Mark Nov 04 '11 at 07:09
  • Ah, you can check e.g. /proc/softirqs; unfortunately I don't know of any tool that displays individual softirqs over time. Alternatively, run something like `dstat -ar --socket --tcp` when you get a spike and post the results. – janneb Nov 04 '11 at 08:33
  • Thanks a lot for the dstat tool, it's kinda cool. Unfortunately, it's still not obvious what causes the spikes; see for yourself: http://pastie.org/pastes/2821888/text?key=yqtv1iulh9nyhgahod1eq – Mark Nov 06 '11 at 19:58
-1

Many factors contribute to high %sys usage, such as logins, system calls, context switches (both thread and process), I/O, and even socket data being copied from kernel mode to user mode. I suggest using sar, vmstat, and iostat to check these.

Furthermore, it would be useful to find out which process causes the high %sys usage during a spike. gdb is helpful here: find the process, attach gdb to it, and you will see what that process is doing at that moment. The only thing to note is that this requires debug information to be embedded in the program.
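A rough sketch of that workflow (the ps-based selection is just one heuristic for finding the busiest process; adapt as needed):

    # Pick the PID currently using the most CPU (NR==2 skips the ps header)
    pid=$(ps -eo pid,pcpu,comm --sort=-pcpu | awk 'NR==2 {print $1}')
    # Attach non-interactively and print a backtrace of every thread; useful
    # symbols require the matching debuginfo packages to be installed
    gdb -p "$pid" -batch -ex 'thread apply all bt'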