8

For the last two weeks we have been having intermittent severe spikes in system CPU usage (shown as %sys), which last for maybe half a minute and lock up most processes, including ssh.

I've been trying to figure this out, but atop doesn't show anything relevant (the system usage it reports for individual processes is insignificant), the spikes are intermittent, and I could not reproduce a spike using any workload for the web application this webserver hosts.

If you have any ideas on how to debug high %sys and (sometimes) %si CPU usage, please share them.

System specs (I don't know if any of this is relevant): dedicated server, CentOS 6, Core i7 950, a consistent 4 to 8 GB of RAM free at any time, hard drives in RAID-1.

Additional info:

  • dmesg output doesn't change between spikes
  • /var/log/messages doesn't change between spikes
  • Here is `cat /proc/vmstat`
  • Here is the output of `mpstat 1` during a typical spike

Update 07.11.11: it looks like a simple reboot restored the system state, and we may never know what caused the disturbance in the first place.

Mark
  • You could put some files on a webpage from the time when you see the high load or locks: `screenshot of top`, `dmesg` and/or `/var/log/syslog`, `/proc/vmstat`. You could remove sensitive data first if needed. – ott-- Nov 03 '11 at 17:45
  • @ott-- added more info to the first post. – Mark Nov 03 '11 at 19:20
  • Apparently, I can't add any more links to the post without being considered a spammer; here is the output of [iostat -x 5](http://pastie.org/pastes/2806116/text?key=ce5xc0ll22uylbl1igwdw) during a typical spike. – Mark Nov 03 '11 at 19:31
  • Do you run a BTRFS filesystem on Linux 3.0? – mailq Nov 03 '11 at 22:01
  • @mailq: No, `Linux 2.6.32-71.29.1.el6.x86_64 #1 SMP Mon Jun 27 19:49:27 BST 2011 x86_64 x86_64 x86_64 GNU/Linux`, file systems are all ext3. – Mark Nov 04 '11 at 05:18
  • Is the server running any Java processes? Some reports of similar issues (including from me): http://forums.fedoraforum.org/showthread.php?t=285246 – Raman Dec 15 '12 at 16:15

5 Answers

4

I know this thread is really old, and I know you are already aware of this, but regarding %sys: if the cycles are spent in %system, then much of the execution is happening in lower-level code, i.e. it might be an issue on the kernel side. If this issue is reproducible, please collect the output of:

echo t > /proc/sysrq-trigger

Then check the system messages (/var/log/messages or /var/log/syslog) to see whether any thread is using a lot of system CPU time.
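A minimal sketch of capturing and mining that dump (assuming CentOS 6 defaults, where kernel messages land in /var/log/messages; the grep pattern and context size are guesses to adjust):

    # Make sure the sysrq interface is enabled (it may be off by default)
    echo 1 > /proc/sys/kernel/sysrq
    # Dump stack traces of all tasks into the kernel log
    echo t > /proc/sysrq-trigger
    # Runnable tasks are printed with a "running task" marker followed by
    # their stack trace; pull those out with some trailing context
    dmesg | grep -A 15 'running task'

Running this during a spike and again afterwards, then comparing which tasks show up as runnable, should narrow down the culprit.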

jjmontes
Prashant Lakhera
1

It sounds stupid, but a reboot helped, and we may never know what caused the spikes in the first place.

Thank you for the responses, though.

Mark
  • For the next one, `sar -I XALL 1 | grep -v 0.00` will tell you which interrupts are getting that soft system time. – theist Jun 26 '13 at 13:47
1

On CentOS 6.2 and 6.3, disable transparent huge page support:

echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled
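A quick way to verify the change took effect, plus one hedged option for persisting it across reboots (both assume the RHEL/CentOS 6 sysfs path used above):

    # The active value is shown in brackets
    cat /sys/kernel/mm/redhat_transparent_hugepage/enabled
    # One common approach to persistence: re-run the echo at boot via rc.local
    echo 'echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled' >> /etc/rc.local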
John Gardeniers
Andrija
0

High %si suggests a high interrupt rate (%si is the time spent in softirq handlers, AFAIK). Therefore my first guess would be that the server's network interface is being hammered.
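One hedged way to test that guess (this assumes the sysstat package for sar; /proc/softirqs exists on 2.6.31+ kernels, so it should be present on this 2.6.32 box):

    # Per-CPU softirq counters; NET_RX/NET_TX climbing fast points at the NIC
    watch -n 1 'grep -E "NET_RX|NET_TX" /proc/softirqs'
    # Per-interface packets/s and bytes/s, sampled every second
    sar -n DEV 1
    # Hardware interrupt counts per device, for comparison
    watch -n 1 'cat /proc/interrupts'

If a spike coincides with a burst in packet rate, the hypothesis holds; if the counters stay flat, look elsewhere.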

janneb
  • How can I prove or disprove this hypothesis? It does not seem so, but it might be. – Mark Nov 03 '11 at 19:25
  • That %si corresponds to time spent in softirq handlers? Well, if you don't believe me or some other documentation you might find, you can read the kernel source code. – janneb Nov 04 '11 at 07:04
  • Err, no, the hypothesis about the network interface being hammered. – Mark Nov 04 '11 at 07:09
  • Ah, you can check e.g. /proc/softirqs; unfortunately I don't know of any tool that displays individual softirqs over time. Alternatively, run something like `dstat -ar --socket --tcp` when you get a spike and post the results. – janneb Nov 04 '11 at 08:33
  • Thanks a lot for the dstat tool, it's kinda cool. Unfortunately, it's still not obvious what causes the spikes; see for yourself: http://pastie.org/pastes/2821888/text?key=yqtv1iulh9nyhgahod1eq – Mark Nov 06 '11 at 19:58
-1

Many factors contribute to high %sys usage, such as logins, system calls, context switches (both thread and process), I/O, and even socket data being copied from kernel mode to user mode. I suggest using sar, vmstat, and iostat to check these.

Furthermore, it would be useful to find out which process causes the high %sys usage during a spike. gdb is helpful here: find the process, attach gdb to it, and you will see what that process is doing at that moment. The only thing to note is that this requires debug information to be embedded in the program.
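A rough sketch of that workflow (the ps-based selection is just one heuristic for finding the busiest process; adapt as needed):

    # Pick the PID currently using the most CPU (NR==2 skips the ps header)
    pid=$(ps -eo pid,pcpu,comm --sort=-pcpu | awk 'NR==2 {print $1}')
    # Attach non-interactively and print a backtrace of every thread; useful
    # symbols require the matching debuginfo packages to be installed
    gdb -p "$pid" -batch -ex 'thread apply all bt'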