I'm having difficulty with my Python app going to 100% 'system' CPU time across multiple threads for up to 10+ seconds. My (limited) understanding is that 'system' CPU time is time spent in the kernel. strace
shows what kernel calls are being made and (optionally) how long they take. I've managed to run strace on one thread during one of these pauses but the strace output just has a gap for that period and the calls either side of that gap are shown with a very short time.
How do I run strace properly to figure out what kernel call is happening?
In my Python app I use a separate thread and psutil
to check for any threads with high system CPU and then log them out. Once I see one I use the following command line to start strace:
strace -tT -o ~/strace.log -p <PID of thread>
Logging from my app (uses psutil
under the hood):
2015-05-19 08:05:47,025 WARNING:log.cpu Logging paused for 0.34s to 08:05:46.985768 - longer than expected - other threads probably delayed as well
2015-05-19 08:05:47,026 WARNING:log.cpu This process utilization (CPUs) user: 0.1, system: 52.4 | OS (%) scputimes(user=0.4, nice=0.0, system=80.2, idle=19.3, iowait=0.0, irq=0.0, softirq=0.0, steal=0.0, guest=0.0, guest_nice=0.0), memory used: 16.6, threads: 101, fds: 70
2015-05-19 08:05:47,027 WARNING:log.cpu Thread 47441 utilization (CPUs) user: 0.0, system: 1.0
2015-05-19 08:05:47,512 WARNING:log.cpu This process utilization (CPUs) user: 0.1, system: 61.4 | OS (%) scputimes(user=0.5, nice=0.0, system=99.4, idle=0.1, iowait=0.0, irq=0.0, softirq=0.0, steal=0.0, guest=0.0, guest_nice=0.0), memory used: 16.6, threads: 101, fds: 70
2015-05-19 08:05:47,516 WARNING:log.cpu Thread 47441 utilization (CPUs) user: 0.0, system: 0.9
2015-05-19 08:05:47,995 WARNING:log.cpu This process utilization (CPUs) user: 0.3, system: 63.7 | OS (%) scputimes(user=0.8, nice=0.0, system=99.2, idle=0.0, iowait=0.0, irq=0.0, softirq=0.0, steal=0.0, guest=0.0, guest_nice=0.0), memory used: 16.6, threads: 101, fds: 70
2015-05-19 08:05:47,995 WARNING:log.cpu Thread 47441 utilization (CPUs) user: 0.0, system: 0.8
2015-05-19 08:05:48,495 WARNING:log.cpu This process utilization (CPUs) user: 0.3, system: 63.8 | OS (%) scputimes(user=0.7, nice=0.0, system=99.3, idle=0.0, iowait=0.0, irq=0.0, softirq=0.0, steal=0.0, guest=0.0, guest_nice=0.0), memory used: 16.6, threads: 101, fds: 70
2015-05-19 08:05:48,496 WARNING:log.cpu Thread 47441 utilization (CPUs) user: 0.0, system: 0.8
2015-05-19 08:05:52,140 WARNING:log.cpu Logging paused for 3.12s to 08:05:52.103363 - longer than expected - other threads probably delayed as well
2015-05-19 08:05:52,153 WARNING:log.cpu This process utilization (CPUs) user: 0.0, system: 63.6 | OS (%) scputimes(user=0.3, nice=0.0, system=98.5, idle=1.2, iowait=0.0, irq=0.0, softirq=0.0, steal=0.0, guest=0.0, guest_nice=0.0), memory used: 16.6, threads: 101, fds: 70
2015-05-19 08:05:52,180 WARNING:log.cpu Thread 47441 utilization (CPUs) user: 0.0, system: 1.0
strace log during the same period on that thread ID 47441:
08:05:46 futex(0x7f7b46b305ac, FUTEX_WAKE, 1) = 1 <0.000030>
08:05:46 sched_yield() = 0 <0.000016>
08:05:46 sched_yield() = 0 <0.000017>
08:05:46 sched_yield() = 0 <0.000021>
08:05:46 sched_yield() = 0 <0.000053>
08:05:46 sched_yield() = 0 <0.000023>
08:05:46 sched_yield() = 0 <0.000020>
08:05:46 sched_yield() = 0 <0.000019>
08:05:46 sched_yield() = 0 <0.000021>
08:05:46 futex(0x7f7b46b3202c, FUTEX_WAIT, 2, NULL) = 0 <0.000574>
08:05:46 futex(0x7f7b46b30f2c, FUTEX_WAKE, 1) = 1 <0.000019>
08:05:46 futex(0x7f7b46b3072c, FUTEX_WAKE, 1) = 1 <0.000028>
08:05:46 sched_yield() = 0 <0.000022>
08:05:46 futex(0x7f7b46b3202c, FUTEX_WAIT, 2, NULL) = 0 <0.000305>
08:05:46 futex(0x7f7b46b31fac, FUTEX_WAKE, 1) = 1 <0.000119>
08:05:46 futex(0x7f7b46b319ac, FUTEX_WAKE, 1) = 1 <0.000073>
08:05:46 futex(0x7f7b46b3202c, FUTEX_WAIT, 2, NULL) = 0 <0.000474>
08:05:46 futex(0x7f7b46b31f2c, FUTEX_WAKE, 1) = 1 <0.000094>
08:05:46 futex(0x7f7b46b30dac, FUTEX_WAKE, 1) = 1 <0.000096>
08:05:46 futex(0x7f7b46b3202c, FUTEX_WAIT, 2, NULL) = 0 <0.001139>
08:05:46 futex(0x7f7b46b31eac, FUTEX_WAKE, 1) = 1 <0.000156>
08:05:46 futex(0x7f7b46b303ac, FUTEX_WAKE, 1) = 1 <0.000059>
08:05:52 sched_yield() = 0 <0.000081>
08:05:52 sched_yield() = 0 <0.000050>
08:05:52 futex(0x7f7b46b3202c, FUTEX_WAIT, 2, NULL) = 0 <0.001603>
08:05:52 futex(0x7f7b46b31d2c, FUTEX_WAKE, 1) = 1 <0.000318>
08:05:52 futex(0x7f7b46b30f2c, FUTEX_WAKE, 1) = 1 <0.000259>
08:05:52 sched_yield() = 0 <0.000048>
08:05:52 sched_yield() = 0 <0.000060>
08:05:52 sched_yield() = 0 <0.000203>
08:05:52 futex(0x7f7b46b3202c, FUTEX_WAIT, 2, NULL) = 0 <0.000148>
08:05:52 sched_yield() = 0 <0.000036>
08:05:52 sched_yield() = 0 <0.000164>
08:05:52 futex(0x7f7b46b3202c, FUTEX_WAIT, 2, NULL) = 0 <0.003125>
08:05:52 sched_yield() = 0 <0.000053>
08:05:52 sched_yield() = 0 <0.000052>
08:05:52 sched_yield() = 0 <0.000023>
08:05:52 futex(0x7f7b46b3202c, FUTEX_WAIT, 2, NULL) = 0 <0.002264>
08:05:52 futex(0x7f7b46b3122c, FUTEX_WAKE, 1) = 1 <0.000119>
Note how strace is logging 100's of entries per second and then there's a 6 second gap of nothing. But no calls are shown as taking more than a couple milliseconds. So how come that thread showed 100% 'system' CPU time?
I've used top
to confirm that psutil
is reporting correct numbers. And I'm sure I've traced the correct thread because the 6 second gap corresponds exactly to the high CPU time recorded by psutil.
Help also appreciated with what other tools could I use to figure what is causing this kernel/system time?
Environment: - Python 3.4.3 (issue seen on Python 2.7 as well) - CentOS 7 kernel 3.10.0-229.1.2.el7.x86_64 (problem seen on Ubuntu 12 LTS as well) - 64 core AMD machine