I'm investigating a strange situation with extreme load (but low CPU, IOWAIT, SYS, disk/network activity) on one of our linux servers on which many oracle database instances are running.
When having some cpu load on the machine, everything looks quite normal, as we can see in vmstat:
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
2 0 193312 1959440 403368 38270032 0 0 2504 25 4660 3916 16 5 78 1 0
1 0 193312 1899352 403560 38277932 0 0 2198 1720 5175 4675 14 5 79 2 0
1 3 193312 1878992 403584 38279024 0 0 13900 14261 2054 2704 6 1 93 1 0
As soon as the machine is idling, it shows a strange behavior:
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
102 0 173164 1909664 408596 38271676 0 0 1124 1094 2312 2345 4 3 91 3 0
15 5 173164 1909968 408620 38271904 0 0 256 677 472 843 2 0 98 0 0
94 0 173164 1858988 408664 38311548 0 0 292 161 1786 1600 5 2 92 1 0
51 0 173164 1859832 408664 38311856 0 0 849 957 1044 1436 2 1 95 2 0
200 12 173164 1859708 408680 38311840 0 0 96 536 559 636 1 0 99 0 0
60 0 173164 1844480 408704 38328308 0 0 288 653 632 830 1 1 98 0 0
124 0 173164 1845340 408712 38329748 0 0 144 947 1746 1926 3 2 95 1 0
48 1 173164 1850432 408752 38329792 0 0 268 844 1783 1862 2 3 94 1 0
509 0 173164 1882880 408752 38329832 0 0 60 12 1893 1726 2 2 95 1 0
125 0 173164 1884208 408768 38329868 0 0 160 816 1918 2426 4 1 95 0 0
As yo can see in the first column, we have a high run queue length, what sar -q 1
confirms:
04:18:29 PM runq-sz plist-sz ldavg-1 ldavg-5 ldavg-15
04:18:30 PM 3 1288 15.13 9.40 11.92
04:18:31 PM 1 1298 15.13 9.40 11.92
04:18:32 PM 34 1298 15.13 9.40 11.92
04:18:34 PM 3 1298 15.28 9.52 11.95
04:18:35 PM 64 1302 15.28 9.52 11.95
04:18:36 PM 5 1302 15.28 9.52 11.95
04:18:37 PM 0 1302 15.28 9.52 11.95
04:18:38 PM 62 1302 15.28 9.52 11.95
04:18:39 PM 13 1302 17.26 10.03 12.10
04:18:40 PM 3 1303 17.26 10.03 12.10
04:18:41 PM 90 1302 17.26 10.03 12.10
04:18:42 PM 7 1302 17.26 10.03 12.10
04:18:43 PM 17 1302 16.12 9.91 12.05
04:18:44 PM 400 1302 16.12 9.91 12.05
As a consequence, load average is looking very scary, too (using uptime
):
16:21:43 up 16 days, 2:16, 2 users, load average: 78.91, 28.88, 18.29
As soon as there's some cpu load, everything seems to normalize:
15:18:27 up 16 days, 1:13, 1 user, load average: 0.95, 1.37, 1.81
ps r -A
shows nothing interesting (I suspect many processes in R
or D
state - but this is not the case):
PID TTY STAT TIME COMMAND
28947 pts/0 R+ 0:00 ps r -A
I can rule out many short-living processes as a cause.
Questions:
- What can cause such a behavior?
- How to drill down this issue further?