
I'm investigating a strange situation on one of our Linux servers, which runs many Oracle database instances: the load is extreme, yet CPU usage, I/O wait, system time, and disk/network activity are all low.

When there is some CPU load on the machine, everything looks quite normal, as we can see in vmstat:

 procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 2  0 193312 1959440 403368 38270032    0    0  2504    25 4660 3916 16  5 78  1  0
 1  0 193312 1899352 403560 38277932    0    0  2198  1720 5175 4675 14  5 79  2  0
 1  3 193312 1878992 403584 38279024    0    0 13900 14261 2054 2704  6  1 93  1  0


As soon as the machine is idle, it shows strange behavior:

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
102  0 173164 1909664 408596 38271676    0    0  1124  1094 2312 2345  4  3 91  3  0
15  5 173164 1909968 408620 38271904    0    0   256   677  472  843  2  0 98  0  0
94  0 173164 1858988 408664 38311548    0    0   292   161 1786 1600  5  2 92  1  0
51  0 173164 1859832 408664 38311856    0    0   849   957 1044 1436  2  1 95  2  0
200 12 173164 1859708 408680 38311840    0    0    96   536  559  636  1  0 99  0  0
60  0 173164 1844480 408704 38328308    0    0   288   653  632  830  1  1 98  0  0
124  0 173164 1845340 408712 38329748    0    0   144   947 1746 1926  3  2 95  1  0
48  1 173164 1850432 408752 38329792    0    0   268   844 1783 1862  2  3 94  1  0
509  0 173164 1882880 408752 38329832    0    0    60    12 1893 1726  2  2 95  1  0
125  0 173164 1884208 408768 38329868    0    0   160   816 1918 2426  4  1 95  0  0


As you can see in the first column, we have a high run queue length, which sar -q 1 confirms:

04:18:29 PM   runq-sz  plist-sz   ldavg-1   ldavg-5  ldavg-15
04:18:30 PM         3      1288     15.13      9.40     11.92
04:18:31 PM         1      1298     15.13      9.40     11.92
04:18:32 PM        34      1298     15.13      9.40     11.92
04:18:34 PM         3      1298     15.28      9.52     11.95
04:18:35 PM        64      1302     15.28      9.52     11.95
04:18:36 PM         5      1302     15.28      9.52     11.95
04:18:37 PM         0      1302     15.28      9.52     11.95
04:18:38 PM        62      1302     15.28      9.52     11.95
04:18:39 PM        13      1302     17.26     10.03     12.10
04:18:40 PM         3      1303     17.26     10.03     12.10
04:18:41 PM        90      1302     17.26     10.03     12.10
04:18:42 PM         7      1302     17.26     10.03     12.10
04:18:43 PM        17      1302     16.12      9.91     12.05
04:18:44 PM       400      1302     16.12      9.91     12.05

As a consequence, the load average looks very scary, too (from uptime):

16:21:43 up 16 days,  2:16,  2 users,  load average: 78.91, 28.88, 18.29

As soon as there is some CPU load, everything seems to normalize:

15:18:27 up 16 days,  1:13,  1 user,  load average: 0.95, 1.37, 1.81

ps r -A shows nothing interesting (I suspected many processes in R or D state, but that is not the case):

  PID TTY      STAT   TIME COMMAND
28947 pts/0    R+     0:00 ps r -A

I can also rule out many short-lived processes as a cause.
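To rule that out, one can sample the kernel's runnable-thread counter at a finer interval than vmstat offers. A minimal sketch (the 0.2-second interval is an arbitrary choice):

while true; do
    # the fourth field of /proc/loadavg is "running/total" scheduling entities (threads)
    echo "$(date +%T) $(cut -d' ' -f4 /proc/loadavg)"
    sleep 0.2
done

If short-lived processes were the cause, the running count here would spike even while individual ps snapshots look empty.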

Questions:

  • What can cause such behavior?
  • How can I drill down into this issue further?
MRalwasser

2 Answers


There is a bug in recent kernel versions (after 2.6.32) that causes the load average to increase dramatically when the system is idle, due to changes in the load-average calculation method.

For more information, you can check the official Red Hat post.

There is also an open case in Bugzilla.
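If you want a quick sanity check on whether a given machine falls into the affected range, comparing the running kernel release is a reasonable first step (a sketch; the exact affected builds are listed in the Red Hat post):

# print the running kernel release; the bug affects kernels newer than 2.6.32
uname -r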


This is more of an FYI (as I can't comment yet): the run queue counts threads, not processes. To look for runnable and uninterruptibly sleeping threads using ps, you can use:

ps -eLo stat,pid,user,command | egrep "^STAT|^D|^R" 
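
If you just want a per-state summary rather than the full listing, the same thread-level output can be aggregated (a small variation on the command above):

# count threads per state, most common state first
ps -eLo stat= | sort | uniq -c | sort -rn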
Nathan Webb