I have an xlarge instance in AWS running 9 Tomcats with heaps from 256M to 4G. On Ubuntu 10.04 the box sporadically hangs for a few hours with a huge run queue (30-40) and nothing on CPU, then recovers. I suspected GC, but reproduced it both with and without CMS GC.
After upgrading to 10.10, the machine goes into 100% wait within a couple of hours of starting, again with no processes on CPU. Here is the output from top:
top - 18:33:44 up 3:11, 2 users, load average: 26.99, 26.80, 25.82
Tasks: 126 total, 1 running, 125 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.0%us, 0.0%sy, 0.0%ni, 0.0%id,100.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 15373736k total, 15174780k used, 198956k free, 51288k buffers
Swap: 0k total, 0k used, 0k free, 6208956k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
5910 tomcat6 20 0 746m 361m 9872 S 0 2.4 2:01.32 java
10147 tomcat6 20 0 919m 173m 9.8m S 0 1.2 0:22.60 java
12328 ubuntu 20 0 19276 1320 968 R 0 0.0 0:01.41 top
1 root 20 0 23864 2012 1300 S 0 0.0 0:00.38 init
2 root 20 0 0 0 0 S 0 0.0 0:00.00 kthreadd
...
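Linux counts tasks in uninterruptible sleep (state D) toward the load average, so a load of ~27 with nothing on CPU suggests many threads are stuck in D state. A minimal sketch of one way to list them, assuming a Linux /proc filesystem. parse_stat and blocked_threads are hypothetical helper names, and the per-thread wchan file may be empty or unreadable depending on kernel and permissions:

```python
import os


def parse_stat(text):
    """Return (pid, comm, state) from the contents of a /proc/.../stat file."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')' instead of on whitespace.
    lparen = text.index("(")
    rparen = text.rindex(")")
    pid = int(text[:lparen])
    comm = text[lparen + 1:rparen]
    state = text[rparen + 1:].split()[0]
    return pid, comm, state


def blocked_threads(proc_root="/proc"):
    """List (pid, tid, comm, wchan) for every thread currently in state D."""
    found = []
    pids = [p for p in os.listdir(proc_root) if p.isdigit()]
    for pid in sorted(pids, key=int):
        task_dir = os.path.join(proc_root, pid, "task")
        try:
            tids = [t for t in os.listdir(task_dir) if t.isdigit()]
        except OSError:
            continue  # process exited while we were scanning
        for tid in sorted(tids, key=int):
            try:
                with open(os.path.join(task_dir, tid, "stat")) as f:
                    _, comm, state = parse_stat(f.read())
            except (OSError, ValueError):
                continue  # thread exited or stat was unreadable
            if state != "D":
                continue
            # wchan names the kernel function the thread is sleeping in,
            # when the kernel exposes it; fall back to "?" otherwise.
            wchan = "?"
            try:
                with open(os.path.join(task_dir, tid, "wchan")) as f:
                    wchan = f.read().strip() or "?"
            except OSError:
                pass
            found.append((int(pid), int(tid), comm, wchan))
    return found
```

Calling blocked_threads() with no arguments scans the real /proc. If most of the listed threads belong to the java processes and share the same wchan, that would point at one common I/O path rather than at the JVMs themselves.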
Nothing useful in the GC log (on larger instances, with MarkSweep, a major GC occurs every 5 min and takes ~4s, incremental collections complete in .1 - .2s, and there is plenty of free memory in all generations).
Here is dstat output:
----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai hiq siq| read writ| recv send| in out | int csw
5 1 51 43 0 0| 63k 512k| 0 0 | 0 0 | 435 401
0 0 0 100 0 0| 0 0 | 52B 834B| 0 0 | 185 315
0 0 0 100 0 0| 0 0 |4997B 14k| 0 0 | 247 360
0 0 0 100 0 0| 0 0 | 52B 354B| 0 0 | 146 318
0 0 0 100 0 0| 0 0 | 52B 354B| 0 0 | 149 314
0 0 0 100 0 0| 0 0 | 52B 354B| 0 0 | 145 318
0 0 0 100 0 0| 0 0 |4997B 14k| 0 0 | 227 345
0 0 0 100 0 0| 0 0 | 52B 354B| 0 0 | 158 325
0 0 0 100 0 0| 0 0 | 52B 354B| 0 0 | 160 306
0 0 0 100 0 0| 0 0 | 52B 354B| 0 0 | 148 319
0 0 0 100 0 0| 0 0 |4619B 14k| 0 0 | 224 353
When wait started going through the roof, the box was finishing downloading/parsing a bunch of large files from S3 and writing them locally to disk (instance store). A thread dump (taken from jconsole; kill -3 on the box hangs) shows a single thread blocked writing to disk.
I am lost. Which rock should I turn over next? What may be going on here?
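Since the stall lines up with heavy local writes, one thing worth sampling is how much page cache is waiting to be flushed. A minimal sketch, assuming the Dirty and Writeback fields of /proc/meminfo are available; parse_meminfo and writeback_backlog are hypothetical helper names:

```python
def parse_meminfo(text):
    """Map each /proc/meminfo field name to its numeric value.

    Most fields are reported in kB; a few (e.g. HugePages_Total) are counts.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def writeback_backlog(text):
    """Return the Dirty and Writeback values (kB), defaulting to 0 if absent."""
    info = parse_meminfo(text)
    return {key: info.get(key, 0) for key in ("Dirty", "Writeback")}
```

Feeding it the contents of /proc/meminfo every few seconds during a hang would show whether Dirty stays large while Writeback never drains, which would be consistent with writes to the instance store stalling below the filesystem. That interpretation is an assumption, not something the output above confirms.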