I have an xlarge instance in AWS running 9 Tomcats with heaps ranging from 256M to 4G. On Ubuntu 10.04 the box sporadically hangs for a few hours with a huge run queue (30-40) and nothing on CPU, then recovers. I suspected GC, but reproduced the problem both with and without the CMS collector.

After upgrading to 10.10, the machine goes to 100% I/O wait within a couple of hours of starting, again with no processes on CPU. Here is the output from top:

top - 18:33:44 up  3:11,  2 users,  load average: 26.99, 26.80, 25.82
Tasks: 126 total,   1 running, 125 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.0%us,  0.0%sy,  0.0%ni,  0.0%id,100.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  15373736k total, 15174780k used,   198956k free,    51288k buffers
Swap:        0k total,        0k used,        0k free,  6208956k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                       
 5910 tomcat6   20   0  746m 361m 9872 S    0  2.4   2:01.32 java                                                                           
10147 tomcat6   20   0  919m 173m 9.8m S    0  1.2   0:22.60 java                                                                           
12328 ubuntu    20   0 19276 1320  968 R    0  0.0   0:01.41 top                                                                            
    1 root      20   0 23864 2012 1300 S    0  0.0   0:00.38 init                                                                           
    2 root      20   0     0    0    0 S    0  0.0   0:00.00 kthreadd    
...

There is nothing useful in the GC logs (on the larger instances, with MarkSweep, a major GC occurs every 5 minutes and takes ~4s; incremental collections complete in 0.1-0.2s; there is plenty of free memory in all generations).
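For reference, GC logging is enabled with roughly the following flags (the exact heap sizes and log path here are illustrative, not the actual command lines of all 9 instances):

```shell
# Illustrative JVM flags for one of the Tomcats -- actual values vary per instance.
java -Xms256m -Xmx4g \
     -XX:+UseConcMarkSweepGC \
     -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
     -Xloggc:/var/log/tomcat6/gc.log \
     ...
```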

Here is dstat output:

----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw 
  5   1  51  43   0   0|  63k  512k|   0     0 |   0     0 | 435   401 
  0   0   0 100   0   0|   0     0 |  52B  834B|   0     0 | 185   315 
  0   0   0 100   0   0|   0     0 |4997B   14k|   0     0 | 247   360 
  0   0   0 100   0   0|   0     0 |  52B  354B|   0     0 | 146   318 
  0   0   0 100   0   0|   0     0 |  52B  354B|   0     0 | 149   314 
  0   0   0 100   0   0|   0     0 |  52B  354B|   0     0 | 145   318 
  0   0   0 100   0   0|   0     0 |4997B   14k|   0     0 | 227   345 
  0   0   0 100   0   0|   0     0 |  52B  354B|   0     0 | 158   325 
  0   0   0 100   0   0|   0     0 |  52B  354B|   0     0 | 160   306 
  0   0   0 100   0   0|   0     0 |  52B  354B|   0     0 | 148   319 
  0   0   0 100   0   0|   0     0 |4619B   14k|   0     0 | 224   353

At the time the wait started going through the roof, the application was finishing downloading/parsing a batch of large files from S3 and writing them locally to disk (instance store). A thread dump (taken via JConsole; `kill -3` on the box just hangs) shows a single thread blocked writing to disk.
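In case it helps, here is what I can run to see what is actually stuck. Tasks in uninterruptible sleep (state D) count toward the run queue without using CPU, which matches the symptoms; this is a generic sketch, not output from the box:

```shell
# Print the header plus every task in uninterruptible sleep (state D).
# wchan shows the kernel function each task is blocked in.
ps -eo pid,state,wchan:32,comm | awk 'NR == 1 || $2 == "D"'

# If SysRq is enabled, this dumps stack traces of all blocked tasks to
# the kernel log (read them back with dmesg) -- requires root:
# echo w > /proc/sysrq-trigger
```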

I am lost. Which rock should I turn over next? What might be going on here?