
I have a concurrent system with many machines/nodes involved. Each machine runs several JVMs doing different things. It is a "layered" architecture where each layer consists of many JVMs running across the machines. Basically the top-layer JVMs receive input from the outside via files, parse the input and send it on as many small records for "storage" in layer-two. Layer-two doesn't actually persist the data itself; it persists it via layer-three (HBase and Solr), and HBase doesn't really persist it itself either, since it sends it on to layer-four (HDFS) for persistence.

Most of the communication among the layers is synchronous, so of course it ends up with a lot of threads waiting for lower layers to complete. But I would expect those waiting threads to be "free" with respect to CPU usage.
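
To illustrate what I mean by a waiting thread (just a sketch, the class and method names are made up, not our actual code): a typical layer-two handler does a blocking request/response round trip to the layer below and simply sits in the read until the response comes back.

--- hypothetical sketch: blocking call to the layer below ---
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.net.Socket;

public class StoreClient {
    // One call per record; the calling thread blocks until the lower layer acknowledges.
    public byte[] store(String host, int port, byte[] record) throws Exception {
        try (Socket socket = new Socket(host, port)) {
            DataOutputStream out = new DataOutputStream(socket.getOutputStream());
            DataInputStream in = new DataInputStream(socket.getInputStream());
            out.writeInt(record.length);
            out.write(record);
            out.flush();
            int len = in.readInt();      // thread waits here for the lower layer's response
            byte[] ack = new byte[len];
            in.readFully(ack);
            return ack;
        }
    }
}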

I see a very high iowait (%wa in top) though - something like 80-90% iowait and only 10-20% sys/usr CPU usage. The system seems exhausted - it is slow to log in via ssh and slow to respond to commands, etc.

My question is whether all those JVM threads waiting for lower layers to complete can cause this. Isn't waiting for responses (on sockets) supposed to be "free"? Does it matter, in this respect, whether the different layers use blocking or non-blocking (NIO) I/O? Exactly in what situations does Linux count something as iowait (%wa in top)? Only when every thread in every JVM on the machine is waiting (counted because there is no other thread left to run that could do something meaningful in the meantime)? Or do waiting threads also count towards %wa even when there are other processes ready to use the CPU for real processing?
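
By non-blocking (NIO) I/O I mean something along these lines (again just a sketch, nothing from our real code), where a single thread multiplexes many connections on a Selector instead of one thread blocking per outstanding request:

--- hypothetical sketch: non-blocking (NIO) variant ---
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.SocketChannel;
import java.util.Iterator;

public class NioStoreClient {
    public void run(String host, int port) throws Exception {
        Selector selector = Selector.open();
        SocketChannel channel = SocketChannel.open();
        channel.configureBlocking(false);
        channel.connect(new InetSocketAddress(host, port));
        channel.register(selector, SelectionKey.OP_CONNECT);

        ByteBuffer buffer = ByteBuffer.allocate(8192);
        while (true) {
            selector.select();                   // one thread sleeps here for all channels
            Iterator<SelectionKey> keys = selector.selectedKeys().iterator();
            while (keys.hasNext()) {
                SelectionKey key = keys.next();
                keys.remove();
                SocketChannel ch = (SocketChannel) key.channel();
                if (key.isConnectable() && ch.finishConnect()) {
                    key.interestOps(SelectionKey.OP_READ);
                } else if (key.isReadable()) {
                    buffer.clear();
                    ch.read(buffer);             // returns whatever is available, never blocks
                    // hand the response off to whoever is waiting for it
                }
            }
        }
    }
}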

I would really like a thorough explanation of how this works and how to interpret the high %wa. In the beginning I guessed that time counted as %wa when all threads were waiting but there was actually plenty of room for doing more, so I tried to increase the number of threads, expecting to get more throughput, but that didn't happen. So it is a real problem, not just a "visual" problem when looking at top.

The output below is taken from a machine where only HBase and HDFS are running. It is on machines with HBase and/or HDFS that the problem shows (most clearly).

--- jps ---
19498 DataNode
19690 HRegionServer
19327 SecondaryNameNode

---- typical top -------
top - 11:13:21 up 14 days, 18:20,  1 user,  load average: 4.83, 4.50, 4.25
Tasks:  99 total,   1 running,  98 sleeping,   0 stopped,   0 zombie
Cpu(s): 14.1%us,  4.3%sy,  0.0%ni,  5.4%id, 74.8%wa,  0.0%hi,  1.3%si,  0.0%st
Mem:   7133800k total,  7099632k used,    34168k free,    55540k buffers
Swap:   487416k total,      248k used,   487168k free,  2076804k cached
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM TIME+   COMMAND
19690 hbase     20   0 4629m 4.2g 9244 S   51 61.7 194:08.84 java
19498 hdfs      20   0 1030m 116m 9076 S   16  1.7  75:29.26 java

---- iostat -kd 1 ----
root@edrxen1-2:~# iostat -kd 1
Linux 2.6.32-29-server (edrxen1-2)      02/22/2012      _x86_64_        (2 CPU)
Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
xvda              3.53         3.36        15.66    4279502   19973226
dm-0            319.44      6959.14       422.37 8876213913  538720280
dm-1              0.00         0.00         0.00        912        624
xvdb            229.03      6955.81       406.71 8871957888  518747772
Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
xvda              0.00         0.00         0.00          0          0
dm-0            122.00      3852.00         0.00       3852          0
dm-1              0.00         0.00         0.00          0          0
xvdb            105.00      3252.00         0.00       3252          0
Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
xvda              0.00         0.00         0.00          0          0
dm-0             57.00      1712.00         0.00       1712          0
dm-1              0.00         0.00         0.00          0          0
xvdb             78.00      2428.00         0.00       2428          0

--- iostat -x ---
Linux 2.6.32-29-server (edrxen1-2)      02/22/2012      _x86_64_        (2 CPU)
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           8.06    0.00    3.29   65.14    0.08   23.43
Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
xvda              0.00     0.74    0.35    3.18     6.72    31.32    10.78     0.11   30.28   6.24   2.20
dm-0              0.00     0.00  213.15  106.59 13866.95   852.73    46.04     1.29   14.41   2.83  90.58
dm-1              0.00     0.00    0.00    0.00     0.00     0.00     8.00     0.00    5.78   1.12   0.00
xvdb              0.07    86.97  212.73   15.69 13860.27   821.42    64.27     2.44   25.21   3.96  90.47

--- free -o ----
             total       used       free     shared    buffers     cached
Mem:       7133800    7099452      34348          0      55612    2082364
Swap:       487416        248     487168
  • I see a variety of similar question here and there, but this one on ServerFault has some things to try wrt hardware errors: http://serverfault.com/questions/83778/finding-the-root-cause-of-100-iowait-in-linux Here's another along the same lines, that is, maybe there's an error condition, with some other debugging around the issue: http://www.articledashboard.com/Article/Linux-and-High-I-O-Wait/959842 – Don Branson Feb 21 '12 at 18:29
  • Continuing that thought...error conditions are not your issue here, assuming that you're seeing this on multiple physical machines, but the tools in those discussions might give some additional detail about the waits. Having said that, I'm very interested for someone to respond to the "thorough explanation on how it works" part of your question. – Don Branson Feb 21 '12 at 19:04
  • There is a state column in top. What does it show when you view the threads on one box? Can you provide a `top` output? The results of `iostat -kd 1`? The results of `free -o`? – ingyhere Feb 22 '12 at 01:54
  • Well, you might see lots of questions, but I see all but one as deepening questions and "suggested partial answers" :-) THE question is: what conditions in many JVMs on one Linux machine will make the OS count iowait (%wa in top)? – Per Steffensen Feb 22 '12 at 07:04

1 Answer


IO wait on Linux indicates that processes are blocked on uninterruptible I/O. In practice, it typically means that the process is performing disk access -- in this case, I'd guess one of the following:

  • hdfs is performing a lot of disk accesses, and it's making other disk access slow as a result. (Checking iostat -x may help, as it'll show an extra "%util" column which indicates what percentage of the time the disk is "busy".)
  • You're running low on system memory under load, and are ending up dipping into swap sometimes.
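
To make the distinction concrete (rough sketch below, the paths and host names are made up): on Linux a HotSpot JVM maps each Java thread to a native kernel thread, so the kernel tracks the state of every thread individually rather than per JVM process. A thread blocked on a socket read sleeps interruptibly (state S in top) and is counted as idle, while a thread waiting on the block layer -- reads that miss the page cache, fsync, swap-in -- sleeps uninterruptibly (state D), and that time shows up as %wa whenever the CPU has nothing else runnable.

--- rough sketch: socket wait vs. disk wait ---
import java.io.RandomAccessFile;
import java.net.Socket;

public class WaitKinds {
    public static void main(String[] args) throws Exception {
        // Socket wait: interruptible sleep (S) -- counted as idle, not as iowait.
        try (Socket s = new Socket("layer3-host", 9090)) {    // hypothetical host/port
            s.getInputStream().read();     // blocks until the peer sends something
        }

        // Disk wait: uninterruptible sleep (D) while the block layer does its work --
        // if no other thread is runnable on the CPU, this time is counted as %wa.
        try (RandomAccessFile f = new RandomAccessFile("/data/hbase/somefile", "rw")) {
            byte[] block = new byte[64 * 1024];
            f.read(block);                 // a page-cache miss waits on the disk
            f.write(block);
            f.getFD().sync();              // force dirty pages to disk and wait for it
        }
    }
}
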
  • Thanks for responding. I have added output from "iostat -x" to the original post. – Per Steffensen Feb 22 '12 at 12:01
  • I knew what is considered IO wait seen from the OS side - "uninterruptible I/O". But it doesn't make it clear what kind of stuff in Java code makes a thread do "uninterruptible I/O". Besides that, a JVM runs several threads, which typically do not map 1-1 with OS processes. So one OS process will be executing the work of many JVM threads. So how do threads doing "unint I/O" translate to the process being counted as doing "unint I/O" - when all threads are doing unint I/O, or when some threads are doing it? Or? That was the essence of the question. – Per Steffensen Feb 22 '12 at 12:09
  • The iostat output says your disks have been 90% busy, on average, during the time the machine's been up -- they're giving you all they've got. Time for more, faster disks! – duskwuff Feb 22 '12 at 15:41
  • No doubt about it, duskwuff. Or time for optimizing the flow to use much less disk I/O – Per Steffensen Feb 23 '12 at 07:09