
I manage a computer cluster used for scientific computing. Some processes on the cluster do heavy I/O. I have now found one such process that behaves as follows:

  1. Its state changes rapidly between DOWN and RUN, and its CPU usage changes rapidly between 1% and 100%.
  2. In the top output, iowait is 0% and idle is about 90%.

I think this process may have a problem, but the process owner claims that it is running properly because it is still writing data to disk.

More Info:

  1. The process is writing data to a remote disk mounted on /home.
  2. The process is based on slightly modified code. The original software supports multi-threading but needs a huge amount of memory. The modified code uses more disk and less memory, but the person who modified it does not know anything about multi-threading.
  3. Small tests show that the code gives the correct result.

Questions:

  1. Why is the process not using 100% of the CPU, and if that is because it is waiting for I/O, why is iowait 0%?
  2. How can I judge whether the process has a problem, and what type of problem it is?
atbug

2 Answers

 The process is writing data to a remote disk mounted on /home

There's probably your answer. Process state D is not DOWN; it is uninterruptible sleep, and it typically means the process is waiting for some I/O to finish. As you have a network share, depending on conditions the wait might not show up as I/O wait, and the process might not consume much CPU while your system is waiting.

However, for you and your application things will go very slowly if the network share is slow, whether due to the way the application is writing, due to the network, or due to the file server's performance.
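To see how much of its time the process really spends in state D, one option is to sample the state field of /proc/<pid>/stat repeatedly. The following is only a minimal sketch, assuming a Linux system with /proc mounted; the function names `parse_state` and `sample_states` and the default sampling interval are made up for illustration.

```python
import time
from collections import Counter


def parse_state(stat_line: str) -> str:
    """Return the one-letter state (R, S, D, ...) from a /proc/<pid>/stat line."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or parentheses, so split only after the LAST ')'.
    rest = stat_line[stat_line.rfind(")") + 1:].split()
    return rest[0]


def sample_states(pid: int, samples: int = 100, interval: float = 0.05) -> Counter:
    """Sample the process state `samples` times and count each state seen."""
    counts = Counter()
    for _ in range(samples):
        with open(f"/proc/{pid}/stat") as f:
            counts[parse_state(f.read())] += 1
        time.sleep(interval)
    return counts
```

A result dominated by D would be consistent with the process mostly waiting on I/O. Note that /proc/<pid>/stat describes the main thread only; for a multi-threaded process the per-thread files live under /proc/<pid>/task/<tid>/stat.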

How do you find out whether your application or the network share is the reason? Simple: test the network share's performance with other tools and other usage patterns. Copy lots of data from /home to some other location and back, run some benchmarks such as iozone, test the raw network performance with iperf, stuff like that.
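As a very rough stand-in for a proper benchmark, a sequential write test can be sketched in a few lines of Python. This is an illustration only, not a replacement for iozone or iperf: the function name `write_throughput`, the temporary file name, and the default sizes are all assumptions, and a single sequential write says little about small or random I/O patterns.

```python
import os
import time


def write_throughput(directory: str, total_mb: int = 64, block_kb: int = 1024) -> float:
    """Write about total_mb MiB into `directory`, fsync, and return MiB/s."""
    path = os.path.join(directory, f"throughput_test_{os.getpid()}.tmp")
    block = b"\0" * (block_kb * 1024)
    n_blocks = (total_mb * 1024) // block_kb
    written_mb = n_blocks * block_kb / 1024
    try:
        start = time.perf_counter()
        with open(path, "wb") as f:
            for _ in range(n_blocks):
                f.write(block)
            f.flush()
            # Force the data out of the local page cache so the timing
            # includes the transfer to the underlying (possibly remote) storage.
            os.fsync(f.fileno())
        elapsed = time.perf_counter() - start
    finally:
        # Always clean up the test file, even if the write failed.
        if os.path.exists(path):
            os.remove(path)
    return written_mb / elapsed
```

Running it once against a directory on the network share and once against local disk, and varying `block_kb`, gives a first impression of whether small writes to the share are disproportionately slow.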

If those give you reasonable results, then go and see what your application is doing.

Many times the reason lies elsewhere, though. Without knowing anything about your system, I would guess that you need to tune the NFS mount settings. But for now that remains just a guess, as I don't know if you even have NFS in use.

Janne Pikkarainen

If the CPU is not busy, then your process is presumably waiting for something external. I'd imagine there's a good chance you'd make sense of it by looking at what system calls are taking longest with strace.

Failing that, try using a profiler to find out what the code is doing.

Does your code use mmap'd IO? I'm thinking that might not get reported as iowait time against your process, but would turn up as a system process using a lot of disk as it flushes pages to disk.

mc0e
  • How to get the io usage of the process then? – atbug May 05 '15 at 05:12
  • And I want to know if there is a way to decide whether the process is running properly as it is, because I cannot ask the process owner to debug it. – atbug May 05 '15 at 05:42
  • I mostly use atop to see what's using how much IO, though as noted, paging activity, including mmap'd files, makes things difficult. I really think, though, that your first port of call should be strace. Pay attention to the '-T' and '-f' options. You'll then need to process the output a bit to add things up. Being a perl programmer from way back, I use snippets of perl on the command line for that, but use whatever is comfortable for you. – mc0e May 07 '15 at 16:25
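The "add things up" step from the last comment could look roughly like the Python sketch below, as an alternative to perl one-liners. It assumes the trace was captured with something like `strace -f -T -p <pid> -o trace.txt`, so each completed call ends with its duration in angle brackets; the regular expression and the function name `total_time_per_syscall` are illustrative and may need adjusting for other strace options or versions.

```python
import re
from collections import defaultdict

# Matches lines such as:
#   write(3, "abc", 3) = 3 <0.000100>
#   [pid  1234] write(3, "abc", 3) = 3 <0.500000>
#   1234 <... read resumed> "x", 1) = 1 <2.000000>
LINE_RE = re.compile(
    r"^(?:\[pid\s+\d+\]\s+|\d+\s+)?"                        # optional pid prefix
    r"(?:(?P<name>\w+)\(|<\.\.\. (?P<resumed>\w+) resumed>)"  # syscall name
    r".*<(?P<secs>\d+\.\d+)>\s*$"                           # duration added by -T
)


def total_time_per_syscall(lines):
    """Sum the time spent per system call name across strace -T output lines."""
    totals = defaultdict(float)
    for line in lines:
        m = LINE_RE.match(line)
        if m:  # skips signals, exit markers and "<unfinished ...>" lines
            name = m.group("name") or m.group("resumed")
            totals[name] += float(m.group("secs"))
    return dict(totals)
```

Passing it an open trace file, for example `total_time_per_syscall(open("trace.txt"))`, and sorting the result by value shows which calls the process spends most of its wall-clock time in.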