1

I am currently analyzing a command on Solaris that reads from a file and has very low performance. The truss -D commands shows me read system calls taking up to 0.03 seconds, but when I use truss -E, they are always either 0.0000 or 0.0001 (two orders of magnitude lower than with the -D option). In the man page, it says:

 -D
     Includes a time delta on each line of trace output.  The
     value appears as a field containing seconds.fraction and
     represents the elapsed time for the  LWP  that  incurred
     the event since the last reported event incurred by that
     LWP. Specifically, for system calls,  this  is  not  the
     time spent within the system call.

 -E
     Includes a time delta on each line of trace output.  The
     value appears as a field containing seconds.fraction and
     represents the difference in time  elapsed  between  the
     beginning and end of a system call.

     In contrast to  the -D option, this  is  the  amount  of
     time spent within the system call.

So the -E option measures the actual time spent within the system call, while -D does not... Can anyone explain what exactly is it that makes that difference? What is being done in the remaining time "outside" the system call?

dokaspar
  • 165
  • 1
  • 2
  • 8

2 Answers2

2

The time outside the system call is the time spent running your program's code before it gets to the next system call.

alanc
  • 1,500
  • 9
  • 12
2

Based on the documentation you quoted I find it quite clear that one is covering the entire duration from one system call to the next, while the other covers only the time within a system call.

The percentage of time spent inside system calls versus the percentage of time spent outside system calls will roughly tell you if a process is CPU bound.

A CPU bound process spend most of its time outside system calls. This is the state a process will be in, while it is performing computations. For a CPU bound process the difference between the two numbers will be large, and probably by at least an order of magnitude.

A process which is not CPU bound will be blocked waiting for events most of the time. Since blocking can only happen inside system calls. For a process which is not CPU bound, the numbers will be approximately identical (probably only differing by a one digit percentage).

That was the simple explanation, in reality there is a few more aspects to consider. Due to memory mapping and swapping, a process can actually block without being in a system calls. Additionally the kernel could offer features, which involve computations inside kernel code. That could lead to a process spending most time inside system calls and still be CPU bound. The latter for example could happen when using encrypted file systems.

kasperd
  • 30,455
  • 17
  • 76
  • 124
  • This answer is not really Solaris specific. What I am explaining would apply to pretty much any unix system. – kasperd Nov 21 '14 at 21:10
  • 1
    Now I'm confused, because the slow process that I'm analyzing is not using any noticeable CPU resources. It is spending a long time reading and seeking through files, the slowness seems to be caused by disk problems... – dokaspar Nov 21 '14 at 21:43
  • @Dominik I reread the quoted documentation and my answer, and I didn't find details which did not add up. But if you are seeing other data contradicting the conclusion that the process is CPU bound, then surely something must be amiss somewhere. And it may be that the documentation you quoted is erroneous or incomplete. If that turns out to be the case, then you don't just need somebody who can tell you how to read that documentation, but somebody who knows in which places the documentation is incorrect. And that would mean very specific Solaris knowledge (my knowledge of Solaris is not that d – kasperd Nov 21 '14 at 22:35