I have a long-running task, a Spark job (a map/reduce variant), that runs locally on Mac OS (Mac Pro, 12 cores). The time command returned:

./run  278.74s user 83.56s system 0% cpu 13:50:45.69 total

Does this mean that most of the time the job spent waiting for IO?

Alfa07

1 Answer

The output tells you that you were blocked on something for most of the runtime of the process. It could be I/O to the disk or network, but it could be calls to sleep() or anything else that takes your process off-CPU. Because the total number in the output is "real time since the beginning of the process," the discrepancy could even be caused by other tasks on the system hogging all the resources and not giving your process a chance to run. Also, keep in mind that any time spent waiting for user input counts towards the total time, but you wouldn't normally consider that to be I/O bound (unless you consider the user to be a very slow device, but that's another discussion entirely).

In other words, the only thing you know is that your process by itself is not CPU-bound.
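
To put concrete numbers on that, using the figures from your question: the job used roughly 362 seconds of CPU across almost 14 hours of wall-clock time, which is why time reports 0% cpu:

    user + system CPU time:   278.74 s + 83.56 s  ≈  362 s
    total elapsed (real):     13:50:45.69         ≈  49,846 s
    CPU utilization:          362 / 49,846        ≈  0.7%   (shown as "0% cpu")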

To figure out whether your process is doing a lot of I/O, you should probably use Activity Monitor's "Disk Activity" and "Network" screens to see if your system is maxing those out. A more general solution would be to profile your code.
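
If you prefer the command line, macOS also ships iostat and nettop, which give a rough live view of disk and network activity while the job runs (exact flags vary between OS versions, so treat these as a starting point):

$ iostat -w 1          # disk throughput, one sample per second
$ nettop -m tcp        # per-process TCP traffic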

I frequently use DTrace for profiling on Mac OS, so here's a script that will tell you how many nanoseconds your process spent in each syscall. For simple programs, it's usually obvious from which syscalls dominate why the time is going where it is.

$ sudo dtrace -n '
    syscall:::entry /pid == $target/ {
        self->ts = timestamp;
    }
    syscall:::return /pid == $target && self->ts/ {
        @[probefunc] = sum(timestamp - self->ts);
        self->ts = 0;
    }
' -c java                          # <-- put './run' here instead of 'java'

dtrace: description 'syscall:::entry ' matched 880 probes
dtrace: pid 38883 has exited
  ...
  close_nocancel                                               208017
  workq_kernreturn                                             222226
  open_nocancel                                               2569800
  open                                                        3098793
  read                                                        3467981
  stat64                                                      3641553
  getdirentries64                                             4212212
  kevent                                                     26820366

As you can see, filesystem calls take a lot of time during JVM startup. If one syscall is taking a long time to complete and you're not sure why, you can get stack traces showing how it's being called with this next script (replace read with whatever syscall is taking a long time):

$ sudo dtrace -n '
    syscall::read:entry /pid == $target/ {
        self->ts = timestamp;
    }
    syscall::read:return /pid == $target && self->ts/ {
        @[ustack()] = sum(timestamp - self->ts);
        self->ts = 0;
    }
' -c java

# output truncated -- you'll get a bunch of stack traces
# sorted by the amount of time each one used
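
A plain sum can also mislead you: a syscall may rank high because it is called very often, not because individual calls are slow. To tell those apart, you can aggregate a call count and a latency histogram instead (a sketch for read; swap in whatever syscall and command you care about):

$ sudo dtrace -n '
    syscall::read:entry /pid == $target/ {
        self->ts = timestamp;
    }
    syscall::read:return /pid == $target && self->ts/ {
        @calls = count();                         /* number of read calls */
        @ns    = quantize(timestamp - self->ts);  /* latency histogram, ns */
        self->ts = 0;
    }
' -c java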

There are also many other ways to use DTrace - learning to use it well can make debugging code and performance issues much easier.
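
For example, this one-liner (a sketch; the open probe names can differ between macOS releases) counts which paths a process opens, which is handy when open or stat64 dominates the totals:

$ sudo dtrace -n '
    syscall::open*:entry /pid == $target/ {
        @[copyinstr(arg0)] = count();   /* arg0 is the path being opened */
    }
' -c java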

Dan