-1

I have a large number of small jobs that I need to run. If I run them on a 6-core Broadwell Xeon, it runs with at least 80-90% userland CPU.

If I run the same workload on a box with 2×16-core Broadwell CPUs and scale up the number of jobs, I end up with 80% system CPU use, and the throughput is only about a factor of 3 higher than the single 6-core CPU, despite having 5x the cores and a faster clock.

Any suggestions to improve on this?

EDIT

The problem seems to become especially bad if the jobs are below a certain size; if they run on slightly larger data sets, the system CPU use doesn't go as high. This leads me to suspect that there is some limit on the rate at which BSD can spawn processes.
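
One quick way to test that suspicion (a rough sketch, not the real workload) is to time a few thousand trivial fork/exec cycles on each box, both idle and while the jobs are running, and compare:

    # rough spawn-rate check: time 2000 trivial fork/exec cycles
    time sh -c 'i=0; while [ $i -lt 2000 ]; do /usr/bin/true; i=$((i+1)); done'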

As suggested below, running

/usr/share/dtrace/toolkit/procsystime

gives these top entries on the 2×16-core machines:

    readlink        80898169570
      select       128032327883
      execve       215209078214
       wait4      2127022159693
        read      2545974471446

and on the 6-core machines:

    _umtx_op         5997915963
      select         8746697465
        read        59777849114
       wait4        61693132566

which doesn't seem to be enough of a difference to account for this non-linear scaling.

EDIT

When the system is under this load, running uname in a loop takes half a second per execution, versus milliseconds when the machine is idle. There seems to be some kind of kernel issue here.
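
One way to see where that half second goes (a sketch; check the truss flags on your FreeBSD version) is to trace a single execution with timestamps while the machine is under load:

    # trace one uname run with timestamps (-d); long gaps between the
    # early syscalls point at where the kernel is spending the time
    truss -d uname > /dev/null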

camelccc
  • Look at `top -SH` while jobs are running and see how CPU usage by kernel threads changes with an increasing number of jobs. – citrin Nov 13 '16 at 16:13
  • Tried that. It doesn't look very different from top without -SH. – camelccc Nov 13 '16 at 17:09
  • There are a lot of factors involved. Are these jobs independent? Do they share resources like disk I/O or memory? What are they? Is this software developed by your company, or is it a secret? It's impossible to answer this question with the information given, and likely there is no decent answer available. – hookenz Nov 13 '16 at 19:10

2 Answers

2

Profiling will show what is taking up CPU time. Since significant time is being spent in system, focusing on system calls may find the culprit.

DTrace is helpful for this. /usr/share/dtrace/toolkit/procsystime will show CPU time by system call. If you need more detail, the author has tools for flame graph visualizations.
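
For example, something along these lines (a sketch; "jobname" is a placeholder for the actual job binary) gives system-wide syscall counts and per-syscall time totals while the jobs run:

    # system-wide syscall counts; Ctrl-C after the jobs have run for a while
    dtrace -n 'syscall:::entry { @num[probefunc] = count(); }'

    # per-syscall time totals for processes named "jobname"
    /usr/share/dtrace/toolkit/procsystime -aT -n jobname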

John Mahowald
  • Added the output of that to the question. – camelccc Nov 18 '16 at 00:00
  • If it isn't apparent, try the flame graph visualizations I linked. Sample user or kernel until you understand where things are spending time. Look at the source code of the running jobs, if available. – John Mahowald Nov 18 '16 at 11:22
1

After trying to trace the source of this and finding a lot of inconsistency, I observed that the system time starts to climb very quickly once the CPU load exceeds 50%. I therefore tried disabling hyperthreading in the BIOS, and the problem went away; the throughput of the machine went way up.

Clearly BSD and hyperthreading don't play nicely, at least for this type of workload. Throughput increased by around 120% once hyperthreading was disabled.
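
If the BIOS isn't accessible, FreeBSD can also be told to keep the hyperthread logical CPUs away from the scheduler with a loader tunable (a sketch; verify the tunable exists on your FreeBSD release before relying on it):

    # /boot/loader.conf: don't schedule jobs on HTT logical CPUs
    machdep.hyperthreading_allowed="0"

    # check the CPU topology; THREAD groups mark hyperthread siblings
    sysctl kern.sched.topology_spec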

camelccc