I have a large number of small jobs that I need to run. On a 6-core Broadwell Xeon they run at 80-90% (or more) userland CPU.
If I run the same workload on a box with 2x 16-core Broadwell CPUs and scale up the number of concurrent jobs, I end up at around 80% system CPU, and the throughput is only about 3x that of the single 6-core CPU, despite the bigger box having over 5x the cores and a higher clock.
Any suggestions to improve on this?
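To see the user/system split I'm describing, something as simple as per-second vmstat output (us/sy/id columns) is enough; the 1-second interval below is arbitrary:

# watch the user/system/idle CPU split (us/sy/id columns) once per second
vmstat 1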
EDIT
The problem seems to become especially bad when the jobs are below a certain size: if they run on slightly larger data sets, system CPU use doesn't climb as high. This leads me to suspect there is some limit on the rate at which BSD can spawn processes.
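A crude way to test that suspicion is to time raw fork/exec throughput with a trivial binary; the loop below is only a sketch (the count of 1000 and the use of /usr/bin/true are arbitrary, not my actual jobs):

# spawn 1000 trivial processes and time how long it takes;
# compare the result on an idle machine vs. one running the real workload
time sh -c 'i=0; while [ "$i" -lt 1000 ]; do /usr/bin/true; i=$((i+1)); done'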
As suggested below,
/usr/share/dtrace/toolkit/procsystime
gives the following top entries on the 2x16-core machines:
readlink 80898169570
select 128032327883
execve 215209078214
wait4 2127022159693
read 2545974471446
and on the 6-core machines:
_umtx_op 5997915963
select 8746697465
read 59777849114
wait4 61693132566
which doesn't seem to be enough of a difference to account for this non-linear scaling.
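(For reference, I believe the toolkit script is pointed at processes by name or PID, along the lines of the following; "jobname" is a placeholder, not my real process name.)

# collect per-syscall elapsed time for processes named "jobname"
/usr/share/dtrace/toolkit/procsystime -n jobname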
EDIT
When the system is under this load, running uname in a loop takes half a second per execution, vs. milliseconds when the machine is idle. There seems to be some kind of kernel issue here.
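The half-second figure comes from timing it repeatedly, roughly like this (the exact form of the loop doesn't matter):

# print real/user/sys time for each uname execution while the box is under load
while :; do /usr/bin/time uname > /dev/null; done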