I am writing MATLAB code to perform a 3-dimensional integral:
function [ fint ] = int3d_ser(R0, Rf, N)
    Nr = N;
    Nt = round(pi*N);
    Np = round(2*pi*N);

    rs = linspace(R0, Rf, Nr);
    ts = linspace(0, pi, Nt);
    ps = linspace(0, 2*pi, Np);

    dr = rs(2)-rs(1);
    dt = ts(2)-ts(1);
    dp = ps(2)-ps(1);

    C = 1/((4/3)*pi);

    fint = 0.0;
    for ir = 2:Nr
        r = rs(ir);
        r2dr = r*r*dr;
        for it = 1:Nt-1
            t = ts(it);
            sintdt = sin(t)*dt;
            for ip = 1:Np-1
                p = ps(ip);
                fint = fint + C*r2dr*sintdt*dp;
            end
        end
    end
end
For the associated int3d_par (parfor) version, I open a MATLAB pool and just replace the for with a parfor. I get pretty decent speedup when I run it on more cores (my tests are from 2 to 8 cores).
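To be concrete, the parfor version is essentially the same function with the outer radial loop parallelized; a minimal sketch, assuming nothing else changes and a pool has already been opened (e.g. with parpool(ncores)):

function [ fint ] = int3d_par(R0, Rf, N)
    % identical grid setup to int3d_ser
    Nr = N;
    Nt = round(pi*N);
    Np = round(2*pi*N);
    rs = linspace(R0, Rf, Nr);
    ts = linspace(0, pi, Nt);
    ps = linspace(0, 2*pi, Np);
    dr = rs(2)-rs(1);
    dt = ts(2)-ts(1);
    dp = ps(2)-ps(1);
    C = 1/((4/3)*pi);

    fint = 0.0;
    parfor ir = 2:Nr   % only change: parallelize over the radial shells
        r = rs(ir);
        r2dr = r*r*dr;
        for it = 1:Nt-1
            t = ts(it);
            sintdt = sin(t)*dt;
            for ip = 1:Np-1
                p = ps(ip);
                fint = fint + C*r2dr*sintdt*dp;   % fint acts as a reduction variable
            end
        end
    end
end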
However, when I run the same integration in batch mode with:
function [fint] = int3d_batch_cluster(R0, Rf, N, cluster, ncores)
    %%% note: This will not give back the same value as the serial or parpool version.
    %%% If this was a legit integration, I would worry more about even dispersion
    %%% of integration nodes per core, but I just want to benchmark right now so ... meh

    Nr = N;
    Nt = round(pi*N);
    Np = round(2*pi*N);

    rs = linspace(R0, Rf, Nr);
    ts = linspace(0, pi, Nt);
    ps = linspace(0, 2*pi, Np);

    dr = rs(2)-rs(1);
    dt = ts(2)-ts(1);
    dp = ps(2)-ps(1);

    C = 1/((4/3)*pi);

    rns = floor( Nr/ncores )*ones(ncores,1);
    RNS = zeros(ncores,1);
    for icore = 1:ncores
        if(sum(rns) ~= Nr)
            rns(icore) = rns(icore)+1;
        end
    end
    RNS(1) = rns(1);
    for icore = 2:ncores
        RNS(icore) = RNS(icore-1)+rns(icore);
    end
    rfs = rs(RNS);
    r0s = zeros(ncores,1);
    r0s(2:end) = rfs(1:end-1);

    j = createJob(cluster);
    for icore = 1:ncores
        r0 = r0s(icore);
        rf = rfs(icore);
        rn = rns(icore);
        trs = linspace(r0, rf, rn);
        t{icore} = createTask(j, @int3d_ser, 1, {r0, rf, rn});
    end
    submit(j);
    wait(j);
    fints = fetchOutputs(j);

    fint = 0.0;
    for ifint = 1:length(fints)
        fint = fint + fints{ifint};
    end
end
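For context, the batch version expects a cluster object, so it gets driven with something like the following (a sketch; the profile name 'MyCluster', the bounds, and the core count are just placeholders):

    % hypothetical driver for the batch version -- profile name and arguments are placeholders
    cluster = parcluster('MyCluster');
    ncores  = 8;
    fint    = int3d_batch_cluster(0, 1, 2000, cluster, ncores);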
I notice that it is much, much faster. Why would doing this integration in batch mode be so different from doing it with parfor?
For reference, I test the code with N ranging from small numbers like 10 and 20 (to get the constant in the polynomial approximation of the runtime) to larger numbers like 1000 and 2000. This algorithm will scale cubically, since I assign the number of integration nodes in the theta and phi directions to be constant multiples of the given N: the total node count is Nr*Nt*Np = N * round(pi*N) * round(2*pi*N), i.e. on the order of 2*pi^2*N^3.
For 2000 nodes, the parfor version takes about 630 seconds, while the same number of nodes in batch mode takes about 19 seconds (of which around 12 seconds is pure communication overhead that we also see with just 10 integration nodes).
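In case the methodology matters, the timings come from simple tic/toc wrappers and a sweep over N, roughly like the following sketch (the N values, core count, and profile name here are placeholders, not my exact setup):

    % sketch of the scaling test: time the batch version for several N and
    % fit runtime ~ a*N^3 + overhead; all specific values here are placeholders
    Ns = [10 20 1000 2000];
    t_batch = zeros(size(Ns));
    cluster = parcluster('MyCluster');
    for k = 1:numel(Ns)
        tic;
        int3d_batch_cluster(0, 1, Ns(k), cluster, 8);
        t_batch(k) = toc;
    end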