I am writing MATLAB code to perform a 3-dimensional integral:
function [ fint ] = int3d_ser(R0, Rf, N)
    Nr = N;
    Nt = round(pi*N);
    Np = round(2*pi*N);

    rs = linspace(R0, Rf, Nr);
    ts = linspace(0, pi, Nt);
    ps = linspace(0, 2*pi, Np);

    dr = rs(2)-rs(1);
    dt = ts(2)-ts(1);
    dp = ps(2)-ps(1);

    C = 1/((4/3)*pi);

    fint = 0.0;
    for ir = 2:Nr
        r = rs(ir);
        r2dr = r*r*dr;
        for it = 1:Nt-1
            t = ts(it);
            sintdt = sin(t)*dt;
            for ip = 1:Np-1
                p = ps(ip);
                fint = fint + C*r2dr*sintdt*dp;
            end
        end
    end
end
For the associated int3d_par (parfor) version, I open a MATLAB pool and just replace the for with a parfor. I get pretty decent speedup when I run it on more cores (my tests are from 2 to 8 cores).
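To be concrete, the parfor version is essentially the same function with the outer radial loop parallelized; a minimal sketch, assuming nothing else changes and a pool has already been opened (e.g. with parpool(ncores)):

function [ fint ] = int3d_par(R0, Rf, N)
    % identical grid setup to int3d_ser
    Nr = N;
    Nt = round(pi*N);
    Np = round(2*pi*N);
    rs = linspace(R0, Rf, Nr);
    ts = linspace(0, pi, Nt);
    ps = linspace(0, 2*pi, Np);
    dr = rs(2)-rs(1);
    dt = ts(2)-ts(1);
    dp = ps(2)-ps(1);
    C = 1/((4/3)*pi);

    fint = 0.0;
    parfor ir = 2:Nr   % only change: parallelize over the radial shells
        r = rs(ir);
        r2dr = r*r*dr;
        for it = 1:Nt-1
            t = ts(it);
            sintdt = sin(t)*dt;
            for ip = 1:Np-1
                p = ps(ip);
                fint = fint + C*r2dr*sintdt*dp;   % fint acts as a reduction variable
            end
        end
    end
end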
However, when I run the same integration in batch mode with:
function [fint] = int3d_batch_cluster(R0, Rf, N, cluster, ncores)
    %%% note: This will not give back the same value as the serial or parpool version.
    %%% If this was a legit integration, I would worry more about even dispersion
    %%% of integration nodes per core, but I just want to benchmark right now so ... meh

    Nr = N;
    Nt = round(pi*N);
    Np = round(2*pi*N);

    rs = linspace(R0, Rf, Nr);
    ts = linspace(0, pi, Nt);
    ps = linspace(0, 2*pi, Np);

    dr = rs(2)-rs(1);
    dt = ts(2)-ts(1);
    dp = ps(2)-ps(1);

    C = 1/((4/3)*pi);

    rns = floor( Nr/ncores )*ones(ncores,1);
    RNS = zeros(ncores,1);
    for icore = 1:ncores
        if(sum(rns) ~= Nr)
            rns(icore) = rns(icore)+1;
        end
    end
    RNS(1) = rns(1);
    for icore = 2:ncores
        RNS(icore) = RNS(icore-1)+rns(icore);
    end
    rfs = rs(RNS);
    r0s = zeros(ncores,1);
    r0s(2:end) = rfs(1:end-1);

    j = createJob(cluster);
    for icore = 1:ncores
        r0 = r0s(icore);
        rf = rfs(icore);
        rn = rns(icore);
        trs = linspace(r0, rf, rn);
        t{icore} = createTask(j, @int3d_ser, 1, {r0, rf, rn});
    end
    submit(j);
    wait(j);
    fints = fetchOutputs(j);

    fint = 0.0;
    for ifint = 1:length(fints)
        fint = fint + fints{ifint};
    end
end
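For context, the batch version expects a cluster object, so it gets driven with something like the following (a sketch; the profile name 'MyCluster', the bounds, and the core count are just placeholders):

    % hypothetical driver for the batch version -- profile name and arguments are placeholders
    cluster = parcluster('MyCluster');
    ncores  = 8;
    fint    = int3d_batch_cluster(0, 1, 2000, cluster, ncores);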
I notice that it is much, much faster. Why would doing this integration in batch mode be so different from doing it with parfor?
For reference, I test the code with N ranging from small numbers like 10 and 20 (to get the constant in the polynomial approximation of the runtime) to larger numbers like 1000 and 2000. This algorithm will scale cubically, since I assign the number of integration nodes in the theta and phi directions to be constant multiples of the given N: the total node count is Nr*Nt*Np = N * round(pi*N) * round(2*pi*N), i.e. on the order of 2*pi^2*N^3.
For 2000 nodes, the parfor version takes about 630 seconds, while the same number of nodes in batch mode takes about 19 seconds (of which around 12 seconds is pure communication overhead that we also see with just 10 integration nodes).
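In case the methodology matters, the timings come from simple tic/toc wrappers and a sweep over N, roughly like the following sketch (the N values, core count, and profile name here are placeholders, not my exact setup):

    % sketch of the scaling test: time the batch version for several N and
    % fit runtime ~ a*N^3 + overhead; all specific values here are placeholders
    Ns = [10 20 1000 2000];
    t_batch = zeros(size(Ns));
    cluster = parcluster('MyCluster');
    for k = 1:numel(Ns)
        tic;
        int3d_batch_cluster(0, 1, Ns(k), cluster, 8);
        t_batch(k) = toc;
    end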