
I'm solving a PDE with an implicit scheme. At every time step the problem splits into two matrix systems that are coupled by a boundary condition (also applied at every time step). I'm trying to speed this up by using multi-processing to invert both matrices at the same time.

Here's what this looks like in a minimal (non-PDE-solving) example.

using Distributed
using LinearAlgebra

function backslash(N, T, b, exec)
    A = zeros(N,N)
    α = 0.1
    for i in 1:N, j in 1:N
        abs(i-j)<=1 && (A[i,j]+=-α)
        i==j && (A[i,j]+=3*α+1)
    end

    A = Tridiagonal(A)
    a = zeros(N, 4, T)

    if exec == "parallel"
        for i = 1:T
            @distributed for j = 1:2
                a[:, j, i] = A\b[:, i]
            end
        end
    elseif exec == "single"
        for i = 1:T
            for j = 1:2
                a[:, j, i] = A\b[:, i]
            end
        end
    end
    return a
end

b = rand(1000, 1000)

a_single = @time backslash(1000, 1000, b, "single");
a_parallel = @time backslash(1000, 1000, b, "parallel");

a_single == a_parallel

Here comes the problem: the last line evaluates to true, with a 6-fold speed-up; however, only a 2-fold speed-up should be possible. What am I getting wrong?

lhcgeneva

1 Answer

  1. You are measuring compile time (see the sketch right after this list)
  2. Your @distributed loop exits prematurely
  3. Your @distributed loop does not collect the results
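
For point 1, the first call to a function includes JIT compilation, which @time reports together with the run time. A minimal sketch of how to keep compilation out of the measurement (using BenchmarkTools is an assumption here; a plain warm-up call works as well):

using BenchmarkTools

backslash(1000, 1000, b, "single")           # warm-up call pays the compilation cost once
@time backslash(1000, 1000, b, "single");    # now @time reports only the run time
@btime backslash(1000, 1000, $b, "single");  # or let BenchmarkTools run several samples for you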

For points 2 and 3, the parallel results are never collected, hence you get:

julia> addprocs(2);

julia> sum(backslash(1000, 1000, b, "single")), sum(backslash(1000, 1000, b, "parallel"))
(999810.3418359067, 0.0)

So, in order to make your code work, you need to collect the data from the distributed loop, which can be done as follows:

function backslash2(N, T, b, exec)
    A = zeros(N,N)
    α = 0.1
    for i in 1:N, j in 1:N
        abs(i-j)<=1 && (A[i,j]+=-α)
        i==j && (A[i,j]+=3*α+1)
    end

    A = Tridiagonal(A)
    a = zeros(N, 4, T)

    if exec == :parallel
        for i = 1:T
            aj = @distributed (append!) for j = 1:2
                [A\b[:, i]]
            end
            # you could consider using SharedArrays instead
            a[:, 1, i] .= aj[1]
            a[:, 2, i] .= aj[2]
        end
    elseif exec == :single
        for i = 1:T
            for j = 1:2
                a[:, j, i] = A\b[:, i]
            end
        end
    end
    return a
end

Now you have equal results:

julia>  sum(backslash2(1000, 1000, b, :single)) == sum(backslash2(1000, 1000, b, :parallel))
true
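
As the comment inside backslash2 hints, a SharedArray lets the worker processes write directly into shared memory, so no reducer is needed. A minimal sketch of that variant (backslash_shared is a made-up name; it assumes addprocs(2) has been called and all processes run on the same machine):

using Distributed, SharedArrays
@everywhere using LinearAlgebra

function backslash_shared(N, T, b)
    α = 0.1
    # same tridiagonal system as in the question: -α on the off-diagonals, -α + 3α + 1 on the diagonal
    A = Tridiagonal(fill(-α, N-1), fill(2α + 1, N), fill(-α, N-1))
    a = SharedArray{Float64}(N, 4, T)      # backed by shared memory, visible to all local workers
    for i = 1:T
        @sync @distributed for j = 1:2     # @sync blocks until both workers have written their slice
            a[:, j, i] = A \ b[:, i]
        end
    end
    return a
end

If you need a plain Array afterwards, sdata(backslash_shared(1000, 1000, b)) converts the result back.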

However, @distributed is very inefficient for loop bodies that only take a few milliseconds to execute. In this example the @distributed version will take many times longer than the serial one, because you dispatch a distributed job 1000 times and each dispatch alone costs on the order of a few milliseconds.

Perhaps your production task takes longer, and then it makes sense. Or you might consider Threads.@threads instead.
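
For completeness, a sketch of a multithreaded version (backslash_threaded is a made-up name; it assumes Julia was started with at least two threads, e.g. julia -t 2). Threads share memory, so the plain a array can be filled directly, very much like the original single-process code:

using LinearAlgebra

function backslash_threaded(N, T, b)
    α = 0.1
    A = Tridiagonal(fill(-α, N-1), fill(2α + 1, N), fill(-α, N-1))
    a = zeros(N, 4, T)
    for i = 1:T
        Threads.@threads for j = 1:2   # each thread writes a disjoint slice of a, no copying or reduction
            a[:, j, i] = A \ b[:, i]
        end
    end
    return a
end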

Last but not least, BLAS might already be configured to be multi-threaded, and in that case, on a single machine, adding your own parallelization might bring no benefit (it depends on the scenario).
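
You can check and change the BLAS thread count from the LinearAlgebra standard library (get_num_threads needs a reasonably recent Julia version):

using LinearAlgebra

BLAS.get_num_threads()     # how many threads BLAS is currently allowed to use
BLAS.set_num_threads(1)    # e.g. pin BLAS to one thread before benchmarking your own parallelism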

Przemyslaw Szufel
  • Thanks! I understand that [@time](https://docs.julialang.org/en/v1/manual/performance-tips/) measures both compile time and execution time. However, why would it be faster in the case where @distributed is prepended (although as you point out it doesn't collect the results)? It still clearly does the computation, no? I'll check out threads! – lhcgeneva Jul 13 '22 at 07:07
  • Ah, part of the problem is that I wasn't using `addprocs(2)`. However, without it, the version with @distributed is still faster than the one without. – lhcgeneva Jul 13 '22 at 07:34
  • 1
    your code was not collecting the results at all (note that the result was 0.0). Moreover, if you do @distributed without an aggregator you need to do `@sync`. Finally note your `a` array was just getting copied among processes - if you want to have a shared memory across processes you would need to use `SharedArray` instead. Code like yours would more or less work if you decided to use multithreading instead of multiprocessing. – Przemyslaw Szufel Jul 13 '22 at 11:28
  • I agree with all of the above, except the fact that the result was 0.0. If I run my originally posted code without previously running `addprocs(2)` it gets collected and is faster. That's what I don't get. – lhcgeneva Jul 13 '22 at 11:51
  • try `using Distributed;addprocs(2);function f();a=zeros(2);@sync @distributed for i=1:2;a[i]=i;end;a;end;f()` and see what you get – Przemyslaw Szufel Jul 13 '22 at 16:23
  • Same result regarding the output. If I do `addprocs(2)` I get 0.0 0.0. If I don't do `addprocs(2)` I get 1.0 2.0. However, this time `@time` gives reverse results (as you would normally expect): it's faster without `@sync @distributed`. Not sure why that's different from the original code. – lhcgeneva Jul 14 '22 at 07:37
  • `@distributed` takes a lot of time and is not good for millisecond jobs. – Przemyslaw Szufel Jul 14 '22 at 14:19
  • I understand that. But why is it faster then, if it runs 1 process with distributed, than running 1 process without distributed? If anything distributed should add a little time, even running with a single process, but not run faster. Anyway, answer was helpful... – lhcgeneva Jul 15 '22 at 08:23