
I am trying to parallelize some code used in the Gauss-Seidel algorithm to approximate the solution of a system of linear equations.

In brief, for an NxN matrix, each iteration consists of sqrt(N) sessions of parallel computation, executed one after another. During one session, I distribute the task of calculating sqrt(N) values of the solution vector among the available workers.

The code involved in a parallel computation session is this:

% One future per worker; each future computes a contiguous slice of x.
future_results(1:num_workers) = parallel.FevalFuture;
for i = 1:num_workers
    start_itv = buck_bound+1 + (i - 1) * worker_length;
    end_itv = min(buck_bound+1 + i * worker_length - 1, ends_of_buckets(current_bucket));
    future_results(i) = parfeval(p, @hybrid_parallel_function, 3, A, b, x, x_last, buck_bound, n, start_itv, end_itv);
end

% Collect the results and write each slice back into x.
for i = 1:num_workers
    [~, arr, start_itv, end_itv] = fetchNext(future_results(i));
    x(start_itv:end_itv) = arr;
end

The function called by parfeval is this:

function [x_par, start_itv, end_itv] = hybrid_parallel_function (A, b, x, x_last, buck_bound, n, start_itv, end_itv)
    % Update rows start_itv..end_itv: columns up to buck_bound use the
    % already-updated values in x, the remaining columns use x_last.
    x_par = zeros(end_itv - start_itv + 1, 1);
    for i = start_itv:end_itv
        x_par(i-start_itv+1) = b(i);
        x_par(i-start_itv+1) = x_par(i-start_itv+1) - A(i, 1:buck_bound) * x(1:buck_bound);
        x_par(i-start_itv+1) = x_par(i-start_itv+1) - A(i, buck_bound+1:i-1) * x_last(buck_bound+1:i-1);
        x_par(i-start_itv+1) = x_par(i-start_itv+1) - A(i, i+1:n) * x_last(i+1:n);
        x_par(i-start_itv+1) = x_par(i-start_itv+1) / A(i, i);
    end
end

The entire code can be found here: https://pastebin.com/hRQ5Ugqz

The MATLAB profiler results for a 1000x1000 matrix: the parallel code is between 20 and 135 times slower than its serial counterpart, depending on the chosen coefficient matrix (and still much faster than spmd).

The parfeval computation might be lazily split between lines 50 and 57 of the linked code? Still, I cannot explain this major overhead to myself. It seems to be related to the number of times parfeval is called: I was able to lower the execution time by reducing the number of parfeval calls.

Is there something that can be further optimized? Do I have to resort to writing the code in C++?

Please help. Thank you very much!

catalyst

  • How did you configure your parallel pool? By default MATLAB uses multiple MATLAB sessions, and communicates data between them. This is designed for running on compute clusters, where each node in the cluster runs one session. You can configure it to use thread-based parallelism, though. In any case, fewer large jobs parallelize better than many small ones. – Cris Luengo Apr 11 '21 at 19:29
  • You might be right, I will try parpool('threads') as soon as I update to R2020a, thank you. – catalyst Apr 12 '21 at 15:00

1 Answer


There are a few possibilities here. Most important is the simple fact that if you're using the 'local' cluster type, then the workers run single-threaded. In situations where the "serial" code is actually taking advantage of MATLAB's intrinsic multi-threading, you're already using the available CPU hardware fully, and parallel workers cannot gain you anything. It's not certain that this is the case for you, but I'd strongly suspect it given the code.
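
One quick way to see the difference is to compare the number of computational threads on the client with that on a worker. This is just a diagnostic sketch; it assumes a process-based pool is already open and reachable via gcp:

nClient = maxNumCompThreads          % on the client: typically the number of physical cores
f = parfeval(gcp, @maxNumCompThreads, 1);
nWorker = fetchOutputs(f)            % on a 'local' pool worker: typically 1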

There are overheads to running in parallel, and as you've observed, making fewer parfeval calls lowers those overheads. Your code as written copies the whole of the A matrix to each worker multiple times. You don't need to change A, so you could use parallel.pool.Constant to avoid those repeated copies.
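
As a rough sketch of how that might look with your variable names (the anonymous-function wrapper here is illustrative, not part of your original code):

A_const = parallel.pool.Constant(A);   % transferred to each worker once, not per parfeval call
b_const = parallel.pool.Constant(b);
for i = 1:num_workers
    start_itv = buck_bound+1 + (i - 1) * worker_length;
    end_itv = min(buck_bound+1 + i * worker_length - 1, ends_of_buckets(current_bucket));
    future_results(i) = parfeval(p, ...
        @(xs, xl, s, e) hybrid_parallel_function(A_const.Value, b_const.Value, xs, xl, buck_bound, n, s, e), ...
        3, x, x_last, start_itv, end_itv);
end

The Constant's Value is dereferenced on the worker, so only the cheap Constant handle travels with each parfeval call.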

While parfeval is more flexible, it tends to be less efficient than parfor in cases where parfor can be applied.
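
For illustration, one of your "sessions" might instead be written as a parfor loop over an offset. This is only a sketch, not tested against your full code; first and last stand in for the session's index range:

len = last - first + 1;
x_new = zeros(len, 1);
parfor k = 1:len
    i = first + k - 1;
    xi = b(i) - A(i, 1:buck_bound) * x(1:buck_bound) ...
              - A(i, buck_bound+1:i-1) * x_last(buck_bound+1:i-1) ...
              - A(i, i+1:n) * x_last(i+1:n);
    x_new(k) = xi / A(i, i);   % x_new is a sliced output variable
end
x(first:last) = x_new;

Here parfor handles splitting the range across workers and collecting x_new for you, rather than issuing one parfeval call per worker.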

Yes, you can expect the workers to start working as soon as the first parfeval call has completed.

(Sorry, this isn't really a proper "answer", so some kind soul will probably come along and delete this soon, but there's much too much to fit into a comment).

Edric
  • Hi, thank you for answering! Hopefully I did write down parallel.pool.Constant correctly (https://pastebin.com/F5DX7KAR ). Unfortunately, the time hasn't improved at all. As you and Cris Luengo said, I might have to try parpool('threads'), which is only available starting from R2020a, so I have to wait a while to update. In the meantime, I did look at std::thread, and I think I might have a good chance with writing it in C++, even if it means botching the matrix multiplication part. – catalyst Apr 12 '21 at 14:58
  • @catalyst: If you're going to do a C++ implementation, don't use `std::thread`, use any of the higher-level abstractions. I prefer OpenMP, but you can also look at Intel Threading Building Blocks or any number of libraries that simplify the problem of parallelizing algorithms. In OpenMP it's as simple as putting `#pragma omp parallel for` in front of your for loop. – Cris Luengo Apr 12 '21 at 16:01