Why doesn't Eigen support OpenMP when it comes to coefficient-wise operations? Is it some sort of a blunder by the developer or are there some specific reasons to avoid multiprocessing for specific operations?
Using multiple threads by default would not be a good idea as users can already use multiple threads in their applications and having nested parallel loops is clearly not efficient. Moreover, sharing the work for each operation can introduce a significant overhead that is not great for basic operations small/medium-sized arrays. Eigen is meant to be fast for both small and big arrays. Using OpenMP on top of Eigen is better in practice. This is especially true one Numa systems due to the first touch policy: hiding the multithreading can introduce surprising overheads due to the remote accesses or page migrations.
For complex operations like LU decomposition, this is not reasonable to ask to the user to parallelise the operation as it would need to rewrite most of the algorithm. This is why Eigen try to parallelize such algorithms.
Can I manually include OpenMP support for such operations? The code for Eigen seems complicated so it is hard to find the exact implementation of a particular function, even through the use of Visual Studio instruments.
Element-wise operation are trivial to implement in parallel in OpenMP. You can simply use a #pragma omp parallel for
directive (possibly combined with a collapse
clause) on a loops iterating for 0 to array.rows()
(or array.cols()
) itself using basic Eigen array indexing operators.
Note that while using multiple threads for basic operations seems interesting at first glance, in practice, it is often disappointing in term of speed up on most machines. Indeed, reading/writing data in the RAM is very expensive compare to applying arithmetic operations (eg. add/sub/mul) on each items as long as the operation is already vectorized using SIMD instruction. For example, on my machine 1 core can reach a throughput of 20-25 GiB/s while the maximum throughput is 40 GiB/s with 6 cores (ie. speed-up less than 2x with 6x threads).