3

I am trying to compare speed and performance between Matlab and Julia. I am looking at code that does topology optimization of a continuum structure subjected to a given load. The code in question is the public top88.m code: https://www.topopt.mek.dtu.dk/Apps-and-software/Efficient-topology-optimization-in-MATLAB

Essentially it is an iterative algorithm where in every iteration an Ax = b system is solved (x = A\b), where A is the finite element stiffness matrix; it depends on the structural design and is updated in every iteration.

In Julia the same code runs slower than in Matlab. I have done some code optimization in Julia: declaring types in function definitions, wrapping code in functions as much as possible, avoiding global variables, and implementing other tips I found on the internet. But Julia is still slower than the same Matlab code (same in the sense of conceptual steps).

My question: since Matlab's system solve "\" is multi-threaded by default, is the same true for Julia? If not, how can I multi-thread Julia's \ operator, or get similar speed-ups from parallelization?
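
For concreteness, the inner loop I am timing boils down to this pattern (a simplified sketch, not the actual top88 code; there A comes from the finite element assembly and is much larger):

using SparseArrays, LinearAlgebra

n = 10_000
A = spdiagm(-1 => fill(-1.0, n-1), 0 => fill(2.0, n), 1 => fill(-1.0, n-1))  # stand-in for the stiffness matrix
b = rand(n)

for it in 1:50          # design iterations
    x = A \ b           # the solve I would like to be multi-threaded
    # ... update the design and reassemble A here ...
end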

shuli
  • 31
  • 3
  • 3
    Don't have a definite answer for you, but afaik just doing simple matrix algebra means you're most likely just benchmarking the underlying BLAS that gets called - MATLAB uses MKL I think and Julia OpenBLAS, which can be slower for many problems. You can try MKL.jl to see if this accounts for the difference. On an unrelated note, type annotations on functions do not matter for performance in Julia, they are for method dispatch. Optimized machine code is automatically generated for any concrete type once the function gets called. – Nils Gudat Sep 27 '20 at 18:28

2 Answers

3

By default Julia uses OpenBLAS for dense linear algebra, and you can configure it to run multi-threaded. This requires running Julia in a multi-threaded setting and configuring the number of BLAS threads with BLAS.set_num_threads().

Here is how:

Before starting Julia:

set JULIA_NUM_THREADS=4

or on Linux

export JULIA_NUM_THREADS=4
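
Once Julia has started, you can verify that the variable was picked up (Threads.nthreads() reports the number of Julia threads):

julia> Threads.nthreads()   # should report 4 here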

Now let's test on a single thread:

julia> using BenchmarkTools, Random

julia> const b = rand(50);

julia> const A = rand(50,50);

julia> @btime A \ b;
  424.899 μs (4 allocations: 20.61 KiB)

Now the multi-threaded run:

julia> using LinearAlgebra

julia> BLAS.set_num_threads(4)

julia> @btime A \ b;
  175.701 μs (4 allocations: 20.61 KiB)

You can see that on my machine the speedup is over 2x.
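
If you want to see how the timing scales with the BLAS thread count on your machine, you can sweep it with @belapsed from BenchmarkTools (a small sketch, continuing the session above; timings will of course differ per machine):

julia> for n in 1:4
           BLAS.set_num_threads(n)
           t = @belapsed $A \ $b
           println(n, " BLAS thread(s): ", round(t * 1e6, digits = 1), " μs")
       end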

Yet another option for speedup would be to switch Julia to MKL.

julia> using Pkg
julia> pkg"add https://github.com/JuliaComputing/MKL.jl"
julia> pkg"build MKL"
# restart Julia
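
After the restart you can check which BLAS is active; on Julia 1.5 this should report :mkl instead of :openblas64 if the build succeeded:

julia> using LinearAlgebra

julia> BLAS.vendor()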

I have not been using MKL.jl, so if you give it a try, please write back with how it benchmarks.

Przemyslaw Szufel
  • 40,002
  • 3
  • 32
  • 62
  • `julia> using BenchmarkTools, Random; b = rand(50); A = rand(50,50); @btime A\b` gives 341.100 μs (4 allocations: 20.61 KiB); after `using LinearAlgebra` and `BLAS.set_num_threads(1)` it gives 16.301 μs; with `BLAS.set_num_threads(2)` 47.300 μs; with `BLAS.set_num_threads(3)` 77.899 μs; with `BLAS.set_num_threads(4)` 140.001 μs (all 4 allocations: 20.61 KiB). – shuli Sep 28 '20 at 08:43
  • To me it gives the opposite trend in the test: Increasing the number of threads increases the computational time... By the way, in my problem, the matrix A is sparse. – shuli Sep 28 '20 at 08:44
  • I have tried it also in my reference problem mentioned in the first post, and by adding `set JULIA_NUM_THREADS=4` outside Julia and `using LinearAlgebra; BLAS.set_num_threads(1)` inside Julia I get a reduction in computational time of 3 secs from the original code. Do you know what is happening!? :-) – shuli Sep 28 '20 at 08:48
  • That is why you should always provide an MWE. If your matrix is sparse you need to use `Pardiso.jl` - usage is straightforward. – Przemyslaw Szufel Sep 28 '20 at 08:54
  • Thank you @Przemyslaw . I have tried Pardiso by creating a `ps = MKLPardisoSolver()` object. However, performance is still slow, even if the number of processes is changed with `set_nprocs!(ps, 2)`. The best performance I achieved was with `set JULIA_NUM_THREADS=1` outside Julia, and in the script `using LinearAlgebra; BLAS.set_num_threads(1)`. The difference in time is around 3 sec, for 300x100 finite element meshes that result in a roughly 60000x60000 sparse A matrix. Still wondering why... – shuli Sep 28 '20 at 10:22
0

The following is the current best solution that I found and that gives the best performance for me. Outside Julia, in a terminal on Windows, type `set JULIA_NUM_THREADS=4`. Then start Julia and use the LinearAlgebra module this way: `using LinearAlgebra; BLAS.set_num_threads(1)`

In this way there will be 4 Julia threads, and by setting the number of BLAS threads to 1, BLAS will basically not use its own thread pool, which seems to conflict with Julia's thread pool, or at least slows performance for me. I got a hint in this regard here: https://discourse.julialang.org/t/julia-threads-vs-blas-threads/8914/15 My current BLAS vendor in Julia 1.5.1, as reported by `BLAS.vendor()`, is `:openblas64`.

Using Pardiso.jl with the default MKL sparse solver didn't help. I still need to figure out whether there is a way to increase performance there too, and how.
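
For reference, the Pardiso.jl attempt looked roughly like this (just a sketch of what I described in the comments above; A is the sparse stiffness matrix and b the load vector of the current iteration):

julia> using Pardiso

julia> ps = MKLPardisoSolver();

julia> set_nprocs!(ps, 2)       # number of MKL Pardiso threads

julia> x = solve(ps, A, b);     # same solution as A \ b, via MKL's sparse direct solver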

shuli
  • 31
  • 3