
I'm doing some calculations in Julia and noticed they run significantly slower (about 25 times!) than the numpy counterpart.

Then I realized Julia is only using 8 threads out of the 96 CPU threads (48 physical cores) on my PC, while numpy seems to have no problem utilizing well over 70 threads.

Running Julia with $ julia --threads 96 does not make any difference, even though julia> Threads.nthreads() returns 96.

A further disappointment is that I suspect Julia, even using all 96 threads, still might not be able to match up with numpy's speed.

Here is the Julia code. I simply measure the time with julia> @time calc(mat_a, mat_b), which gave me 90 seconds on average.

using Statistics
using LinearAlgebra

function calc(A::Array{Float32,2}, B::Array{Float32,2})
    # Each column is treated as one observation (dims=2)
    μ_A = mean(A, dims=2)
    μ_B = mean(B, dims=2)
    σ_A = cov(A, dims=2)
    σ_B = cov(B, dims=2)

    # Squared distance between the means
    ssdiff = sum((μ_A - μ_B).^2)
    # Matrix square root (LinearAlgebra's sqrt on a Matrix, analogous to scipy's sqrtm)
    covmean = sqrt(σ_A * σ_B)
    res = ssdiff + tr(σ_A .+ σ_B .- 2.0 * covmean)

    return res
end

Here is the numpy code, which takes about 3.5 seconds on average, measured with time.perf_counter().

import numpy as np

from numpy import cov
from numpy import trace
from scipy.linalg import sqrtm

def calc(A, B):
    mu_A = A.mean(axis=0)
    mu_B = B.mean(axis=0)
    sigma_A = cov(A, rowvar=False)
    sigma_B = cov(B, rowvar=False)

    ssdiff = np.sum((mu_A - mu_B) ** 2.0)
    covmean = sqrtm(sigma_A.dot(sigma_B))
    res = ssdiff + trace(sigma_A + sigma_B - 2.0 * covmean)

    return res

Any suggestion/explanation would be greatly appreciated!

Kim

1 Answer


Your Julia example code isn't actually using Julia's multithreading capabilities.

In Julia, if you want multithreading, you generally need to enable it explicitly. Functions like mean or cov will not automatically be multithreaded just by virtue of starting Julia with multiple threads; if you want these functions to multithread, you will need to either write multithreaded versions of them, or else use a package that has already done so.
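For instance, a hand-rolled threaded reduction might look like the following (a minimal sketch, assuming Julia was started with multiple threads; the name tsum is made up for illustration):

```julia
using Base.Threads

# Minimal sketch of an explicitly threaded sum: split the index range into
# one chunk per thread, reduce each chunk locally, then combine the partials.
function tsum(x)
    chunks = collect(Iterators.partition(eachindex(x), cld(length(x), nthreads())))
    partials = zeros(eltype(x), length(chunks))
    @threads for c in eachindex(chunks)
        s = zero(eltype(x))
        for i in chunks[c]
            s += x[i]
        end
        partials[c] = s
    end
    return sum(partials)
end
```

Nothing about starting Julia with --threads makes plain sum, mean, or cov do this on their own; the parallelism has to be written (or pulled in from a package) explicitly.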

The only thing in your Julia example code that will multithread at all as currently written is the linear algebra, because that falls back to BLAS (as numpy's likely does too), which has its own multithreading system entirely separate from Julia's (BLAS is an external native library, not Julia code). That is probably not using all 96 threads either, and in your case is almost certainly defaulting to 8. The number of threads used by BLAS in Julia can be checked with

using LinearAlgebra
BLAS.get_num_threads()

or analogously set with

BLAS.set_num_threads(n)
cbk
  • Thanks for the explanation! With ```julia> LinearAlgebra.BLAS.set_num_threads(96)``` set, the code now utilizes about half of the threads, or maybe a single CPU out of the dual-CPU system. But I observed the code using only a single thread outside the BLAS routine, and the time is down from 90 to 73 seconds. – Kim Jul 09 '21 at 07:37
  • Yeah, that makes sense. If you want to be able to efficiently multithread the whole thing in Julia, you might check out the [Folds](https://github.com/JuliaFolds) ecosystem, or the threaded version of [LoopVectorization](https://github.com/JuliaSIMD/LoopVectorization.jl) (or the plain old [Threads](https://docs.julialang.org/en/v1/manual/multi-threading/) interface) – cbk Jul 10 '21 at 01:07
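For concreteness, a sketch of what the Folds route mentioned above might look like (assuming the Folds.jl package is installed; to my understanding Folds.sum mirrors Base.sum but runs the reduction across threads):

```julia
using Folds  # assumes the Folds.jl package is installed

A = rand(Float32, 1000, 1000)
B = rand(Float32, 1000, 1000)

# Threaded sum of squared differences, analogous to the ssdiff term in the
# question's code; Folds.sum splits the reduction across Julia's threads.
ssdiff = Folds.sum(abs2, A .- B)
```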