
My problem is roughly as follows. Given a numerical matrix X, where each row is an item, I want to find each row's nearest neighbor in terms of L2 distance among all the other rows. I tried reading the official documentation but am still a little confused about how to achieve this. Could someone give me a hint?

My code is as follows:

function l2_dist(v1, v2)
    return sqrt(sum((v1 - v2) .^ 2))
end

function main(Mat, dist_fun)
    n = size(Mat, 1)

    # Shared output arrays: nearest-neighbor distance and index for each row
    Dist = SharedArray{Float64}(n) #[Inf for i in 1:n]
    Id = SharedArray{Int64}(n) #[-1 for i in 1:n]

    # Initialize the shared arrays (they start out uninitialized)
    @parallel for i = 1:n
        Dist[i] = Inf
        Id[i] = 0
    end

    # For each row i, scan every other row j and keep the closest one
    Threads.@threads for i in 1:n
        for j in 1:n
            if i != j
                println(i, j)
                dist_temp = dist_fun(Mat[i, :], Mat[j, :])
                if dist_temp < Dist[i]
                    println("Dist updated!")
                    Dist[i] = dist_temp
                    Id[i] = j
                end
            end
        end
    end
    return Dict("Dist" => Dist, "Id" => Id)
end

n = 4000
p = 30

X = [rand() for i in 1:n, j in 1:p];


main(X[1:30, :], l2_dist)  # warm-up call so the timing below excludes compilation
@time N = main(X, l2_dist)

I'm trying to distribute all the i's (i.e. the per-row minimum computations) over different cores. But the version above apparently isn't working correctly, and it is even slower than the sequential version. Can someone point me in the right direction? Thanks.

user2804929
    I would recommend getting that dictionary out of the inner loop before optimizing in any other way. Dictionaries are hazardous to runtime. If you really want a dictionary, make the arrays without the dictionary, and then add them into a dictionary after looping. That could help quite a bit. – Chris Rackauckas Jun 29 '17 at 15:39
  • @ChrisRackauckas Thanks. I suppose I can just create the two arrays and only make them a dictionary at the last step. Do you have any hint about the parallelizing part? – user2804929 Jun 29 '17 at 16:15
  • "Do you have any hint about the parallelizing part?", other than to multithread it with `Threads.@threads` and then make it threadsafe by having different arrays per thread and merging afterwards? No, it should be pretty standard. If I have time I could write that up. – Chris Rackauckas Jun 29 '17 at 16:23
  • @ChrisRackauckas Thanks a lot. I just tried revising the code as you suggested about the dictionary(updated in the question now). Just doing this gave me a 20x speedup. Amazing! – user2804929 Jun 29 '17 at 16:28
  • Is it intended to have `dist_fun(Mat[i, ], Mat[j, ])` instead of `dist_fun(Mat[i,:], Mat[j,:])`? The former is much faster, but the latter actually gives correct answers. Adding `@views` before the expression gives another speed/memory improvement. – Dan Getz Jun 30 '17 at 10:57
  • @DanGetz thanks for your comment. Thanks for pointing out the bug. Was this what you meant by adding `@views` before the expression? `dist_temp = dist_fun(@views Mat[i, :], @views Mat[j, :])`. Thanks again. – user2804929 Jun 30 '17 at 19:16
  • @user2804929 It is enough to have one `@views` i.e. `dist_temp = @views dist_fun(Mat[i, :], Mat[j, :])`. Without the bug, the timings should also change, so if you update the question with some information about the new timings it would help. What is the intended distance function? (is it symmetric?) – Dan Getz Jun 30 '17 at 19:25
  • @DanGetz I had the L2 distance above. That'll work for the purpose of this question. Also, I got an undefined variable error when I modified my code as you suggested "@views not defined". I was using julia 0.5.2. What might the problem be? – user2804929 Jun 30 '17 at 19:33
  • @user2804929 the @views macro is actually from 0.6 but Compat.jl has it too. So try `using Compat` first. Or use the `view` function directly. – Dan Getz Jul 01 '17 at 04:29
  • @DanGetz This is a bit weird. So I tried using Compat and kept everything except changing that line to `dist_temp = @views dist_fun(Mat[i, :], Mat[j, :])`. It actually takes longer now. With the original code, with n = 10,000. I got `65.703088 seconds (589.72 M allocations: 127.985 GB, 18.79% gc time)`. With the revised version, it is `144.002589 seconds (1.18 G allocations: 83.134 GB, 15.84% gc time)` – user2804929 Jul 01 '17 at 04:44
  • @user2804929 Maybe the macros somehow add overhead in 0.5 (time to move to 0.6). In any case, try the non-macro version for the line: `dist_temp = dist_fun(view(Mat,i,:), view(Mat,j,:))` – Dan Getz Jul 01 '17 at 06:00
  • Hi @ChrisRackauckas, I tried writing up the parallelized version (see updated code above). Yet I think I made some pretty foolish mistakes in there. For some reason, the code will only return the correct result on the second run. Also, this version doesn't provide any speedup compared to the sequential version. Can you see the mistake I am making? What should I be doing differently? Thanks! – user2804929 Jul 04 '17 at 02:56

1 Answer


Maybe you're doing something beyond what you have written down, but from what I can see you aren't actually doing any computations in parallel. Julia requires you to tell it how many processors (or threads) it should have access to. You can do this in one of the following ways:

  • Start Julia with multiple processors: `julia -p #` (where `#` is the number of processors you want Julia to have access to).
  • Once you have started a Julia session, you can call the `addprocs` function to add additional worker processes (see the short sketch after this list).
  • To have more than one thread, you need to set the environment variable before starting Julia, e.g. `export JULIA_NUM_THREADS=#`. I don't know very much about threading, so I will be sticking with the `@parallel` macro. I suggest reading the documentation for more details on threading -- maybe @Chris Rackauckas could expand a little more on the difference.
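
Here is a minimal sketch of checking and changing those counts from a Julia 0.6 session (the worker count 4 is arbitrary):

# Add 4 worker processes from within a session (equivalent to starting with `julia -p 4`)
addprocs(4)
nprocs()              # total number of processes: 1 master + 4 workers = 5
nworkers()            # number of worker processes: 4

# The thread count must be set *before* Julia starts, e.g. `export JULIA_NUM_THREADS=4`
# in the shell; from inside Julia you can only query it:
Threads.nthreads()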

A few comments on my code and on yours:

  • I'm on version 0.6.1-pre.0. I don't think I'm doing anything 0.6 specific, but this is a heads up just in case.
  • I'm going to use the Distances.jl package when computing the distances between vectors. I think it is a good habit to farm out as many of my computations as possible to well-written and well-maintained packages (see the short sketch after this list).
  • Rather than compute the distance between rows, I'm going to compute the distance between columns. Julia is a column-major language, so this increases the number of cache hits and gives a little extra speed. You can get the row-wise results you want by just transposing the input.
  • Unless you expect that many memory allocations, a large allocation count is a sign that something in your code is inefficient. It is often a type-stability problem. I don't know whether that was the case in your code before, but it doesn't seem to be an issue in the current version (it wasn't immediately clear to me why you were having so many allocations).
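
As a quick illustration of the Distances.jl calls I use below (a minimal sketch, assuming the Distances.jl version contemporary with Julia 0.6):

using Distances

x = rand(30)
y = rand(30)

# evaluate(metric, x, y) computes the distance between two vectors...
d1 = evaluate(Euclidean(), x, y)

# ...and matches the hand-rolled L2 distance up to floating-point rounding
d2 = sqrt(sum((x - y) .^ 2))
isapprox(d1, d2)              # true

# pairwise(metric, A) computes distances between all pairs of *columns* of A;
# Distances.jl generally treats the columns of a matrix as the observations
A = rand(30, 5)
D = pairwise(Euclidean(), A)  # 5x5 symmetric matrix of column-to-column distances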

The full code is below:

# Make sure all processors have access to Distances package
@everywhere using Distances

# Create a random matrix
nrow = 30
ncol = 4000

# Seed creation of random matrix so it is always same matrix
srand(42)
X = rand(nrow, ncol)

function main(X::AbstractMatrix{Float64}, M::Distances.Metric)
    # Get size of the matrix
    nrow, ncol = size(X)

    # Create `SharedArray` to store output
    ind_vec = SharedArray{Int}(ncol)
    dist_vec = SharedArray{Float64}(ncol)

    # Compute the distance between columns
    @sync @parallel for i in 1:ncol
        # Initialize various temporary variables
        min_dist_i = Inf
        min_ind_i = -1
        X_i = view(X, :, i)

        # Check distance against all other columns
        for j in 1:ncol
            # Skip comparison with itself
            if i==j
                continue
            end

            # Tell us who is doing the work 
            # (can uncomment if you want to verify stuff)
            # println("Column $i compared with Column $j by worker $(myid())")

            # Evaluate the new distance...
            # If it is less than the current minimum, replace it; otherwise proceed
            dist_temp = evaluate(M, X_i, view(X, :, j))
            if dist_temp < min_dist_i
                min_dist_i = dist_temp
                min_ind_i = j
            end
        end

        # Which column is minimum distance from column i
        dist_vec[i] = min_dist_i
        ind_vec[i] = min_ind_i
    end

    return dist_vec, ind_vec
end

# Using Euclidean metric
metric = Euclidean()
dist, inds = main(X, metric)

@time main(X, metric);
@show dist[[1, 5, 25]], inds[[1, 5, 25]]

You can run the code with

  • 1 processor: `julia testfile.jl`

    % julia testfile.jl
      0.640365 seconds (16.00 M allocations: 732.495 MiB, 3.70% gc time)
      (dist[[1, 5, 25]], inds[[1, 5, 25]]) = ([1.40892, 1.38206, 1.32184], [2541, 2459, 1602])
    
  • n processors (in this case 4): `julia -p n testfile.jl`

    % julia -p 4 testfile.jl
      0.201523 seconds (2.10 k allocations: 99.107 KiB)
      (dist[[1, 5, 25]], inds[[1, 5, 25]]) = ([1.40892, 1.38206, 1.32184], [2541, 2459, 1602])
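
If you do want to try threads instead of processes, here is a hedged sketch of what a multithreaded variant might look like (a hypothetical `main_threads`, not something I have benchmarked here; it assumes Julia was started with e.g. `export JULIA_NUM_THREADS=4`). Plain `Array`s are fine in this case because each iteration `i` writes only to its own slots:

using Distances

function main_threads(X::AbstractMatrix{Float64}, M::Distances.Metric)
    nrow, ncol = size(X)
    dist_vec = fill(Inf, ncol)   # nearest distance for each column
    ind_vec = fill(-1, ncol)     # index of the nearest column

    Threads.@threads for i in 1:ncol
        X_i = view(X, :, i)
        for j in 1:ncol
            i == j && continue
            d = evaluate(M, X_i, view(X, :, j))
            if d < dist_vec[i]
                dist_vec[i] = d
                ind_vec[i] = j
            end
        end
    end

    return dist_vec, ind_vec
end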
    
cc7768