
Consider a numpy array A of shape NxM. The goal is to compute the Euclidean distance matrix D, where each element D[i,j] is the Euclidean distance between rows i and j. What is the fastest way of doing this? This is not exactly the problem I need to solve, but it's a good example of what I'm trying to do (in general, other distance metrics could be used).

This is the fastest I could come up with so far:

n = A.shape[0]
D = np.empty((n,n))
for i in range(n):
    D[i] = np.sqrt(np.square(A-A[i]).sum(1))

But is it the fastest way? I'm mainly concerned about the for loop. Can we beat this with, say, Cython?

To avoid looping, I tried to use broadcasting, and do something like this:

D = np.sqrt(np.square(A[np.newaxis,:,:]-A[:,np.newaxis,:]).sum(2))

But it turned out to be a bad idea, because there's significant overhead in constructing an intermediate 3D array of shape NxNxM, so the performance is actually worse.
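One standard way to stay vectorized without materializing the NxNxM intermediate is to expand the squared distance as ||a - b||^2 = ||a||^2 - 2 a·b + ||b||^2, so that only NxN temporaries are ever built. A rough sketch (not benchmarked here, and it can be slightly less accurate numerically than the direct formula):

import numpy as np

# squared norm of each row, shape (N,)
sq = np.square(A).sum(1)
# pairwise squared distances via the expansion above, shape (N, N)
D2 = sq[:, np.newaxis] + sq[np.newaxis, :] - 2 * np.dot(A, A.T)
np.maximum(D2, 0, out=D2)   # clip tiny negatives caused by round-off
D = np.sqrt(D2)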

I also tried Cython, but I'm a Cython newbie, so I don't know how good my attempt is:

import numpy as np
cimport numpy as np

def dist(np.ndarray[np.int32_t, ndim=2] A):
    cdef int n = A.shape[0]
    cdef np.ndarray[np.float64_t, ndim=2] dm = np.empty((n, n), dtype=np.float64)
    cdef int i = 0
    for i in range(n):
        dm[i] = np.sqrt(np.square(A - A[i]).sum(1)).astype(np.float64)
    return dm

The above code was a bit slower than the plain Python for loop. I don't know much about Cython, but I assume I should be able to at least match the performance of the for loop + numpy. So I'm wondering: is a noticeable performance improvement possible when this is done the right way? Or is there some other way to speed this up (not involving parallel computation)?

ojy
  • How big are N and M? Doing an N-loop in Python instead of NumPy will of course slow you down, but it's not nearly as bad as doing an NxM-loop. Is it really too slow, or are you just optimizing for the hell of it? – abarnert Aug 08 '14 at 23:28
  • Also, for this, it might be easier to write a ufunc in Cython, then blast it over `A`, instead of putting the whole loop in Cython. Less to get wrong that way, if nothing else… – abarnert Aug 08 '14 at 23:31
  • [There's a SciPy method specifically for performing this task](http://docs.scipy.org/doc/scipy-0.13.0/reference/generated/scipy.spatial.distance.pdist.html), so that's probably a pretty fast option (a minimal example is sketched just after these comments). – user2357112 Aug 08 '14 at 23:31
  • @user2357112, yea, just tried scipy, it was pretty fast, thanks. But I still need to figure out how to implement this, because this is just an example of a more general problem that I have. – ojy Aug 08 '14 at 23:38
  • @abarnert, N and M can be pretty large, say N ~ 20k, M ~ 1k. – ojy Aug 08 '14 at 23:40
  • As for Cython, if you're using that, you probably want to do the math yourself rather than calling NumPy routines. NumPy vectorization doesn't help so much when you're already writing code that compiles to C. – user2357112 Aug 09 '14 at 00:07
  • possible duplicate of [Fastest pairwise distance metric in python](http://stackoverflow.com/questions/20277982/fastest-pairwise-distance-metric-in-python) – YXD Aug 09 '14 at 00:22
  • @Mr E, I think it's a different question. My question is a generalization of that one, in the sense that the pairwise distances in the question you mention are distances between scalars, while in my case they are between vectors. Answers from that question might be insightful though. – ojy Aug 09 '14 at 00:30
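For reference, the SciPy route pointed out in the comments looks roughly like this. It's a minimal sketch: `pdist` returns a condensed vector of the n*(n-1)/2 pairwise distances, and `squareform` expands it to the full matrix.

from scipy.spatial.distance import pdist, squareform

# condensed vector of all pairwise Euclidean distances between the rows of A
d = pdist(A, metric='euclidean')
D = squareform(d)   # full symmetric NxN distance matrix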

1 Answer


The key thing with Cython is to avoid using Python objects and function calls as much as possible, including vectorized operations on numpy arrays. This usually means writing out all of the loops by hand and operating on single array elements at a time.

There's a very useful tutorial here that covers the process of converting numpy code to Cython and optimizing it.

Here's a quick stab at a more optimized Cython version of your distance function:

import numpy as np
cimport numpy as np
cimport cython

# don't use np.sqrt - the sqrt function from the C standard library is much
# faster
from libc.math cimport sqrt

# disable checks that ensure that array indices don't go out of bounds. this is
# faster, but you'll get a segfault if you mess up your indexing.
@cython.boundscheck(False)
# this disables 'wraparound' indexing from the end of the array using negative
# indices.
@cython.wraparound(False)
def dist(double [:, :] A):

    # declare C types for as many of our variables as possible. note that we
    # don't necessarily need to assign a value to them at declaration time.
    cdef:
        # Py_ssize_t is just a special platform-specific type for indices
        Py_ssize_t nrow = A.shape[0]
        Py_ssize_t ncol = A.shape[1]
        Py_ssize_t ii, jj, kk

        # this line is particularly expensive, since creating a numpy array
        # involves unavoidable Python API overhead
        np.ndarray[np.float64_t, ndim=2] D = np.zeros((nrow, nrow), np.double)

        double tmpss, diff

    # another advantage of using Cython rather than broadcasting is that we can
    # exploit the symmetry of D by only looping over its upper triangle
    for ii in range(nrow):
        for jj in range(ii + 1, nrow):
            # we use tmpss to accumulate the SSD over each pair of rows
            tmpss = 0
            for kk in range(ncol):
                diff = A[ii, kk] - A[jj, kk]
                tmpss += diff * diff
            tmpss = sqrt(tmpss)
            D[ii, jj] = tmpss
            D[jj, ii] = tmpss  # because D is symmetric

    return D

I saved this in a file called fastdist.pyx. We can use pyximport to simplify the build process:

import pyximport
pyximport.install()
import fastdist
import numpy as np

A = np.random.randn(100, 200)

D1 = np.sqrt(np.square(A[np.newaxis,:,:]-A[:,np.newaxis,:]).sum(2))
D2 = fastdist.dist(A)

print np.allclose(D1, D2)
# True

So it works, at least. Let's do some benchmarking using the %timeit magic:

%timeit np.sqrt(np.square(A[np.newaxis,:,:]-A[:,np.newaxis,:]).sum(2))
# 100 loops, best of 3: 10.6 ms per loop

%timeit fastdist.dist(A)
# 100 loops, best of 3: 1.21 ms per loop

A ~9x speed-up is nice, but not really a game-changer. As you said, though, the big problem with the broadcasting approach is the memory requirements of constructing the intermediate array.

A2 = np.random.randn(1000, 2000)
%timeit fastdist.dist(A2)
# 1 loops, best of 3: 1.36 s per loop

I wouldn't recommend trying that with broadcasting - the NxNxM intermediate alone would be 1000 x 1000 x 2000 doubles, i.e. roughly 16 GB of memory.

Another thing we could do is parallelize this over the outermost loop, using the prange function:

from cython.parallel cimport prange

...

for ii in prange(nrow, nogil=True, schedule='guided'):
...
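The snippet above only shows the lines that change; a full parallel version might look something like the following (added to the same fastdist.pyx, and named dist_parallel to match the benchmark below). The assumptions here: the loop body is the same as in dist, and the output is stored in a typed memoryview so that nothing inside the nogil block touches the Python API.

@cython.boundscheck(False)
@cython.wraparound(False)
def dist_parallel(double [:, :] A):

    cdef:
        Py_ssize_t nrow = A.shape[0]
        Py_ssize_t ncol = A.shape[1]
        Py_ssize_t ii, jj, kk
        # typed memoryview output, so the nogil loop can write to it directly
        double [:, :] D = np.zeros((nrow, nrow), np.double)
        double tmpss, diff

    # the outer loop is distributed over OpenMP threads
    for ii in prange(nrow, nogil=True, schedule='guided'):
        for jj in range(ii + 1, nrow):
            tmpss = 0
            for kk in range(ncol):
                diff = A[ii, kk] - A[jj, kk]
                # plain assignment rather than += so Cython treats tmpss as a
                # thread-private variable instead of a reduction variable
                tmpss = tmpss + diff * diff
            D[ii, jj] = sqrt(tmpss)
            D[jj, ii] = D[ii, jj]

    return np.asarray(D)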

In order to compile the parallel version you'll need to tell the compiler to enable OpenMP. I haven't figured out how to do this using pyximport, but if you're using gcc you could compile it manually like this:

$ cython fastdist.pyx
$ gcc -shared -pthread -fPIC -fwrapv -fopenmp -O3 \
   -Wall -fno-strict-aliasing  -I/usr/include/python2.7 -o fastdist.so fastdist.c
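For what it's worth, pyximport can reportedly pick up extra build flags from a `.pyxbld` file placed next to the `.pyx`; an untested sketch (assuming gcc), saved as fastdist.pyxbld:

# fastdist.pyxbld - pyximport looks for this file automatically
def make_ext(modname, pyxfilename):
    from distutils.extension import Extension
    return Extension(name=modname,
                     sources=[pyxfilename],
                     extra_compile_args=['-fopenmp'],
                     extra_link_args=['-fopenmp'])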

With parallelism, using 8 threads:

%timeit D2 = fastdist.dist_parallel(A2)
1 loops, best of 3: 509 ms per loop
ali_m
  • Thanks a lot! This looks really promising! – ojy Aug 09 '14 at 08:14
  • @ojy Glad you found it helpful. I just realized my initial version was rather inefficient, since it looped over every element in `D` rather than just the upper triangle. The updated single-threaded version is about twice as fast again. – ali_m Aug 09 '14 at 13:38
  • Yea, I noticed that, but I haven't had a chance to try it out yet, can't wait till Monday :) Thanks a bunch! – ojy Aug 09 '14 at 19:48
  • Tried it eventually! Worked great! I looked a little more into how to further improve the parallelization, because it was only giving about a 3x speedup, even on the 24 cores that I have access to. Found an answer by Saullo Castro to this question http://stackoverflow.com/questions/19002486/cython-shared-memory-in-cython-parallel-prange-block very useful. The idea is to have a separate routine that is called in parallel, with only pointers to the data array passed in. It gave me an additional 5x speedup. – ojy Aug 12 '14 at 01:25
  • @ojy I'm slightly surprised that it made that big of a performance difference for you, although I suppose that may depend on a bunch of other factors including your compiler. One other thing that sometimes helps is declaring your distance function to be `inline` (e.g. `cdef inline void mydist(...) nogil:`), which gives the C compiler an extra hint to optimise that function (usually by substituting the code of the function into its caller). You might also try varying the number of OpenMP threads you're using - 24 is probably excessive unless `A` is very large. – ali_m Aug 12 '14 at 19:32
  • After your comment, I couldn't reproduce this problem anymore. But then I figured out what's happening. When N >> M (e.g. N = 10000, M = 5), the line `D[jj, ii] = tmpss` seems to be causing the problem. When I commented it out, I got a 3x speedup. Probably the slowdown comes from trying to write elements in the same row of D from two threads, and maybe that happens more often when N >> M. I re-declared D to be a 1D memoryview `double[::1] D = np.zeros((nrow * nrow), dtype = np.double)`, and it seemed to fix the problem. – ojy Aug 15 '14 at 21:18
  • You're never writing to the same element in `D` from multiple threads - each thread is filling a non-overlapping L-shaped chunk starting in the top left corner. I expect the main problem is poor spatial locality of reference. Outputting to a 1D vector ought to be much better in this regard, since it's a contiguous block of memory. You could initialize `D` to the length of the strict upper triangle (`nrow * (nrow - 1) / 2`, which is the condensed form `squareform` expects) and fill the elements sequentially. `scipy.spatial.distance.squareform()` can be used to efficiently convert between the upper triangle and the full symmetric matrix (a short sketch follows these comments). – ali_m Aug 15 '14 at 21:59
  • I would expect performance to be worse if `N >> M` - trivially the output matrix is larger, and, since each individual row completes faster, the overhead from thread creation/termination likely consumes a much larger proportion of total CPU time. If you have more queries about performance I'm more than happy to try and help, but it's probably a good idea to ask a separate question so that the comments here don't get too cluttered. – ali_m Aug 15 '14 at 22:00
  • Thanks for your help! I understand that the same element is never accessed from different threads, but I still suspect that it might have something to do with accessing the same rows. It doesn't seem that your second comment is the case - I made 2 versions of the same function, but in one I commented out `D[ii, jj] = tmpss`, and in the other I commented out `D[jj, ii] = tmpss`, and there was a consistent 2x difference in performance. It's a good idea to create another question, as I really want to understand what's going on there :) I'll give you a link once I do it. Probably next week though. – ojy Aug 15 '14 at 22:21
  • Sorry, I wasn't clear enough - the difference between `D[ii, jj]` and `D[jj, ii]` is almost certainly due to locality of reference - numpy arrays are row-major by default, so `D[ii, jj]` should be faster (but 1D will be faster still). – ali_m Aug 15 '14 at 22:25
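A toy sketch of the condensed-storage idea from the comments above (the distance values here are stand-ins; the point is just the layout and the `squareform` round-trip):

import numpy as np
from scipy.spatial.distance import squareform

nrow = 5
# strict upper triangle only: nrow * (nrow - 1) / 2 entries, filled row by row
d = np.zeros(nrow * (nrow - 1) // 2)
idx = 0
for ii in range(nrow):
    for jj in range(ii + 1, nrow):
        d[idx] = abs(ii - jj)   # stand-in for the real pairwise distance
        idx += 1

D = squareform(d)               # full symmetric nrow x nrow matrix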