3

I have a numpy array points of shape [N,2] which contains the (x,y) coordinates of N points. I'd like to compute the mean distance of every point to all other points using an existing function (which we'll call cmp_dist and which I just use as a black box).

First a verbose solution in "normal" python to illustrate what I want to do (written from the top of my head):

mean_dist = []
for i,(x0,y0) in enumerate(points):
    dist = [
    for j,(x1,y1) in enumerate(points):
        if i==j: continue
        dist.append(comp_dist(x0,y0,x1,y1))
    mean_dist.append(np.array(dist).mean())

I already found a "better" solution using list comprehensions (assuming list comprehensions are usually better) which seems to work just fine:

mean_dist = [np.array([cmp_dist(x0,y0,x1,y1) for j,(x1,y1) in enumerate(points) if not i==j]).mean()
                            for i,(x0,y0) in enumerate(points)]

However, I'm sure there's a much better solution for this in pure numpy, hopefully some function that allows to do an operation for every element using all other elements.

How can I write this code in pure numpy/scipy?

I tried to find something myself, but this is quite hard to google without knowing how such operations are called (my respective math classes are quite a while back).

Edit: Not a duplicate of Fastest pairwise distance metric in python

The author of that question has a 1D array r and is satisfied with what scipy.spatial.distance.pdist(r, 'cityblock') returns (an array containing the distances between all points). However, pdist returns a flat array, that is, is is not clear which of the distances belong to which point (see my answer).

(Although, as explained in that answer, pdist is what I was ultimately looking for, it doesnt solve the problem as I've specified it in the question.)

Community
  • 1
  • 1
flotzilla
  • 1,181
  • 1
  • 13
  • 23
  • Take a look at [`scipy.spatial.distance.pdist`](http://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.pdist.html) – ali_m Aug 19 '15 at 15:23
  • possible duplicate of [Fastest pairwise distance metric in python](http://stackoverflow.com/questions/20277982/fastest-pairwise-distance-metric-in-python) – ali_m Aug 19 '15 at 15:28

1 Answers1

1

Based on @ali_m's comment to the question ("Take a look at scipy.spatial.distance.pdist"), I found a "pure" numpy/scipy solution:

from scipy.spatial.distance import cdist
...
fct = lambda p0,p1: great_circle_distance(p0[0],p0[1],p1[0],p1[1])
mean_dist = np.sort(cdist(points,points,fct))[:,1:].mean(1)

definitely That's for sure an improvement over my list comprehension "solution".

What i don't really like about this, though, is that I have to sort and slice the array to remove the 0.0 values which are the result of computing the distance between identical points (so basically that's my way of removing the diagonal entries of the matrix I get back from cdist).

Note two things about the above solution:

  • I'm using cdist, not pdist as suggested by @ali_m.
  • I'm getting back an array of the same size as points, which contains the mean distance from every point to all other points, just as specified in the original question.

pdist unfortunately just returns an array that contains all these mean values in a flat array, that is, the mean values are unlinked from the points they are referring to, which is necessary for the problem as it I've described it in the original question.


However, since in the actual problem at hand I only need the mean over the means of all points (which I did not mention in the question), pdist serves me just fine:

from scipy.spatial.distance import pdist
...
fct = lambda p0,p1: great_circle_distance(p0[0],p0[1],p1[0],p1[1])
mean_dist_overall = pdist(points,fct).mean()

Though this would for sure be the definite answer if I had asked for the mean of the means, but I've purposely asked for the array of means for all points. Because I think there's still room for improvement in the above cdist solution, I won't accept this as THE answer.

flotzilla
  • 1,181
  • 1
  • 13
  • 23
  • You can use `scipy.spatial.distance.squareform` to convert the flat output of `pdist` to an N x N matrix of pairwise distances – ali_m Aug 19 '15 at 19:34
  • Having said that, passing a custom distance function to `pdist` will probably not be much faster than iterating over all pairs of points using standard Python `for` loops. You would probably see a *much* bigger performance improvement if you vectorized your `great_circle_distance` function to operate on multiple points at once. You haven't shown us the code for this, so I can't really make more specific recommendations. – ali_m Aug 19 '15 at 19:37
  • `squareform` sounds great, I'll try it out and edit the answer accordingly if it works. Regarding `great_circle_distance`: I already suspected that, but I'm also not really concerned about this at the moment. The runtime is not critical at that point, and if it will be at some point, I know where to begin. I just wanted to find an elegant numpy/scipy solution for this part of the code. (I've never looked at the code of `great_circle_distance`, as a matter of fact.) BTW: Thanks a lot for your suggestions! – flotzilla Aug 21 '15 at 07:51
  • I am having a similar issue. The calculation inside sklearn.metrics.silhouette_samples does this but it calculates the whole pairwise distance matrix before the average across points. This is bad on memory and I am trying to come up with the best way to recode it. Please see https://stackoverflow.com/questions/47702750/memoryerror-from-sklearn-metrics-silhouette-samples – Keith Dec 08 '17 at 05:18