3

At the moment i am using the pdist function in Matlab, to calculate the euclidian distances between various points in a three dimensional cartesian system. I'm doing this because i want to know which point has the smallest average distance to all the other points (the medoid). The syntax for pdist looks like this:

% calculate distances between all points
distances = pdist(m);

But because pdist returns a one dimensional array of distances, there is no easy way to figure out which point has the smallest average distance (directly). Which is why i am using squareform and then calculating the smallest average distance, like so:

% convert found distances to matrix of distances
distanceMatrix = squareform(distances);

% find index of point with smallest average distance
[~,j] = min(mean(distanceMatrix,2));

The distances are averaged for each column, and the variable j is the index for the column (and the point) with the smallest average distance.

This works, but squareform takes a lot of time (this piece of code is repeated thousands of times), so i am looking for a way to optimise it. Does anyone know of a faster way to deduce the point with the smallest average distance from the results of pdist?

  • Do you have the same number of points for each call of your mediod function? – reve_etrange Mar 12 '12 at 02:05
  • Also, do you need the pairwise distances themselves? Or just the identity of the mediod? – reve_etrange Mar 12 '12 at 02:56
  • @reve_etrange, the number of points is different from time to time. And in the end all i need is the coordinates of the medoid (so i dont actually need the distances themselves, but use them to calculate the medoid). –  Mar 12 '12 at 15:05
  • I experimented with precomputation of the indices. I think it can be faster if there are always the same number of points. Otherwise @yuk's answer is fastest AFAIK. I have nearly identical code in some of my functions. – reve_etrange Mar 14 '12 at 03:43

4 Answers4

3

I think for your task using SQUAREFORM function is the best way from vectorization view point. If you look at the content of this function by

edit squareform

You will see that it performs a lot of checks that take time of course. Since you know your input to squareform and can be sure it will work, you can create your custom function with just the core of squareform.

[r, c] = size(m);
distanceMatrix = zeros(r);
distanceMatrix(tril(true(r),-1)) = distances;
distanceMatrix = distanceMatrix + distanceMatrix';

Then run the same code as you did to find the medioid.

yuk
  • 19,098
  • 13
  • 68
  • 99
  • I've tried your suggestion, but i get the error: "*In an assignment A(I) = B, the number of elements in B and I must be the same.*", which originates in the line "*distanceMatrix(tril(true(r),-1)) = distances;*". Do you know how i can fix this? –  Mar 12 '12 at 16:03
  • 1
    Sorry, it should be `size(m)`. Make a correction in the answer. – yuk Mar 12 '12 at 16:08
  • That did it. Thanks for the quick reply, this actually sped things up quite a bit! –  Mar 12 '12 at 16:18
  • Where do you use the parameter `c`? I wrote this function `function distanceMatrix = squareform(distances) [r, c] = size(distances); distanceMatrix = zeros(r); distanceMatrix(tril(true(r),-1)) = distances; distanceMatrix = distanceMatrix + distanceMatrix'; end`. I am having some issue with it. `tomatrix` cannot be used with it. Still, not working without any parameters like calling `squareform(m)`. – Léo Léopold Hertz 준영 Mar 03 '16 at 11:54
1

When pdist computes distances between pairs of observations (1,2,...,n), the distances are arranged in the following order:

(2,1), (3,1), ..., (m,1), (3,2), ..., (m,2), ..., (m,m–1))

To demonstrate this, try the following:

> X = [.2 .1 .7 .5]';
> D = pdist(X)
.1  .5  .3   .6  .4  .2

In this example, X stores n=4 observations. The result, D, is a vector of distances between observations (2,1), (3,1), (4,1), (3,2), (4,2), (5,4). This arrangement corresponds with the entries of the lower triangular part of the following n-by-n matrix:

M=
 0  0  0 0
.1  0  0 0
.5 .6  0 0
.3 .4 .2 0

Notice that D(1)=M(2,1), D(2)=(3,1) and so on. So, one way to get the pair of indices in M that correspond with D(k) would be to compute the linear index of D(k) in M. This could be done as follows:

% matrix size
n = 4;
% r(j) is the no. of elements in cols 1..j, belonging to the upper triangular part 
r = cumsum(1:n-1);       
% p(j) is the no. elements in cols 1..j, belonging to the lower triangular part 
p = cumsum(n-1:-1:1);
% The linear index of value D(k)
q = find(p >= k, 1);
% The subscript indices of value D(k)
[i j] = ind2sub([n n], k + r(q));

Notice that n, r and p need to be set only once. From that point, you can find the index for any given k using the last two lines. Let's check this:

for k = 1:6
   q = find(p >= k, 1);
   [i, j] = ind2sub([n n], k + r(q));
   fprintf('D(%d) is the distance between observations (%d %d)\n', k, i, j);
end

Here's the output:
D(1) is the distance between observations (2 1)
D(2) is the distance between observations (3 1)
D(3) is the distance between observations (4 1)
D(4) is the distance between observations (3 2)
D(5) is the distance between observations (4 2)
D(6) is the distance between observations (4 3)

edelburg
  • 33
  • 7
1

Here's an implementation that doesn't require a call to squareform:

N1 = 10;
dim = 5;

% generate points
X = randn(N1, dim);

% find mean distance
for iter=N1:-1:1
    d_mean(iter) = mean(pdist2(X(iter,:),X([1:(iter-1) (iter+1):end],:),'euclidean'));
    % D(iter,:) = pdist2(X(iter,:),X([1:(iter-1) (iter+1):end],:),'euclidean');
end

[val ind] = min(d_mean);

But without knowing more about your problem, I have no idea if it would be faster.

If this is the lynchpin for your program's performance, you may need to consider other speedup options like mex.

Good luck.

Sevenless
  • 2,805
  • 4
  • 28
  • 47
0

There is no need to use squareform:

distances = pdist(m);
l=length(distances);
n=(1+sqrt(1+4*l))/2;
m=[];
for i=1:n
  idx=[1+i:n:length(distances)];
  m(i)=mean(distances(idx));
end

j=min(m);

I am not sure, but maybe this can be vectorised as well, but now it is late.

  • 1
    There are a few problems here. For `N` points there are `N(N-1)/2` distances, not `N^2`. Additionally, `pdist` returns the flattened lower triangle of the full distance matrix, so the simple indexing scheme is incorrect. – reve_etrange Mar 12 '12 at 01:54
  • you should notice that I had written n=sqrt(distances) instead of length(distances). I'll correct this. thanks for the corrections. I'll edit it, but you get the idea. –  Mar 12 '12 at 18:04
  • You can't pre-compute the indices that way, because `pdist` gives you the unique, non-self distances. Look at the output of `example=zeros(10); cnt=0; for i=1:10; for j=i+1:10; cnt=cnt+1; example(j,i)=cnt; end; end;`, which demonstrates the indices into the `pdist` output vector. `N = 10` is also a counterexample to your formula. – reve_etrange Mar 14 '12 at 03:41