I want to compute a distance matrix from a dictionary like the following:

y = {"a": ndarray1, "b": ndarray2, "c": ndarray3}

The value for each key ("a", "b", "c") is an np.ndarray, and the arrays have different sizes. I have a dist() function that computes the distance between two of them, e.g. dist(y["a"], y["b"]).

The resulting distance matrix would be:

+------------------------------------------------------------------+
|    | a   b                           c                           |
+------------------------------------------------------------------+
| a  | 0   mydist(ndarray1, ndarray2)  mydist(ndarray1, ndarray3)  |
| b  |     0                           mydist(ndarray2, ndarray3)  |
| c  |                                 0                           |
+------------------------------------------------------------------+

I have tried scipy.spatial.distance.pdist with pdist(y, mydist), but got the following error:

[X] = _copy_arrays_if_base_present([_convert_to_double(X)])
  File "/usr/local/lib/python2.7/dist-packages/scipy/spatial/distance.py", line 113, in _convert_to_double
X = X.astype(np.double)
TypeError: float() argument must be a string or a number

Can anyone tell me how to implement this pdist myself? I want to use the pdist result for hierarchical clustering afterwards.

CT Zhu
Rain Lee

1 Answer

The first part of your question is quite clear, but I don't understand what you are asking in the second part. Why do you need to re-implement scipy.spatial.distance.pdist? I thought you already had a dist() function to calculate the distance between a pair.

To get the pairwise distances when you already have a dist() function:

In [69]:
D={'a':some_value,'b':some_value,'c':some_value}
In [70]:
import itertools
In [71]:
list(itertools.combinations(D,2))
Out[71]:
[('a', 'c'), ('a', 'b'), ('c', 'b')]

In [72]: #this is what you need:
[dist(*map(D.get, item)) for item in itertools.combinations(D,2)]
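
If the goal is to feed the result into hierarchical clustering, the list comprehension above can be extended along these lines. This is only a sketch under some assumptions: the dist() below is a made-up stand-in (absolute difference of means) and the arrays are invented, since the real ones are not shown. The keys are sorted first so that the pair order is fixed, which keeps track of which entry belongs to which pair and matches the condensed ordering that pdist produces.

```python
import itertools

import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import is_valid_y, squareform


def dist(u, v):
    # Hypothetical stand-in for the real dist(): absolute difference of means,
    # chosen only because it works on arrays of different sizes.
    return abs(float(np.mean(u)) - float(np.mean(v)))


# Invented example data; the arrays deliberately have different lengths.
y = {"a": np.array([1.0, 2.0, 3.0]), "b": np.array([4.0, 6.0]), "c": np.array([10.0])}

# Fix the key order explicitly instead of relying on dict ordering.
labels = sorted(y)
pairs = list(itertools.combinations(labels, 2))  # (a, b), (a, c), (b, c)

# Condensed distance vector as a 1-D float array, one entry per pair.
dm = np.array([dist(y[i], y[j]) for i, j in pairs], dtype=float)

# Look up which pair produced which distance.
pair_to_dist = dict(zip(pairs, dm))

# Full symmetric matrix; rows and columns follow the order of `labels`.
square = squareform(dm)

# The condensed vector can be passed directly to linkage.
Z = linkage(dm, method="average")
```

Here pairs[k] names the two keys behind dm[k], and is_valid_y(dm) can be used to check that the vector has the shape scipy expects. The choice of method="average" is arbitrary.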
CT Zhu
  • This seems to be what I was looking for. (1) How can I associate my entry names ("a", "b") with the entry values in the resulting dm? Since the dm would be something like [1, 2, 3], I want to know that dm[3] came from dist("b", "c"), which I thought pdist() might be able to provide. Also, a dm like [1, 2, 3] does not pass the is_valid_y() function from the scipy.spatial.distance package. (2) What is the meaning of "*" in your last line? The dist() function expects two input values "a" and "b", but the code above seems to call it with a single input value (a, b). Is there a quick solution for this? I could change my function, though. – Rain Lee Mar 03 '14 at 16:52
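
On point (2), the "*" is Python's argument unpacking: it spreads a sequence out into separate positional arguments, so dist() still receives two inputs and does not need to change. A minimal sketch, with a hypothetical dist() and made-up scalar values in place of the real arrays:

```python
def dist(u, v):
    # Hypothetical two-argument function standing in for the real dist().
    return abs(u - v)


D = {"a": 1.0, "b": 4.0}
item = ("a", "b")

# Look up the value for each key in the pair.
args = list(map(D.get, item))

# The * spreads the two values out as two separate arguments.
unpacked = dist(*args)

# Equivalent call written out without unpacking.
explicit = dist(D[item[0]], D[item[1]])
```

Both calls pass exactly two arguments, so unpacked and explicit hold the same value.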