
The format of my dataset is [x-coordinate, y-coordinate, hour], where hour is an integer from 0 to 23.

My question is: how can I cluster this data when I need a Euclidean distance metric for the coordinates but a different one for the hours? Since hours wrap around, d(23, 0) should be 1, not the 23 that the Euclidean metric gives. Is it possible to cluster data with a different distance metric for each feature in scipy? How?
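
For example, the distance I would want for the hour feature is the wrap-around one, something like this (hour_dist is just a name made up to illustrate):

def hour_dist(h1, h2):
    d = abs(h1 - h2)
    return min(d, 24 - d)   # hour_dist(23, 0) == 1 instead of 23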

Thank you

    What clustering technique do you want to use? – YXD Sep 11 '13 at 10:15
  • Currently I'm experimenting with kmeans, but any clustering method that gives a good result is fine. – user2768102 Sep 11 '13 at 10:20
  • Are you confident it would converge? The way I would do it would be to monkey patch the [VQ function](https://github.com/scipy/scipy/blob/v0.12.0/scipy/cluster/vq.py#L134) with my own modifications based on the dictionary for each iteration. I don't think it would be overly difficult to do that. – Henry Gomersall Sep 11 '13 at 11:24
  • It should converge if the distances for the different features are well chosen. Currently I'm trying to rewrite part of the kmeans algorithm so it can handle a different distance metric for each feature (roughly along the lines of the sketch after these comments). Since I'm pretty new to Python, this might take a while, but I have a feeling this is the only solution. – user2768102 Sep 11 '13 at 11:55
  • I added a reply, then searched for what clustering was and figured out that you don't really just want to calculate the distance between (x0,y0) and (x1,y1) on one side and the time difference between h0 and h1 on the other side, but want both combined in one data structure. If that's what you want to do, I can undelete my reply though. – usethedeathstar Sep 11 '13 at 14:21
  • Thank you for your answer @usethedeathstar, but the problem is that the current implementation of [kmeans](http://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.vq.kmeans.html#scipy.cluster.vq.kmeans) doesn't let you pick the distance for each feature. So the question remains: how do I make this clustering work? – user2768102 Sep 11 '13 at 14:25
  • I guess you could normalize everything so that nothing has units, but that would get messy with the times needing modulo 24 (assuming only the time of day matters, not which day it is). Did you consider implementing the kmeans part in Python yourself? Just look at how the code works internally in scipy and then change what you need to incorporate the modulo-24 hours. Edit: apparently that whiten function takes care of the normalization; I guess I've run out of inspiration here ;-) – usethedeathstar Sep 11 '13 at 14:32
  • Yep, that's what I'm trying now :). I just hoped there would be a better solution, because my Python isn't that great ;) – user2768102 Sep 11 '13 at 14:39
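
For reference, here is a rough, untested sketch of the kind of rewrite I mean: a plain Lloyd-style k-means loop where the assignment step uses Euclidean distance on the coordinates plus a wrap-around distance on the hour, and the update step takes a circular mean of the hours. The function names and the hour_weight parameter are made up for this sketch; none of it is scipy API.

import numpy as np

def mixed_sq_distances(X, centers, hour_weight=1.0):
    # squared Euclidean distance on (x, y) ...
    d_xy = ((X[:, None, :2] - centers[None, :, :2]) ** 2).sum(axis=2)
    # ... plus a squared wrap-around difference on the hour
    dh = np.abs(X[:, None, 2] - centers[None, :, 2])
    dh = np.minimum(dh, 24 - dh)
    return d_xy + hour_weight * dh ** 2

def circular_mean_hour(hours):
    # mean on a 24-hour circle: map hours to angles, average, map back
    angles = hours * (2 * np.pi / 24)
    mean_angle = np.arctan2(np.sin(angles).mean(), np.cos(angles).mean())
    return (mean_angle * 24 / (2 * np.pi)) % 24

def kmeans_mixed(X, k, n_iter=100, seed=0):
    rng = np.random.RandomState(seed)
    centers = X[rng.choice(len(X), k, replace=False)].astype(float)
    for _ in range(n_iter):
        # assignment step: nearest centre under the mixed distance
        labels = mixed_sq_distances(X, centers).argmin(axis=1)
        # update step: ordinary mean for (x, y), circular mean for the hour
        for j in range(k):
            members = X[labels == j]
            if len(members):
                centers[j, :2] = members[:, :2].mean(axis=0)
                centers[j, 2] = circular_mean_hour(members[:, 2])
    return centers, labels

The circular mean keeps a cluster containing hours 23 and 1 centred near 0 rather than at 12, which is what a plain average would give.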

1 Answer


You'll need to define your own metric that handles "time" in an appropriate way. As the docs for scipy.spatial.distance.pdist explain, you can pass your own distance function:

Y = pdist(X, f)

Computes the distance between all pairs of vectors in X using the user supplied 2-arity function f. [...] For example, Euclidean distance between the vectors could be computed as follows:

dm = pdist(X, lambda u, v: np.sqrt(((u-v)**2).sum()))

The metric can be passed to the scipy clustering routines that take a metric keyword. For example, using linkage:

scipy.cluster.hierarchy.linkage(y, method='single', metric='euclidean')
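
For the [x, y, hour] rows here, a sketch of such a metric might look like the following. It is untested, the relative weighting of the hour term (hour_weight) is something you would have to tune, and the clustering calls just illustrate one way to use the precomputed distances:

import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

def xy_hour_dist(u, v, hour_weight=1.0):
    # Euclidean on (x, y), wrap-around on the hour
    d_xy2 = ((u[:2] - v[:2]) ** 2).sum()
    dh = abs(u[2] - v[2])
    dh = min(dh, 24 - dh)              # so d(23, 0) == 1, not 23
    return np.sqrt(d_xy2 + hour_weight * dh ** 2)

X = np.array([[0.0, 0.0, 23], [0.1, 0.2, 0], [5.0, 5.0, 12]])  # [x, y, hour] rows
D = pdist(X, xy_hour_dist)           # condensed distance matrix
Z = linkage(D, method='single')      # cluster on the precomputed distances
labels = fcluster(Z, t=2, criterion='maxclust')

Note that this will not help with scipy.cluster.vq.kmeans, which is hard-wired to the Euclidean distance; for k-means you would have to write your own assignment and update steps, as discussed in the comments.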
Hooked
  • @user2768102 No problem, and welcome to Stack Overflow! Small tip for better posts: you don't need to say "Thank you/please/Cheers" in the post, as we like to keep the signal-to-noise ratio high. – Hooked Sep 12 '13 at 13:47