0

I want to have a measure of similarity between two points in a cluster. Would the similarity calculated this way be an acceptable measure of similarity between the two datapoint?

Say I have to vectors: vector A and vector B that are in the same cluster. I have trained a cluster which is denoted by model and then model.computeCost() computes thesquared distance between the input point and the corresponding cluster center.

(I am using Apache Spark MLlib)

val costA = model.computeCost(A)
val costB = model.computeCost(B)

val dissimilarity = |cost(A)-cost(B)|

Dissimilarity i.e. the higher the value, the more unlike each other they are.

Mnemosyne
  • 1,162
  • 4
  • 13
  • 45
  • and what is "A" and model.computeCost? For sure it cannot be called similarity, since it is 0 when points are the same, and bigger otherwise, but to be more specific you have to provide more details about what these objects/functions are. – lejlot Sep 28 '16 at 21:28
  • just added some more details. Sorry for the vagueness. – Mnemosyne Sep 28 '16 at 21:31
  • that's actually more of a dissimilarity actually. The higher the value, the more dissimilar they are. – Mnemosyne Sep 28 '16 at 21:33

2 Answers2

1

If you are just asking is this a valid metric then the answer is almost, it is a valid pseudometric if only .computeCost is deterministic.

For simplicity i denote f(A) := model.computeCost(A) and d(A, B) := |f(A)-f(B)|

Short proof: d is a L1 applied to an image of some function, thus is a pseudometric itself, and a metric if f is injective (in general, yours is not).

Long(er) proof:

  • d(A,B) >= 0 yes, since |f(A) - f(B)| >= 0
  • d(A,B) = d(B,A) yes, since |f(A) - f(B)| = |f(B) - f(A)|
  • d(A,B) = 0 iff A=B, no, this is why it is pseudometric, since you can have many A != B such that f(A) = f(B)
  • d(A,B) + d(B,C) <= d(A,C), yes, directly from the same inequality for absolute values.

If you are asking will it work for your problem, then the answer is it might, depends on the problem. There is no way to answer this without analysis of your problem and data. As shown above this is a valid pseudometric, thus it will measure something decently behaving from mathematical perspective. Will it work for your particular case is completely different story. The good thing is most of the algorithms which work for metrics will work with pseudometrics as well. The only difference is that you simply "glue together" points which have the same image (f(A)=f(B)), if this is not the issue for your problem - then you can apply this kind of pseudometric in any metric-based reasoning without any problems. In practise, that means that if your f is

computes the sum of squared distances between the input point and the corresponding cluster center

this means that this is actually a distance to closest center (there is no summation involved when you consider a single point). This would mean, that 2 points in two separate clusters are considered identical when they are equally far away from their own clusters centers. Consequently your measure captures "how different are relations of points and their respective clusters". This is a well defined, indirect dissimilarity computation, however you have to be fully aware what is happening before applying it (since it will have specific consequences).

lejlot
  • 64,777
  • 8
  • 131
  • 164
1

Your "cost" is actually the distance to the center.

Points that have the same distance to the center are considered to be identical (distance 0), which creates a really odd pseudonetric, because it ignores where on the circle of that distance points are.

It's not very likely this will work on your problem.

Has QUIT--Anony-Mousse
  • 76,138
  • 12
  • 138
  • 194