0

Suppose I have an object X with a set of 10 features: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0].

Then, I have two more objects:

  • A : [2, 2, 2, 2, 2, 2, 2, 2, 2, 2]
  • B : [0, 0, 0, 0, 0, 0, 0, 0, 0, 20]

I need to know which of A and B is "closer" to X.

The idea I have in mind behind "similarity" is:

It is better for all features to be nearly the same than for many to be very close while a few are very different.

According to this "definition", A seems closer to X than B.

However, the arithmetic mean of the feature differences does not seem to be the right tool to implement this idea, because it is 2 for both objects.

Is there a particular metric for this kind of problem, please?

Delgan

3 Answers

1

What about the Euclidean distance?

In your case, the Euclidean distance between A and X is the square root of 40 (= 6.32 approximately) and the distance between B and X is 20, so A is indeed more similar by that metric.
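As a rough illustration of the numbers above, here is a minimal Python sketch. The helper name `euclidean_distance` is made up for this example and is not from any library. Squaring each difference before summing is what penalizes one large deviation more than many small ones, which is why the tie seen with the arithmetic mean disappears.

```python
import math

def euclidean_distance(u, v):
    """Straight-line (L2) distance between two equal-length feature vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Vectors from the question
X = [0] * 10
A = [2] * 10
B = [0] * 9 + [20]

dist_a = euclidean_distance(A, X)  # sqrt(10 * 2**2) = sqrt(40)
dist_b = euclidean_distance(B, X)  # sqrt(20**2) = 20

print(f"d(A, X) = {dist_a:.2f}")
print(f"d(B, X) = {dist_b:.2f}")
print("A is closer" if dist_a < dist_b else "B is closer")
```

If NumPy is available, `numpy.linalg.norm(a - x)` on arrays should give the same values.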

jrsala
1

You could also consider using cosine similarity. Cosine similarity is the cosine of the angle between two vectors as seen from the origin, so it compares their directions and ignores their magnitudes, while Euclidean distance measures the straight-line distance between the two points.

Here is a great article on when to pick one over the other.

Another common measure is Jaccard similarity. Here is an article comparing cosine to Jaccard similarity.
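For the cosine similarity mentioned above, a small sketch may help. The function name `cosine_similarity` and the extra vector `C` are assumptions introduced only for this example. One caveat worth noting: the X in the question is the all-zero vector, which has no direction, so cosine similarity against it is undefined. The sketch returns `None` in that case instead of dividing by zero.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between u and v, or None if either vector is all zeros."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return None  # a zero vector has no direction, so the angle is undefined
    return dot / (norm_u * norm_v)

X = [0] * 10
A = [2] * 10
B = [0] * 9 + [20]
C = [4] * 10  # hypothetical vector: same direction as A, twice the magnitude

sim_ax = cosine_similarity(A, X)  # undefined: X is the zero vector
sim_ac = cosine_similarity(A, C)  # close to 1: same direction, magnitude ignored
sim_ab = cosine_similarity(A, B)  # well below 1: B points along a single axis

print(sim_ax, sim_ac, sim_ab)
```

Because magnitude is ignored, A and C come out as maximally similar even though their Euclidean distance is not zero. Whether that is desirable depends on what "closer" should mean for the data.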

Community
CJ Sullivan
0

When the features are on very different scales or vary by very different amounts, the Euclidean distance has to be normalized.

This can be done using the Mahalanobis distance, which takes the variances (and covariances) of the features into account.

Mahalanobis distance

Also, see this question.
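To make the normalization idea concrete, here is a hedged sketch of the diagonal-covariance special case of the Mahalanobis distance (sometimes called the standardized Euclidean distance). Each squared difference is divided by that feature's variance. The full Mahalanobis distance uses the inverse of the whole covariance matrix, which this sketch deliberately leaves out. The function name and the variance values are hypothetical, since the question does not say how the features are distributed.

```python
import math

def diagonal_mahalanobis(u, v, variances):
    """Mahalanobis distance assuming uncorrelated features (diagonal covariance)."""
    if not (len(u) == len(v) == len(variances)):
        raise ValueError("vectors and variances must have the same length")
    if any(var <= 0 for var in variances):
        raise ValueError("variances must be positive")
    return math.sqrt(sum((a - b) ** 2 / var for a, b, var in zip(u, v, variances)))

X = [0] * 10
A = [2] * 10
B = [0] * 9 + [20]

# Hypothetical variances: the last feature is assumed to fluctuate far more
# than the others, so a difference of 20 there is less surprising.
variances = [1.0] * 9 + [100.0]

maha_a = diagonal_mahalanobis(A, X, variances)
maha_b = diagonal_mahalanobis(B, X, variances)

# With unit variances the measure reduces to the plain Euclidean distance.
unit_a = diagonal_mahalanobis(A, X, [1.0] * 10)

print(f"A: {maha_a:.3f}  B: {maha_b:.3f}  A (unit variances): {unit_a:.3f}")
```

Under these assumed variances B ends up closer to X than A, the opposite of the plain Euclidean result. The normalization only helps if the variances are estimated from real data.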

Community
Delgan