
I have a set of 300,000 or so vectors which I would like to compare in some way, and given one vector I want to be able to find the closest one. I have thought of four methods:

  • Simple Euclidean distance
  • Cosine similarity
  • Use a kernel (for instance Gaussian) to calculate the Gram matrix.
  • Treat the vector as a discrete probability distribution (which makes sense to do) and calculate some divergence measure.

I do not really understand when it is useful to use one rather than another. My data has a lot of zero elements. With that in mind, is there some general rule of thumb as to which of these methods is best?
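To make the comparison concrete, here is a minimal sketch of the four measures on a pair of made-up sparse vectors. The vectors, the kernel bandwidth, and the smoothing epsilon for the divergence are all arbitrary illustration choices, not anything from my actual data:

```python
import numpy as np

# Two illustrative sparse vectors (made-up data).
u = np.array([0.0, 3.0, 0.0, 1.0, 0.0, 0.0])
v = np.array([0.0, 2.0, 0.0, 0.0, 4.0, 0.0])

# 1. Euclidean distance: sensitive to differences in magnitude.
euclidean = np.linalg.norm(u - v)

# 2. Cosine similarity: ignores magnitude, compares direction only.
cosine = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# 3. Gaussian (RBF) kernel, i.e. one entry of the Gram matrix;
#    the bandwidth sigma is a free parameter chosen arbitrarily here.
sigma = 1.0
gaussian = np.exp(-euclidean**2 / (2 * sigma**2))

# 4. KL divergence after normalising each vector to a probability
#    distribution; a small epsilon avoids log(0) on the zero entries.
eps = 1e-12
p = (u + eps) / (u + eps).sum()
q = (v + eps) / (v + eps).sum()
kl = np.sum(p * np.log(p / q))

print(euclidean, cosine, gaussian, kl)
```

Even on this toy pair you can see the behaviours differ: the Euclidean distance is large because one vector has a big entry the other lacks, while the cosine similarity only reflects the overlap in direction.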

Sorry for the weak question, but I had to start somewhere...

Thank you!

halfdanr

2 Answers


Your question is not quite clear: are you looking for a distance metric between vectors, or for an algorithm to efficiently find the nearest neighbour?

If your vectors just contain a numeric type such as doubles or integers, you can find a nearest neighbour efficiently using a structure such as a kd-tree, since you are just looking at points in d-dimensional space. See http://en.wikipedia.org/wiki/Nearest_neighbor_search for other methods.
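For example, a sketch using SciPy's `cKDTree` on random stand-in data (the dimension and the data itself are made up for illustration; with 300,000 vectors of your real dimensionality the picture may differ, since kd-trees degrade in high dimensions):

```python
import numpy as np
from scipy.spatial import cKDTree

# Hypothetical stand-in for the ~300,000 vectors; dimension 8 is arbitrary.
rng = np.random.default_rng(0)
data = rng.random((300_000, 8))

# Build the tree once, then answer many queries cheaply.
tree = cKDTree(data)
query = rng.random(8)

# Nearest neighbour under Euclidean distance: returns (distance, index).
dist, idx = tree.query(query, k=1)
print(dist, idx)
```

The build cost is paid once up front, after which each query is typically far cheaper than a linear scan over all 300,000 vectors.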

Otherwise, choosing a distance metric and algorithm is very much dependent on the content of the vectors.

Ross Hemsley

If your vectors are very sparse and binary, you can use the Hamming or Hellinger distance. When the dimensionality of your vectors is large, avoid using the Euclidean distance (see http://en.wikipedia.org/wiki/Curse_of_dimensionality).
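As an illustration, for sparse binary vectors the Hamming distance can be computed from the non-zero index sets alone, which scales with the number of non-zeros rather than the full dimension (the vectors here are made up):

```python
import numpy as np

# Hypothetical binary vectors; with heavy sparsity it is cheaper to
# store only the indices of the non-zero entries.
a = np.array([1, 0, 0, 1, 0, 1, 0, 0])
b = np.array([1, 1, 0, 0, 0, 1, 0, 0])

# Dense Hamming distance: number of positions where the vectors differ.
hamming = int(np.sum(a != b))

# Equivalent set-based computation over the non-zero indices only;
# the symmetric difference holds exactly the positions that disagree.
sa, sb = set(np.flatnonzero(a)), set(np.flatnonzero(b))
hamming_sparse = len(sa ^ sb)

print(hamming, hamming_sparse)
```

For 300,000 mostly-zero vectors, the set-based form means you never have to materialise the full-length vectors at all.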

Please refer to http://citeseerx.ist.psu.edu/viewdoc/download?rep=rep1&type=pdf&doi=10.1.1.154.8446 for a survey of distance/similarity measures, although the paper limits itself to pairs of probability distributions.

sudar