
I have a set of 300,000 or so vectors which I would like to compare in some way, and given one vector I want to be able to find the closest one. I have thought of four methods:

  • Simple Euclidean distance
  • Cosine similarity
  • Use a kernel (for instance Gaussian) to calculate the Gram matrix.
  • Treat the vector as a discrete probability distribution (which makes sense to do) and calculate some divergence measure.

I do not really understand when it is useful to use one rather than another. My data has a lot of zero elements. With that in mind, is there some general rule of thumb as to which of these methods is best?
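To make the comparison concrete, here is a minimal sketch of the four measures on a pair of made-up sparse vectors. The vectors, the kernel bandwidth, and the smoothing epsilon for the divergence are all arbitrary illustration choices, not anything from my actual data:

```python
import numpy as np

# Two illustrative sparse vectors (made-up data).
u = np.array([0.0, 3.0, 0.0, 1.0, 0.0, 0.0])
v = np.array([0.0, 2.0, 0.0, 0.0, 4.0, 0.0])

# 1. Euclidean distance: sensitive to differences in magnitude.
euclidean = np.linalg.norm(u - v)

# 2. Cosine similarity: ignores magnitude, compares direction only.
cosine = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# 3. Gaussian (RBF) kernel, i.e. one entry of the Gram matrix;
#    the bandwidth sigma is a free parameter chosen arbitrarily here.
sigma = 1.0
gaussian = np.exp(-euclidean**2 / (2 * sigma**2))

# 4. KL divergence after normalising each vector to a probability
#    distribution; a small epsilon avoids log(0) on the zero entries.
eps = 1e-12
p = (u + eps) / (u + eps).sum()
q = (v + eps) / (v + eps).sum()
kl = np.sum(p * np.log(p / q))

print(euclidean, cosine, gaussian, kl)
```

Even on this toy pair you can see the behaviours differ: the Euclidean distance is large because one vector has a big entry the other lacks, while the cosine similarity only reflects the overlap in direction.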

Sorry for the weak question, but I had to start somewhere...

Thank you!

halfdanr

2 Answers


Your question is not quite clear: are you looking for a distance metric between vectors, or for an algorithm to efficiently find the nearest neighbour?

If your vectors just contain a numeric type such as doubles or integers, you can find a nearest neighbour efficiently using a structure such as a kd-tree, since you are just looking at points in d-dimensional space. See http://en.wikipedia.org/wiki/Nearest_neighbor_search for other methods.
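For example, a sketch using SciPy's `cKDTree` on random stand-in data (the dimension and the data itself are made up for illustration; with 300,000 vectors of your real dimensionality the picture may differ, since kd-trees degrade in high dimensions):

```python
import numpy as np
from scipy.spatial import cKDTree

# Hypothetical stand-in for the ~300,000 vectors; dimension 8 is arbitrary.
rng = np.random.default_rng(0)
data = rng.random((300_000, 8))

# Build the tree once, then answer many queries cheaply.
tree = cKDTree(data)
query = rng.random(8)

# Nearest neighbour under Euclidean distance: returns (distance, index).
dist, idx = tree.query(query, k=1)
print(dist, idx)
```

The build cost is paid once up front, after which each query is typically far cheaper than a linear scan over all 300,000 vectors.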

Otherwise, choosing a distance metric and algorithm is very much dependent on the content of the vectors.

Ross Hemsley

If your vectors are very sparse and binary, you can use the Hamming or Hellinger distance. When the dimensionality of your vectors is large, avoid using the Euclidean distance (see http://en.wikipedia.org/wiki/Curse_of_dimensionality).
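As an illustration, for sparse binary vectors the Hamming distance can be computed from the non-zero index sets alone, which scales with the number of non-zeros rather than the full dimension (the vectors here are made up):

```python
import numpy as np

# Hypothetical binary vectors; with heavy sparsity it is cheaper to
# store only the indices of the non-zero entries.
a = np.array([1, 0, 0, 1, 0, 1, 0, 0])
b = np.array([1, 1, 0, 0, 0, 1, 0, 0])

# Dense Hamming distance: number of positions where the vectors differ.
hamming = int(np.sum(a != b))

# Equivalent set-based computation over the non-zero indices only;
# the symmetric difference holds exactly the positions that disagree.
sa, sb = set(np.flatnonzero(a)), set(np.flatnonzero(b))
hamming_sparse = len(sa ^ sb)

print(hamming, hamming_sparse)
```

For 300,000 mostly-zero vectors, the set-based form means you never have to materialise the full-length vectors at all.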

Please refer to http://citeseerx.ist.psu.edu/viewdoc/download?rep=rep1&type=pdf&doi=10.1.1.154.8446 for a survey of distance/similarity measures, although the paper limits itself to pairs of probability distributions.

sudar