I have a set of roughly 300,000 vectors which I would like to compare in some way: given one vector, I want to be able to find the closest vector. I have thought of four methods.
- Simple Euclidean distance
- Cosine similarity
- Use a kernel (for instance Gaussian) to calculate the Gram matrix.
- Treat the vector as a discrete probability distribution (which makes sense to do) and calculate some divergence measure.
I do not really understand when it is useful to choose one rather than another. My data has a lot of zero elements. With that in mind, is there some general rule of thumb as to which of the four methods is best?
Sorry for the weak question, but I had to start somewhere...
Thank you!