
Cosine similarity is widely used for measuring the similarity between two vectors, which could be word vectors or document vectors.

Other measures, such as Manhattan, Euclidean, and Minkowski distance, are also popular.

Cosine similarity gives a number between 0 and 1 (at least for vectors with non-negative components), so it SEEMS like a percentage of similarity between the two vectors. Euclidean distance, by contrast, gives numbers that vary over a wide range.


When the cosine similarity between two vectors comes out as 0.78xxx, people (including me) tend to read it as "these two vectors are 78% similar!", even though that is not the actual "similarity degree" of the two vectors.


Unlike cosine similarity, Minkowski, Manhattan, Canberra, etc. can even give large numbers that do not fall in the 0 to 1 range.

For the word1:word2 example:

0.78 (cosine, between 0 and 1)
9.54 (Euclidean, the actual distance between the two vectors)
158.417 (Canberra)
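
For concreteness, here is a minimal sketch of how raw scores like these can be computed with SciPy; `word1_vec` and `word2_vec` below are hypothetical random stand-ins for real embeddings, so the printed numbers will differ from those above:

```python
import numpy as np
from scipy.spatial import distance

# Hypothetical stand-ins for real word embeddings; vectors from any
# trained model would be used here instead.
rng = np.random.default_rng(0)
word1_vec = rng.random(100)
word2_vec = rng.random(100)

# scipy's `cosine` returns the cosine *distance*, i.e. 1 - cosine similarity.
cosine_sim = 1.0 - distance.cosine(word1_vec, word2_vec)   # in [0, 1] for non-negative vectors
euclidean_dist = distance.euclidean(word1_vec, word2_vec)  # unbounded
canberra_dist = distance.canberra(word1_vec, word2_vec)    # unbounded

print(cosine_sim, euclidean_dist, canberra_dist)
```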


I expect there are some normalization methods broadly used to represent the actual "similarity degree" between two vectors. Please share any that you know of. Pointers to articles or papers would be even better.

For the word1:word2 example, the desired output would look like:

0.848 (cosine, transformed into a normalized number)
0.758 (Euclidean, normalized to between 0 and 1)
0.798 (Canberra, normalized to between 0 and 1)

I do not expect an answer that brings up softmax, because I have read an article explaining that the softmax output itself should not be interpreted as an actual percentage.

Isaac Sim

1 Answer


You would have to rigorously define what you mean by "actual 'similarity degree'" for any answer to be possible.

Each of these measures can be useful. Each could be scaled to a value from 0.0 to 1.0, if you needed things in that range. But that wouldn't necessarily make any of them a "percentage similarity", because "percentage similarity" isn't a concept with a rigorous meaning.
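
For illustration, a minimal sketch of two simple rescalings that map non-negative distances into that range (both also come up in the comments below); the variable names and sample values are hypothetical:

```python
import numpy as np

# Hypothetical raw distances (e.g. Euclidean, Canberra) for several pairs.
raw = np.array([0.3, 9.54, 158.417])

# 1. Divide by a known (or observed) maximum M: every value becomes <= 1.0.
M = raw.max()
scaled = raw / M

# 2. v' = 1 - 1/(1+v): maps any non-negative value into [0, 1).
bounded = 1.0 - 1.0 / (1.0 + raw)

print(scaled)
print(bounded)
```

Note that both transforms only rescale magnitudes. Since these are distances (smaller means more alike), you would still flip them, e.g. `1 - scaled`, to read them as similarities, and neither transform makes scores comparable across differently distributed measures.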

gojomo
  • Hi, gojomo. I understand that it is almost impossible to map these scores into a fixed range, because they are all relative to the domain they come from. The similarity for exactly the same two sentences can be totally different across domains; for example, the cosine similarity between sentences A and B is about 0.87 in domain #1 but 0.51 in domain #2. That is why I wonder (1) whether there is any method to generalize similarity measures in a case like this, and (2) how to rescale the scores from other algorithms to the 0.0 to 1.0 range. – Isaac Sim May 16 '18 at 02:23
  • I am not sure whether the intent of my question came across well. It seems like you understood what I was asking, though. – Isaac Sim May 16 '18 at 02:24
  • You're right that the same sentences, sentence A and sentence B, will have different cosine-similarities in different models – dependent, for example, on what *other* sentences were included in training, changing senses of words, the number of vector dimensions, the number of `negative` examples, etc. But it's still not clear what you want to do about that - there's no 'one true similarity' to which each could be converted, it's all dependent on data, parameters, goals, etc. – gojomo May 16 '18 at 20:40
  • If your real aim is just "make wider ranges map to [0.0, 1.0]", there are lots of ways to do that, but none that are best for all uses, especially if you then want to compare them to other 0.0-1.0 values that have different distributions. For example, if you know the max value M for a value, just divide all values by M – voila, now all values are <= 1.0. Another possibility is `v' = 1 - ( 1 / (1+v))` – turns all positive values into values from 0 to 1. But whether those are comparable to other 0-1 values, or all clumped on one end, depends on the details. – gojomo May 16 '18 at 20:45
  • You can also see https://stats.stackexchange.com/questions/70801/how-to-normalize-data-to-0-1-range – and also note sometimes a log operation can be mixed in to better retain contrast over long value ranges (both are sketched in code after these comments). – gojomo
  • Okay, I understand the point you are making. It is almost impossible to define the 'real similarity' between two sentence vectors (or word vectors), i.e., their degree of similarity. Then how about the 'probability (chance) that two sentence vectors are similar'? I know this question shifts a bit from the original one, but I think it is still worth digging into. What if my aim is to find the 'probability' that two sentences are similar? I think there should be some approaches; do you know of any, or have you thought about it? – Isaac Sim May 17 '18 at 02:07
  • By the way, thanks for the link. I will probably make use of it, since I have no other solutions. – Isaac Sim May 17 '18 at 02:09
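
Following up on the comments above, here is a minimal sketch of min-max scaling (the approach in the linked stats.stackexchange question) and of mixing in a log to retain contrast over long value ranges; the sample distances are made up for illustration:

```python
import numpy as np

def min_max_normalize(values):
    """Rescale linearly so the minimum maps to 0.0 and the maximum to 1.0.
    Assumes the values are not all identical."""
    values = np.asarray(values, dtype=float)
    return (values - values.min()) / (values.max() - values.min())

def log_min_max_normalize(values):
    """Apply log1p before min-max scaling, preserving contrast among
    small values when the range spans several orders of magnitude."""
    logged = np.log1p(np.asarray(values, dtype=float))
    return (logged - logged.min()) / (logged.max() - logged.min())

# Made-up raw distances spanning several orders of magnitude.
distances = [0.3, 9.54, 158.417, 2041.0]
print(min_max_normalize(distances))      # small values are squashed toward 0
print(log_min_max_normalize(distances))  # small values stay distinguishable
```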