
I am using TfidfVectorizer and cosine_similarity from scikit-learn. When I take a new string and compute its cosine similarity to the strings in the original training corpus, I notice that the similarity is 1.0 whenever the new string is an exact match of a corpus string plus additional novel tokens, no matter how many such additional tokens there are.

For example, a new string "a b x y z" will have cosine similarity 1.0 to an original string "a b" if x, y, and z do not appear in the original corpus at all.

I understand how this happens, because the novel tokens are ignored when vectorizing the new string according to the features established by the training corpus, but I want to be able to detect that "a b x y z" is NOT really a "perfect" match to "a b".

Any ideas on how I could incorporate something into the matching that would be sensitive to this type of difference (presence of novel tokens)?


Edit: here is an illustration based on the comments of @Arash:

The scenario I am describing tries to match novel input to the trained corpus:

    corpus = (
        "The sky is blue",
        "The sun is bright",
        "The sun in the sky is bright",
        "We can see the shining sun, the bright sun"
    )
    input = (
        "The sky is blue",
        "They say the sky is blue you know",
    )

    from sklearn.feature_extraction.text import TfidfVectorizer
    tfidf_vectorizer = TfidfVectorizer()
    tfidf_corpus = tfidf_vectorizer.fit_transform(corpus)
    print(tfidf_corpus.shape)

    tfidf_input = tfidf_vectorizer.transform(input)
    print(tfidf_input.shape)

    from sklearn.metrics.pairwise import cosine_similarity
    cosine_similarity(tfidf_input, tfidf_corpus)

The output is

    (4, 11)
    (2, 11)
    array([[1.        , 0.36651513, 0.52305744, 0.13448867],
           [1.        , 0.36651513, 0.52305744, 0.13448867]])

So you can see that both input strings/documents get a perfect similarity of 1.0 to the corpus's "The sky is blue" even though the second input ("They say the sky is blue you know") has several non-matching words (which happen not to appear in the corpus).

I would have expected that the cosine similarity for my second input item could be computed on an extended vector that adds elements for the novel words (each of which has a document frequency of 0 and so maximal IDF), so that the similarity would come out lower.
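
For concreteness, here is a rough sketch of that calculation (my own illustration, not existing scikit-learn behaviour). It reuses tfidf_vectorizer, corpus and tfidf_corpus from the snippet above and assumes the vectorizer's default settings (smooth_idf=True, raw term counts, no sublinear scaling):

    # Sketch: shrink the cosine similarity as if the query vector had extra
    # columns for the novel tokens, each weighted with the IDF a zero-df term
    # would get under smooth_idf, i.e. ln((1 + n_docs) / 1) + 1.
    from collections import Counter
    import numpy as np

    analyzer = tfidf_vectorizer.build_analyzer()
    vocab = tfidf_vectorizer.vocabulary_
    idf = tfidf_vectorizer.idf_
    oov_idf = np.log(1 + len(corpus)) + 1  # IDF of a term with document frequency 0

    def extended_similarity(text, tfidf_corpus):
        counts = Counter(analyzer(text))
        # un-normalised tf-idf weights, split into known and novel terms
        known_sq = sum((c * idf[vocab[t]]) ** 2 for t, c in counts.items() if t in vocab)
        novel_sq = sum((c * oov_idf) ** 2 for t, c in counts.items() if t not in vocab)
        base = cosine_similarity(tfidf_vectorizer.transform([text]), tfidf_corpus)
        if known_sq == 0:
            return np.zeros_like(base)
        # the extra columns only inflate the query's norm, so every similarity
        # shrinks by the same factor
        return base * np.sqrt(known_sq / (known_sq + novel_sq))

    print(extended_similarity("They say the sky is blue you know", tfidf_corpus))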

Simply including the new input data in the corpus is not a good general solution, since we might receive new inputs one at a time in the future and we don't want to re-fit the vectorizer on the entire corpus every time we want to match one new input document/string.

dabru
  • This seems more appropriate for [Cross-validated](https://stats.stackexchange.com) as it asks for a methodology and not a programming issue in scikit-learn per se. – Vivek Kumar Jun 01 '20 at 13:13

1 Answer


I can't replicate what you are describing. Try this:

    documents = (
    "The sky is blue",
    "The sky is blue you know",
    "The sun is bright",
    "The sun in the sky is bright",
    "We can see the shining sun, the bright sun"
    )

    from sklearn.feature_extraction.text import TfidfVectorizer
    tfidf_vectorizer = TfidfVectorizer()
    tfidf_matrix = tfidf_vectorizer.fit_transform(documents)
    print(tfidf_matrix.shape)

    from sklearn.metrics.pairwise import cosine_similarity
    cosine_similarity(tfidf_matrix[0:1], tfidf_matrix)

Your output will be

    array([[1.        , 0.67166626, 0.35369001, 0.50353381, 0.13245011]])

This indicates that the similarity between the first and second sentences is about 0.67, not 1.0.


EDITED: The issue you are describing is that the two vectors come out identical. You want the TF-IDF to take the new words, and their frequencies, into account. So you can just do this:

    # refit tfidf_vectorizer to the corpus and new documents
    tfidf_vectorizer.fit(input + corpus)
    # transform using the new model
    tfidf_input = tfidf_vectorizer.transform(input)
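
As a quick check (a sketch reusing the corpus and input tuples from the question), you can re-transform the corpus with the refit model and compare again:

    # re-transform the corpus with the refit model and recompute similarities
    tfidf_corpus = tfidf_vectorizer.transform(corpus)
    print(cosine_similarity(tfidf_input, tfidf_corpus))
    # the second input should now score below 1.0 against "The sky is blue",
    # because "they", "say", "you" and "know" now have their own columns
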
Arash
  • That's not quite my scenario. I have modified your example and added it to the original question to clarify my question. – dabru Jun 02 '20 at 12:35
  • @dabru: maybe you're confusing yourself by looking at the cosine similarity. In your updated question, just look at `tfidf_input.toarray()`. To paraphrase your question, the two vectors are identical and you don't want that because you want the new vocabulary to be taken into account. So, just do exactly that. You'll need to add the new documents to the old ones and fit/transform again. – Arash Jun 04 '20 at 04:44
  • Take a real-time scenario where a new sentence comes in every few milliseconds and we must find its best match within the existing corpus and send it out within a few more milliseconds. We can't re-train the entire vocabulary for every input sentence. We wouldn't have to if vectorizer.transform() were to automatically extend its output by adding columns for any novel words/ngrams in its input, and then if cosine_similarity were to pad the shorter vector (the one from original corpus) with zeros before calculating the similarity. – dabru Jun 04 '20 at 15:57
  • You can do it, but you won't be dealing with TF-IDF vectors anymore. If you do re-fit it, all the vectors will be different. Here's what you can do: since the probability of getting new words will be smaller and smaller as your data grows, first tokenize your words by setting out of vocab (e.g. here https://stackoverflow.com/questions/48432300/using-keras-tokenizer-for-new-words-not-in-training-set) and then do TF-IDF transform. You can refit tokenizer and TF-IDF periodically to improve the accuracy of your distances (see the sketch after these comments). – Arash Jun 05 '20 at 04:33
  • See if ^ will have a reasonable output – Arash Jun 05 '20 at 04:35
  • Okay so you are saying that I could use Keras tokenizer which supports an "out of vocabulary" token, whereas scikit-learn lacks this capability? – dabru Jun 05 '20 at 12:47
  • I'd generally NOT rely on sklearn for my NLP tasks. Search for NLTK or Spacy with out of vocabulary or OOV and you'll find plenty of posts. – Arash Jun 06 '20 at 05:34
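
Following up on the OOV-token idea from the comments, here is a rough sketch of a placeholder-token workaround using only scikit-learn (TfidfVectorizer has no built-in out-of-vocabulary handling, so the placeholder name "oovtoken" and the helper transform_with_oov below are made up for illustration; corpus and input are the tuples from the question):

    # Map tokens the model has never seen to a reserved placeholder so they
    # still contribute (mismatching) weight, without refitting on each input.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    OOV = "oovtoken"
    vectorizer = TfidfVectorizer()
    vectorizer.fit(list(corpus) + [OOV])   # reserve a column for the placeholder
    analyzer = vectorizer.build_analyzer()
    vocab = vectorizer.vocabulary_

    def transform_with_oov(texts):
        # replace unknown tokens with the placeholder before transforming
        mapped = [" ".join(t if t in vocab else OOV for t in analyzer(doc))
                  for doc in texts]
        return vectorizer.transform(mapped)

    tfidf_corpus = vectorizer.transform(corpus)
    tfidf_input = transform_with_oov(input)
    print(cosine_similarity(tfidf_input, tfidf_corpus))  # second row now below 1.0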