I am using TfidfVectorizer and cosine_similarity from scikit-learn. When I take a new string and compute its cosine similarity to the strings in the original training corpus, the similarity comes out as 1.0 whenever the new string contains an exact match of a corpus string plus additional tokens that never appear in the corpus, no matter how many such extra tokens there are.
For example, a new string "a b x y z" will have cosine similarity 1.0 to an original string "a b" if x, y, and z do not appear anywhere in the original corpus.
I understand why this happens: the novel tokens are simply dropped when the new string is vectorized against the features established by the training corpus. But I want to be able to detect that "a b x y z" is NOT really a "perfect" match for "a b".
Any ideas on how I could make the matching sensitive to this type of difference (the presence of novel tokens)?
Edit: here is an illustration based on the comments of @Arash:
The scenario I am describing tries to match novel input to the trained corpus:
corpus = (
    "The sky is blue",
    "The sun is bright",
    "The sun in the sky is bright",
    "We can see the shining sun, the bright sun",
)
input = (
    "The sky is blue",
    "They say the sky is blue you know",
)
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer()
tfidf_corpus = tfidf_vectorizer.fit_transform(corpus)
print(tfidf_corpus.shape)
tfidf_input = tfidf_vectorizer.transform(input)
print(tfidf_input.shape)
from sklearn.metrics.pairwise import cosine_similarity
cosine_similarity(tfidf_input, tfidf_corpus)
The output is
(4, 11)
(2, 11)
array([[1. , 0.36651513, 0.52305744, 0.13448867],
[1. , 0.36651513, 0.52305744, 0.13448867]])
So you can see that both input strings/documents get a perfect similarity of 1.0 to the corpus's "The sky is blue", even though the second input ("They say the sky is blue you know") contains several extra words that do not appear anywhere in the corpus.
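You can verify that the two inputs are vectorized to literally the same vector, because "they", "say", "you" and "know" fall outside the fitted vocabulary and are dropped:

# the two input rows are identical, since the extra tokens are not in the
# fitted vocabulary and are silently ignored by transform()
print((tfidf_input[0] != tfidf_input[1]).nnz)  # prints 0, i.e. no differing entries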
I would have expected that the cosine similarity for my second input could be computed on an extended vector that adds elements for the novel words (each of which has a document frequency of 0 and therefore a maximal IDF), so that the similarity would come out lower.
Simply adding the new input data to the corpus is not a good general solution, since we may receive new inputs one at a time in the future and we don't want to re-fit the vectorizer on the entire corpus every time we need to match one new input document/string.
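To make the extended-vector idea concrete, here is a rough sketch of the kind of adjustment I imagine (assuming the vectorizer's default settings, i.e. smooth_idf=True and l2 normalization; names like novel_idf and adjusted are just mine for illustration). Each novel token is given the IDF it would have with a document frequency of 0; since the corpus vectors are zero in those extra dimensions, the only effect is to inflate the norm of the input vector, which scales its similarities down:

import numpy as np

analyzer = tfidf_vectorizer.build_analyzer()
vocab = tfidf_vectorizer.vocabulary_
idf = tfidf_vectorizer.idf_
n_docs = tfidf_corpus.shape[0]

# IDF a term would get with document frequency 0 under smooth_idf=True:
# idf = ln((1 + n_docs) / (1 + df)) + 1, with df = 0
novel_idf = np.log(1.0 + n_docs) + 1.0

sims = cosine_similarity(tfidf_input, tfidf_corpus)
adjusted = sims.copy()

for i, doc in enumerate(input):
    tokens = analyzer(doc)
    known = [t for t in tokens if t in vocab]
    novel = [t for t in tokens if t not in vocab]
    # un-normalized tf-idf "mass" of the known and novel parts of the document
    known_sq = sum((known.count(t) * idf[vocab[t]]) ** 2 for t in set(known))
    novel_sq = sum((novel.count(t) * novel_idf) ** 2 for t in set(novel))
    if known_sq + novel_sq > 0:
        # the corpus vectors are zero in the novel dimensions, so extending the
        # input vector only grows its norm; rescale the cosine accordingly
        adjusted[i] *= np.sqrt(known_sq / (known_sq + novel_sq))

print(adjusted)

With this, the first input keeps its 1.0 match while the second input's similarity to "The sky is blue" drops well below 1.0, which is the behaviour I am after. But I am not sure this is sound, so I would welcome a cleaner or more standard way to penalize novel tokens.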