9

I have a TF-IDF matrix of a dataset of products:

tfidf = TfidfVectorizer().fit_transform(words)

where words is a list of descriptions. This produces a 69258x22024 matrix.

Now I want to find cosine similarities between a new product and the ones in the matrix, as I need to find the 10 most similar products to it. I vectorize it using the same method above.

However, I cannot multiply the matrices because their sizes differ (the new product has about 6 words, so it becomes a 1x6 matrix). I therefore need a TfidfVectorizer that produces the same number of columns as the original one.

How do I do it?

Mohamed Oun

2 Answers

14

I have found a way to make it work. Instead of calling fit_transform on the new document, fit the vectorizer on the corpus once, so that it learns the corpus vocabulary and IDF weights:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

vectorizer = TfidfVectorizer().fit(words)
datasetTFIDF = vectorizer.transform(words)

The fitted vectorizer can then 'transform' the query into a vector with the same number of columns as the corpus matrix:

queryTFIDF = vectorizer.transform([query])

where query is the query string.
We can then compute the cosine similarities and take the 10 most similar/relevant documents:

cosine_similarities = cosine_similarity(queryTFIDF, datasetTFIDF).flatten()
related_product_indices = cosine_similarities.argsort()[:-11:-1]
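
As a minimal, self-contained sketch of the steps above: the toy corpus, the query string, and the helper name top_k_similar are made up for illustration and are not part of the original data. Calling fit_transform on the corpus is used here as a shorthand for fit followed by transform.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy corpus standing in for the product descriptions.
words = [
    "red cotton t-shirt",
    "blue cotton t-shirt",
    "stainless steel kitchen knife",
    "wooden cutting board for the kitchen",
]

# Fit once on the corpus: this fixes the vocabulary and the IDF weights.
vectorizer = TfidfVectorizer()
datasetTFIDF = vectorizer.fit_transform(words)


def top_k_similar(query, k=10):
    """Return (index, similarity) pairs for the k corpus rows closest to query."""
    # transform() reuses the fitted vocabulary, so the result has the same
    # number of columns as datasetTFIDF; words not seen during fit are ignored.
    queryTFIDF = vectorizer.transform([query])
    sims = cosine_similarity(queryTFIDF, datasetTFIDF).flatten()
    best = sims.argsort()[::-1][:k]
    return [(int(i), float(sims[i])) for i in best]


results = top_k_similar("green cotton t-shirt", k=2)
print(results)
```

Because the query is only transformed, never fitted, the corpus matrix does not have to be recomputed for each new product.
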
Mohamed Oun
  • Good catch. I tested with a large corpus (1MM words) and the query took less than 1.58s. – Flavio Jul 01 '19 at 11:53
  • Does the final list passed to TfidfVectorizer contain only lemmatized words? Can we use a custom preprocessing function, build the corpus as a list of lemmatized words, and pass that to TfidfVectorizer? Will it give me a vector for each sentence? – loving_guy Jul 05 '20 at 07:29
5

I think the variable name words is ambiguous. I advise you to rename words to corpus.

In practice, you first put all your documents in the corpus variable, and then you compute the cosine similarity.

Here is an example:

tf_idf.py:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
     'This is the first document.',
     'This is the second second document.',
     'And the third one.',
     'Is this the first document?',
]

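vectorizer = TfidfVectorizer()
# Learn the vocabulary and IDF weights, and build the document-term matrix.
tfidf = vectorizer.fit_transform(corpus)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out().
words = vectorizer.get_feature_names_out().tolist()
# Pairwise cosine similarity between all documents in the corpus.
similarity_matrix = cosine_similarity(tfidf)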

Run it in your IPython console:

In [1]: run tf_idf.py

In [2]: words
Out[2]: ['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']

In [3]: tfidf.toarray()
Out[3]: 
array([[ 0.        ,  0.43877674,  0.54197657,  0.43877674,  0.        ,
         0.        ,  0.35872874,  0.        ,  0.43877674],
       [ 0.        ,  0.27230147,  0.        ,  0.27230147,  0.        ,
         0.85322574,  0.22262429,  0.        ,  0.27230147],
       [ 0.55280532,  0.        ,  0.        ,  0.        ,  0.55280532,
         0.        ,  0.28847675,  0.55280532,  0.        ],
       [ 0.        ,  0.43877674,  0.54197657,  0.43877674,  0.        ,
         0.        ,  0.35872874,  0.        ,  0.43877674]])

In [4]: similarity_matrix
Out[4]: 
array([[ 1.        ,  0.43830038,  0.1034849 ,  1.        ],
       [ 0.43830038,  1.        ,  0.06422193,  0.43830038],
       [ 0.1034849 ,  0.06422193,  1.        ,  0.1034849 ],
       [ 1.        ,  0.43830038,  0.1034849 ,  1.        ]])

Notes:

  • tfidf is a scipy.sparse.csr_matrix; toarray() converts it to a numpy.ndarray (this is costly and is used here only to make the content easy to inspect).
  • similarity_matrix is a symmetric matrix.
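
Both notes can be checked directly. The following is a small sketch on the same four-document corpus; scipy.sparse.issparse and numpy.allclose are additions used only to illustrate the points and are not part of the example above.

```python
import numpy as np
from scipy.sparse import issparse
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]

tfidf = TfidfVectorizer().fit_transform(corpus)
similarity_matrix = cosine_similarity(tfidf)

# The TF-IDF matrix is stored sparsely: only non-zero entries are kept.
is_sparse = issparse(tfidf)
stored = tfidf.nnz                       # number of stored (non-zero) values
total = tfidf.shape[0] * tfidf.shape[1]  # size of the equivalent dense array

# cosine_similarity accepts the sparse matrix as-is, so toarray() is not
# needed for the computation itself, only for inspecting the values.
is_symmetric = np.allclose(similarity_matrix, similarity_matrix.T)

print(is_sparse, stored, total, is_symmetric)
```

On a matrix the size of the one in the question (69258x22024), the gap between stored and total values is what makes toarray() expensive.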

You can do:

import numpy as np
print(np.triu(similarity_matrix, k=1))

This gives:

array([[ 0.        ,  0.43830038,  0.1034849 ,  1.        ],
       [ 0.        ,  0.        ,  0.06422193,  0.43830038],
       [ 0.        ,  0.        ,  0.        ,  0.1034849 ],
       [ 0.        ,  0.        ,  0.        ,  0.        ]]) 

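This keeps only the interesting similarities: each pair of documents appears once, and the diagonal (each document compared with itself) is zeroed out.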
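
If a ranked list of pairs is more useful than a matrix, one possible follow-up is to pull out the upper-triangle entries together with their indices. This is only a sketch: np.triu_indices and the variable names rows, cols, scores, and pairs are additions for illustration, not something the example above uses.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]

similarity_matrix = cosine_similarity(TfidfVectorizer().fit_transform(corpus))

# Row/column indices of the entries strictly above the diagonal (k=1),
# i.e. each unordered pair of documents exactly once.
rows, cols = np.triu_indices(similarity_matrix.shape[0], k=1)
scores = similarity_matrix[rows, cols]

# Sort the pairs from most to least similar.
order = np.argsort(scores)[::-1]
pairs = [(int(rows[i]), int(cols[i]), float(scores[i])) for i in order]

for i, j, score in pairs:
    print(f"documents {i} and {j}: {score:.4f}")
```
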

See:

http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html
http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction

glegoux
  • Thanks, but I want to do it without adding the new query/document to the dataset. – Mohamed Oun Jul 01 '17 at 16:53
  • I want to point out that this approach is incorrect, since this gives you train/test bleed if you'd apply it to a real task, because you fit the tf-idf vectorizer on the test data. The accepted answer is correct in that it only calls transform on the test data. – amdex Jul 08 '19 at 07:00