I think the name of the words variable is ambiguous. I advise you to rename words to corpus.
In fact, you first put all your documents in the corpus variable, and then you compute the cosine similarity.
Here is an example:
tf_idf.py:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
corpus = [
'This is the first document.',
'This is the second second document.',
'And the third one.',
'Is this the first document?',
]
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus)
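# get_feature_names() was removed in scikit-learn 1.2; on versions older than 1.0, call get_feature_names() instead
words = vectorizer.get_feature_names_out().tolist()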
similarity_matrix = cosine_similarity(tfidf)
Run it in your IPython console:
In [1]: run tf_idf.py
In [2]: words
Out[2]: ['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
In [3]: tfidf.toarray()
Out[3]:
array([[ 0. , 0.43877674, 0.54197657, 0.43877674, 0. ,
0. , 0.35872874, 0. , 0.43877674],
[ 0. , 0.27230147, 0. , 0.27230147, 0. ,
0.85322574, 0.22262429, 0. , 0.27230147],
[ 0.55280532, 0. , 0. , 0. , 0.55280532,
0. , 0.28847675, 0.55280532, 0. ],
[ 0. , 0.43877674, 0.54197657, 0.43877674, 0. ,
0. , 0.35872874, 0. , 0.43877674]])
In [4]: similarity_matrix
Out[4]:
array([[ 1. , 0.43830038, 0.1034849 , 1. ],
[ 0.43830038, 1. , 0.06422193, 0.43830038],
[ 0.1034849 , 0.06422193, 1. , 0.1034849 ],
[ 1. , 0.43830038, 0.1034849 , 1. ]])
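As a sanity check, the similarity between documents 0 and 1 can be recomputed by hand from the two tfidf rows printed above. This is only a minimal sketch in plain Python: the rows are copied from the output, and cosine is a helper name made up for the illustration, not a scikit-learn function. The result should be close to the 0.4383 shown in similarity_matrix.

```python
import math

# Rows 0 and 1 of tfidf.toarray(), copied from the output above.
doc0 = [0.0, 0.43877674, 0.54197657, 0.43877674, 0.0,
        0.0, 0.35872874, 0.0, 0.43877674]
doc1 = [0.0, 0.27230147, 0.0, 0.27230147, 0.0,
        0.85322574, 0.22262429, 0.0, 0.27230147]


def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


sim_01 = cosine(doc0, doc1)
print(round(sim_01, 4))
```

Because TfidfVectorizer L2-normalizes each row by default, both norms are about 1 here, so the dot product alone already gives nearly the same value.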
Note:
- tfidf is a scipy.sparse.csr.csr_matrix; toarray() converts it to a numpy.ndarray (but this is costly, it is only used here to view the content easily).
- similarity_matrix is a symmetric matrix.
You can do:
import numpy as np
print(np.triu(similarity_matrix, k=1))
This gives:
array([[ 0. , 0.43830038, 0.1034849 , 1. ],
[ 0. , 0. , 0.06422193, 0.43830038],
[ 0. , 0. , 0. , 0.1034849 ],
[ 0. , 0. , 0. , 0. ]])
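This shows only the interesting similarities: each pair of documents appears once, and the diagonal (each document compared with itself) is zeroed out.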
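If a ranked list of pairs is more convenient than a matrix, one possible approach is to take the indices above the diagonal with np.triu_indices and sort the corresponding scores. The sketch below hard-codes the similarity matrix from the output above so that it runs on its own; ranked_pairs is just an illustrative name.

```python
import numpy as np

# Similarity matrix copied from the output above (4 documents).
similarity_matrix = np.array([
    [1.0, 0.43830038, 0.1034849, 1.0],
    [0.43830038, 1.0, 0.06422193, 0.43830038],
    [0.1034849, 0.06422193, 1.0, 0.1034849],
    [1.0, 0.43830038, 0.1034849, 1.0],
])

# Row and column indices of the entries strictly above the diagonal.
rows, cols = np.triu_indices(similarity_matrix.shape[0], k=1)
scores = similarity_matrix[rows, cols]

# Sort from most to least similar; a stable sort keeps ties in index order.
order = np.argsort(-scores, kind="stable")
ranked_pairs = [(int(rows[i]), int(cols[i]), float(scores[i])) for i in order]

for i, j, score in ranked_pairs:
    print(f"documents {i} and {j}: {score:.4f}")
```

The first pair listed is documents 0 and 3, which contain the same words and therefore have a similarity of 1.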
See:
http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html
http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction