I am trying to produce a cosine similarity matrix from text descriptions of apps. The script below first reads a CSV data file (I can provide it if needed) with two columns: one holding the app category (two categories in this file) and the other holding the tokenized, stemmed description of each app. For each category, the script then builds a TF-IDF matrix and attempts to compute a cosine similarity matrix from it.
I updated 64-bit Anaconda for Windows yesterday to make sure I have the latest versions of Python, numpy, scipy, and scikit-learn.
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import os

print('reading file into pandas')
data = pd.read_csv(os.path.join('inputfile.csv'))
cats = np.unique(data['category'])

for i in cats:
    print()
    print('prepping', i)
    d2 = data[data.category == i]
    descStem = d2.descStem.tolist()

    print('vectorizing', i)
    tfidf_vectorizer = TfidfVectorizer(ngram_range=(1,1), min_df=2, stop_words='english')
    tfidf_matrix = tfidf_vectorizer.fit_transform(descStem)
    print(tfidf_matrix.shape)

    print('calculating cosine sim', i)
    cosOrig = cosine_similarity(tfidf_matrix, tfidf_matrix)
The script works fine for the smaller category, comics, where tfidf_matrix.shape = (3119, 8217). For the larger category, education, where tfidf_matrix.shape = (90327, 62863), I get the error message below. This matrix has more than 2^32 elements (90327 × 62863 ≈ 5.7 × 10^9).
Traceback (most recent call last):
  File "<ipython-input-1-4b2586ddeca4>", line 1, in <module>
    runfile('Z:/rangus/gplay/marcello/data/similarity/error/cosSimByCatScrapeError.py', wdir='Z:/rangus/gplay/marcello/data/similarity/error')
  File "F:\u0137777\Continuum\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 866, in runfile
    execfile(filename, namespace)
  File "F:\u0137777\Continuum\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 102, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)
  File "Z:/rangus/gplay/marcello/data/similarity/error/cosSimByCatScrapeError.py", line 23, in <module>
    cosOrig = cosine_similarity(tfidf_matrix, tfidf_matrix)
  File "F:\u0137777\Continuum\Anaconda3\lib\site-packages\sklearn\metrics\pairwise.py", line 918, in cosine_similarity
    K = safe_sparse_dot(X_normalized, Y_normalized.T, dense_output=dense_output)
  File "F:\u0137777\Continuum\Anaconda3\lib\site-packages\sklearn\utils\extmath.py", line 186, in safe_sparse_dot
    ret = ret.toarray()
  File "F:\u0137777\Continuum\Anaconda3\lib\site-packages\scipy\sparse\compressed.py", line 920, in toarray
    return self.tocoo(copy=False).toarray(order=order, out=out)
  File "F:\u0137777\Continuum\Anaconda3\lib\site-packages\scipy\sparse\coo.py", line 258, in toarray
    B.ravel('A'), fortran)
ValueError: could not convert integer scalar
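To put the sizes involved in perspective, here is a rough sketch of the arithmetic. The shape is the one reported above; the 8 bytes per cell figure assumes a dense float64 result, which is an assumption and not something the traceback itself confirms.

```python
# Shape of the education TF-IDF matrix, as reported above.
n_docs, n_terms = 90327, 62863

tfidf_cells = n_docs * n_terms  # cells in the TF-IDF matrix
sim_cells = n_docs * n_docs     # cells in the full docs-by-docs similarity matrix

# Compare both counts against 2^32.
print(tfidf_cells, tfidf_cells > 2**32)
print(sim_cells, sim_cells > 2**32)

# Memory for a dense similarity matrix, ASSUMING float64 (8 bytes per cell).
sim_gib = sim_cells * 8 / 2**30
print(round(sim_gib, 1), "GiB")
```

Both counts exceed 2^32, and a dense float64 similarity matrix for this one category would need roughly 60 GiB.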
I can avoid the error by converting the TF-IDF matrix to a dense array first, as in the code below, but a dense matrix is a massive memory hog and I need to run this script on 40+ categories.
print('vectorizing', i)
tfidf_vectorizer = TfidfVectorizer(ngram_range=(1,1), min_df=2, stop_words='english')
tfidf_matrix = tfidf_vectorizer.fit_transform(descStem)
# Convert the sparse TF-IDF matrix to a dense array (the memory-hungry step)
tfidf_matrixD = tfidf_matrix.toarray()
print('calculating cosine sim', i)
cosOrig = cosine_similarity(tfidf_matrixD, tfidf_matrixD)
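For context, one direction that might sidestep both the dense conversion and a single huge result is to compute the similarities in row blocks and keep each block sparse. The following is only a minimal sketch under assumptions, not something verified on the data in question: `blockwise_cosine`, `block_rows`, and `threshold` are hypothetical names introduced here, and the random matrix is a small stand-in for `tfidf_matrix`. It relies on the `dense_output=False` option of `cosine_similarity`, which returns a sparse result when both inputs are sparse.

```python
import numpy as np
from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity


def blockwise_cosine(matrix, block_rows=1000, threshold=0.0):
    """Hypothetical helper: cosine similarity in row blocks, sparse output.

    Entries at or below `threshold` are dropped from each block, so raising
    the threshold trades completeness for memory.
    """
    matrix = sparse.csr_matrix(matrix)
    blocks = []
    for start in range(0, matrix.shape[0], block_rows):
        stop = min(start + block_rows, matrix.shape[0])
        # dense_output=False keeps the block sparse when both inputs are sparse.
        block = cosine_similarity(matrix[start:stop], matrix, dense_output=False)
        block = sparse.csr_matrix(block)
        # Zero out small similarities, then remove them from the sparse storage.
        block.data = np.where(block.data > threshold, block.data, 0.0)
        block.eliminate_zeros()
        blocks.append(block)
    return sparse.vstack(blocks, format="csr")


# Tiny random stand-in for tfidf_matrix (NOT the real data).
demo = sparse.random(50, 20, density=0.2, format="csr", random_state=0)
sims = blockwise_cosine(demo, block_rows=8)
print(sims.shape, sims.nnz)
```

Whether this actually helps depends on how sparse the similarities are: if most pairs of descriptions share at least one term, the stacked result may still be very large unless `threshold` is raised or each block is written to disk instead of being kept in memory.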
This is the closest similar issue I could find on StackOverflow, but I couldn't figure out how it would help in my situation.