
I am trying to produce a cosine similarity matrix from text descriptions of apps. The script below first reads a CSV data file (I can provide the file if needed) with two columns: one holding the app category (there are two categories) and the other holding tokenized, stemmed descriptions for a number of apps in each category. The script then builds a tf-idf matrix for each category and attempts to compute a cosine similarity matrix from it.

I updated 64-bit Anaconda for Windows yesterday to make sure I have the latest versions of Python, numpy, scipy, and scikit-learn.

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import os

print ('reading file into pandas')
data = pd.read_csv(os.path.join('inputfile.csv'))
cats = np.unique(data['category'])

for i in cats:
    print ()
    print ('prepping', i)
    d2 = data[data.category == i]
    descStem = d2.descStem.tolist()

    print ('vectorizing', i)
    tfidf_vectorizer = TfidfVectorizer(ngram_range=(1,1), min_df=2, stop_words='english')
    tfidf_matrix = tfidf_vectorizer.fit_transform(descStem)
    print (tfidf_matrix.shape)

    print ('calculating cosine sim', i)
    cosOrig = cosine_similarity(tfidf_matrix, tfidf_matrix)

The script works fine for the smaller category, comics, where tfidf_matrix.shape = (3119, 8217). However, I receive the error message below for the larger category, education, where tfidf_matrix.shape = (90327, 62863). That matrix has more than 2^32 cells (90327 × 62863 ≈ 5.7 × 10^9).

Traceback (most recent call last):

File "<ipython-input-1-4b2586ddeca4>", line 1, in <module>

runfile('Z:/rangus/gplay/marcello/data/similarity/error/cosSimByCatScrapeError.py', wdir='Z:/rangus/gplay/marcello/data/similarity/error')

File "F:\u0137777\Continuum\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 866, in runfile
execfile(filename, namespace)

File "F:\u0137777\Continuum\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 102, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)

File "Z:/rangus/gplay/marcello/data/similarity/error/cosSimByCatScrapeError.py", line 23, in <module>
cosOrig = cosine_similarity(tfidf_matrix, tfidf_matrix)

File "F:\u0137777\Continuum\Anaconda3\lib\site-packages\sklearn\metrics\pairwise.py", line 918, in cosine_similarity
K = safe_sparse_dot(X_normalized, Y_normalized.T, dense_output=dense_output)

File "F:\u0137777\Continuum\Anaconda3\lib\site-packages\sklearn\utils\extmath.py", line 186, in safe_sparse_dot
ret = ret.toarray()

File "F:\u0137777\Continuum\Anaconda3\lib\site-packages\scipy\sparse\compressed.py", line 920, in toarray
return self.tocoo(copy=False).toarray(order=order, out=out)

File "F:\u0137777\Continuum\Anaconda3\lib\site-packages\scipy\sparse\coo.py", line 258, in toarray
B.ravel('A'), fortran)

ValueError: could not convert integer scalar
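For reference, the sizes involved can be checked with plain arithmetic. This is only a sketch based on the two shapes quoted above; the variable names are made up for illustration, and it does not show where inside scipy the conversion fails.

```python
# Shape reported for the education category.
n_docs, n_terms = 90327, 62863

# Cell counts if each matrix were stored densely.
tfidf_cells = n_docs * n_terms   # tf-idf matrix: documents x terms
sim_cells = n_docs * n_docs      # similarity matrix: documents x documents

print("tf-idf cells:    ", tfidf_cells, "exceeds 2**32:", tfidf_cells > 2**32)
print("similarity cells:", sim_cells, "exceeds 2**32:", sim_cells > 2**32)

# Approximate memory for a dense float64 similarity matrix, in GiB.
sim_gib = sim_cells * 8 / 2**30
print("dense similarity matrix: about", round(sim_gib, 1), "GiB")
```

Both counts are past 2^32, and the traceback shows the failure happening in toarray() on the product, that is, when the documents-by-documents result is converted to a dense array.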

I can overcome this error by running the code below, but using a dense matrix is a massive memory hog and I need to run this script on 40+ categories.

print ('vectorizing', i)
tfidf_vectorizer = TfidfVectorizer(ngram_range=(1,1), min_df=2, stop_words='english')
tfidf_matrix = tfidf_vectorizer.fit_transform(descStem)
tfidf_matrixD = tfidf_matrix.toarray()

print ('calculating cosine sim', i)
cosOrig = cosine_similarity(tfidf_matrixD, tfidf_matrixD)
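One variation I have not verified on the full data is to ask cosine_similarity for a sparse result through its dense_output parameter, which skips the toarray() call seen in the traceback. The sketch below uses a small random matrix as a stand-in for tfidf_matrix (the data and the names X, sim_sparse, and sim_dense are assumptions for illustration only).

```python
import numpy as np
from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity

# Toy stand-in for tfidf_matrix: 200 "documents", 50 "terms", mostly zeros.
rng = np.random.default_rng(0)
dense = rng.random((200, 50))
dense[dense < 0.95] = 0.0
X = sparse.csr_matrix(dense)

# With sparse input and dense_output=False the result stays a scipy
# sparse matrix, so no dense documents-by-documents array is allocated.
sim_sparse = cosine_similarity(X, dense_output=False)

# Default behaviour for comparison: a dense numpy array.
sim_dense = cosine_similarity(X)

print(sparse.issparse(sim_sparse), sim_sparse.shape)
print(np.allclose(sim_sparse.toarray(), sim_dense))
```

Whether the sparse result for 90327 documents fits in memory depends on how many pairs of descriptions share at least one term, which I do not know for this data.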

This is the closest similar issue I could find on StackOverflow, but I couldn't figure out how it would help in my situation.
