
I'm currently working with texts, and my main goal is to calculate a measure of similarity between 30 000 texts. I'm following this tutorial:

Creating a document-term matrix:

In [1]: import numpy as np  # a conventional alias
In [2]: from sklearn.feature_extraction.text import CountVectorizer
In [3]: filenames = ['data/austen-brontë/Austen_Emma.txt',
...:              'data/austen-brontë/Austen_Pride.txt',
...:              'data/austen-brontë/Austen_Sense.txt',
...:              'data/austen-brontë/CBronte_Jane.txt',
...:              'data/austen-brontë/CBronte_Professor.txt',
...:              'data/austen-brontë/CBronte_Villette.txt']
...: 

In [4]: vectorizer = CountVectorizer(input='filename')


In [5]: dtm = vectorizer.fit_transform(filenames)  # a sparse matrix

In [6]: vocab = vectorizer.get_feature_names()  # a list



In [7]: type(dtm)
Out[7]: scipy.sparse.csr.csr_matrix

In [8]: dtm = dtm.toarray()  # convert to a regular array

In [9]: vocab = np.array(vocab)
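
An aside on In [8]: converting with toarray() materializes every zero of the sparse matrix, which is harmless for six novels but costly at scale. A rough, self-contained way to see the difference in footprint, using scipy.sparse.random as a hypothetical stand-in for a real dtm:

import numpy as np
from scipy.sparse import random as sparse_random

# Hypothetical stand-in for a dtm: 6 documents over a 50 000-term
# vocabulary, with about 1% of the entries non-zero, in CSR format.
sparse_dtm = sparse_random(6, 50000, density=0.01, format='csr')
dense_dtm = sparse_dtm.toarray()

sparse_bytes = (sparse_dtm.data.nbytes + sparse_dtm.indices.nbytes
                + sparse_dtm.indptr.nbytes)
print(sparse_bytes, dense_dtm.nbytes)  # the dense copy is far larger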

Comparing texts:

Since we want a measure of distance that takes the lengths of the novels into account, we can calculate the cosine similarity.

In [24]: from sklearn.metrics.pairwise import cosine_similarity

In [25]: dist = 1 - cosine_similarity(dtm)

In [26]: np.round(dist, 2)
Out[26]: 
array([[-0.  ,  0.02,  0.03,  0.05,  0.06,  0.05],
       [ 0.02,  0.  ,  0.02,  0.05,  0.04,  0.04],
       [ 0.03,  0.02,  0.  ,  0.06,  0.05,  0.05],
       [ 0.05,  0.05,  0.06,  0.  ,  0.02,  0.01],
       [ 0.06,  0.04,  0.05,  0.02, -0.  ,  0.01],
       [ 0.05,  0.04,  0.05,  0.01,  0.01, -0.  ]])
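
For a single pair of rows, these numbers are just the textbook definition of cosine distance, 1 - a·b / (||a|| ||b||). A minimal NumPy sketch (the function name is illustrative):

import numpy as np

def cosine_distance(a, b):
    # 1 minus the cosine of the angle between two count vectors
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# e.g. cosine_distance(dtm[0], dtm[1]) should match dist[0, 1] above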

The final result:

[image: the resulting pairwise similarity/distance values between the six novels]

As mentioned above, my goal is to calculate a measure of similarity between 30 000 texts. When I run the code above at that scale, it takes too much time and eventually fails with a MemoryError. My question is: is there a better way to calculate cosine similarity between a huge number of texts? How do you cope with the time and MemoryError problems?
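
A minimal sketch of one possible direction (an assumption on my part, not something from the tutorial): sklearn's cosine_similarity accepts the sparse matrix directly, so the .toarray() conversion can be skipped entirely. Toy strings stand in for the real files:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["first toy document", "second toy document", "a third text"]
vectorizer = CountVectorizer()           # default input='content'
dtm = vectorizer.fit_transform(docs)     # stays a scipy sparse matrix
dist = 1 - cosine_similarity(dtm)        # no .toarray() needed

Note that the output itself is still a dense n-by-n array, so for 30 000 texts the result alone is about 30 000² × 8 bytes ≈ 7 GB, which is why avoiding the full matrix comes up in the comments.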

  • Things I notice without going into much careful analysis of the code or documentation: dtm seems to begin life as a sparse matrix, which might occupy less space than it would after being converted to a 'regular array'. – Bill Bell Sep 22 '16 at 20:29
  • If you need only 2-digit precision, could you arrange to make calculations resulting in signed two-digit integers and then define the numpy array to use a smaller data type? – Bill Bell Sep 22 '16 at 20:39
  • Defining dist seems unnecessary since it's such a simple function of cosine similarity. – Bill Bell Sep 22 '16 at 20:39
  • Is it really necessary to produce the full matrix when it is symmetrical? – Bill Bell Sep 22 '16 at 20:40
  • @BillBell, thank you for your time! My main goal is: in the end I want to have a similarity measure between each pair of texts, as shown in the image. Will arranging the results as two-digit integers speed up the process? Regarding the full matrix, are there any other solutions besides making a matrix? – PineapplePizza Sep 23 '16 at 02:04
  • After I went on to other things I realised that I was commenting mainly about space considerations. I was thinking about likely source of your 'Memory Error'. Later it occurred to me to suggest that you sprinkle some time measurements in your code. (Discussions about how to do that are available in various places.) I'll think about how to solve the problem without matrices when I get a minute. – Bill Bell Sep 23 '16 at 17:51
  • I should have said this long before now: This is a lot of numbers. – Bill Bell Sep 23 '16 at 19:36
  • @BillBell Thank you for your comments. Yes, indeed. The only thing I'm thinking is that computing the measure one pair at a time, when it is needed, is probably the best solution )) – PineapplePizza Sep 26 '16 at 03:32
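
Pulling the suggestions from these comments together (keep dtm sparse, use a smaller data type, avoid materializing the full symmetric matrix), here is one possible sketch that processes the rows in blocks and keeps only each text's k most similar neighbours; the block size, k, and the helper name are all hypothetical:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def top_neighbours(dtm, k=10, block=1000):
    # For each row of a sparse dtm, the indices of its k most similar rows.
    n = dtm.shape[0]
    result = np.empty((n, k), dtype=np.int32)
    for start in range(0, n, block):
        stop = min(start + block, n)
        # One block of rows against everything: (stop - start) x n floats,
        # downcast to float32 to halve the memory per block.
        sims = cosine_similarity(dtm[start:stop], dtm).astype(np.float32)
        for i, row in enumerate(sims):
            row[start + i] = -1.0                    # skip self-similarity
            result[start + i] = np.argsort(row)[-k:][::-1]
    return result

Only one block of the similarity matrix exists in memory at a time, at the price of computing both halves of the symmetric matrix; exploiting the symmetry as well would roughly halve the work.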

0 Answers