
I'm trying to build a similarity matrix for a huge dataset. Its dimensions would be 60,000 x 60,000, which cannot be stored even in 25 GB of RAM, so instead I want to compute the similarity scores one row at a time, with shape 1 x 60,000, i.e. the similarity of one article against all the rest. For now I'm using CountVectorizer and then computing cosine similarity. Example code below.

# First 20 values in the DataFrame

Then I created a CountVectorizer() object and fitted the data with it:

from sklearn.feature_extraction.text import CountVectorizer

count_vector = CountVectorizer()
vec = count_vector.fit_transform(df['final'][:20])  # first 20 articles only
vec.shape

This gives a sparse matrix of shape (20, 448) with 535 stored elements:

<20x448 sparse matrix of type '<class 'numpy.int64'>'
    with 535 stored elements in Compressed Sparse Row format>

Then we compute the cosine similarity of the vectors:

from sklearn.metrics.pairwise import cosine_similarity

cosine_sim = cosine_similarity(vec, vec)
cosine_sim.shape

The shape of our similarity matrix is 20 x 20.

This isn't a problem for a small dataset, but the full dataset has 60,000 articles, so the resulting matrix would be 60,000 x 60,000, which cannot be loaded into memory all at once.
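
For scale, a quick back-of-the-envelope check of the dense float64 case:

n = 60_000
bytes_needed = n * n * 8        # 8 bytes per float64 entry
print(bytes_needed / 1024**3)   # ~26.8 GiB, more than the 25 GB available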

Please help me find a way to get a similarity matrix of shape 1 x 60,000, giving the similarity of one element against all the rest rather than every element against every other element, so that I can process the data piece by piece.
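
For illustration, this is the kind of call I have in mind — a minimal sketch assuming vec has been fitted on the full corpus (df['final'] here stands for all 60,000 articles):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

count_vector = CountVectorizer()
vec = count_vector.fit_transform(df['final'])  # all 60,000 articles

# Similarity of article i against every article: shape (1, 60000)
i = 0
row_sim = cosine_similarity(vec[i], vec)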

Also, I'm aware of the iterative (one row at a time) method, but I'm looking for a better approach that can handle all of the 1 x 60,000 rows in one go, as in the sketch below.
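
One thing I've come across is scikit-learn's pairwise_distances_chunked, which computes the matrix block by block so that the full 60,000 x 60,000 result never has to sit in RAM at once; the sketch below is how I imagine using it, and the working_memory value is just an assumed setting:

from sklearn.metrics import pairwise_distances_chunked

# Yields horizontal slices of the distance matrix, each sized to fit
# in roughly working_memory MiB, instead of the full 60,000 x 60,000 array.
for dist_chunk in pairwise_distances_chunked(vec, metric='cosine', working_memory=1024):
    sim_chunk = 1.0 - dist_chunk  # cosine similarity = 1 - cosine distance
    # process sim_chunk here (e.g. keep the top-k matches per row) before the next block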

Yaboku