I'm trying to build a similarity matrix for a huge dataset. Its dimensions would be 60,000 x 60,000, which cannot be stored even in 25 GB of RAM, so instead I want to compute the similarity scores one row at a time, with shape 1 x 60,000: the similarity of one article against all the rest. For now I'm using CountVectorizer and then computing cosine similarity. Example code is below.
Then I create a CountVectorizer() object and fit the first 20 values of the DataFrame:

from sklearn.feature_extraction.text import CountVectorizer

count_vector = CountVectorizer()
vec = count_vector.fit_transform(df['final'][:20])
vec.shape
This gives a sparse matrix of shape (20, 448) with 535 stored elements:

<20x448 sparse matrix of type '<class 'numpy.int64'>'
	with 535 stored elements in Compressed Sparse Row format>
Then I compute the cosine similarity:

from sklearn.metrics.pairwise import cosine_similarity

cosine_sim = cosine_similarity(vec, vec)
cosine_sim.shape
The shape of our similarity matrix is 20 x 20.
This isn't a problem for a small dataset, but the full dataset has 60,000 elements, so the resulting matrix would be of shape 60,000 x 60,000, which cannot be loaded into memory all at once.
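As a quick check of the memory claim: a dense 60,000 x 60,000 matrix of float64 values needs roughly 28.8 GB, which is indeed more than 25 GB of RAM:

```python
n = 60_000
bytes_needed = n * n * 8      # float64 is 8 bytes per entry
print(bytes_needed / 1e9)     # 28.8 (GB)
```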
Please help me find a way to compute a similarity matrix of shape 1 x 60,000, giving the similarity of one element against all the rest rather than every element against every other one, so that I can process the data and use it.
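To make the desired output concrete, here is a minimal sketch (using a stand-in three-document corpus in place of `df['final']`) of getting a single 1 x N row: slicing the sparse matrix with `i:i+1` before calling `cosine_similarity` keeps the query as a 2-D matrix, so the full N x N matrix is never materialized.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["the cat sat", "the dog sat", "a bird flew"]  # stand-in corpus
vec = CountVectorizer().fit_transform(docs)

# Similarity of document 0 against all documents: shape (1, n_docs),
# never the full n_docs x n_docs matrix.
row = cosine_similarity(vec[0:1], vec)
print(row.shape)  # (1, 3)
```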
Also, I'm aware of the iterative method, but I'm looking for a better approach that can produce the whole 1 x 60,000 row in one go.
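To illustrate what I mean by "in one go", one possible direction (a sketch with a random stand-in matrix, not the real data) is to L2-normalize all rows of the sparse count matrix once; cosine similarity then reduces to a single sparse dot product per query row:

```python
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.preprocessing import normalize

# Random stand-in for the 60,000-document sparse count matrix.
X = sparse_random(100, 50, density=0.1, format="csr", random_state=0)

# L2-normalize every row once; after this, cosine similarity is
# just a dot product of the normalized rows.
X_norm = normalize(X, norm="l2", axis=1)

# One sparse matrix product yields the whole 1 x N similarity row.
row = np.asarray((X_norm[0] @ X_norm.T).todense())
print(row.shape)  # (1, 100)
```

The same product with a block of rows (`X_norm[i:j] @ X_norm.T`) computes many rows at once, which lets the chunk size be tuned to the available RAM.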