I was getting a MemoryError when I passed 100,000 documents to the pairwise_distances function. To work around it, I computed the distance matrix in chunks, stored each chunk as a sparse matrix, and stacked the chunks at the end. However, AgglomerativeClustering does not accept sparse matrix input. What can I do as an alternative?
####################
# SPARSE SIMILARITY MATRIX
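from scipy import sparse
from sklearn.metrics import pairwise_distances

parts = []
chunk_size = len(embeddings) // 10 + 1
for i in range(10):
    print(i)
    # Distances from this chunk of rows to every document
    M = pairwise_distances(embeddings[i*chunk_size : (i+1)*chunk_size], embeddings, metric='cosine', n_jobs=-1)
    # Zero out distances above 0.35 so they are not stored in the sparse chunk
    M[M > 0.35] = 0
    M = sparse.csr_array(M)
    print(M.data.nbytes)
    parts.append(M)
    print('--------')
sm_matrix = sparse.vstack(parts)
del parts
print(sm_matrix.data.nbytes)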
####################
print(sm_matrix)
####################
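# Agglomerative Clustering
from sklearn.cluster import AgglomerativeClustering

# Note: `affinity` was renamed to `metric` in scikit-learn 1.2 and removed in 1.4
clustering = AgglomerativeClustering(
    n_clusters=None,
    distance_threshold=1 - similarity_threshold,
    affinity='precomputed',
    linkage=linkage,
)
clustering.fit(sm_matrix)
if verbose:
    print('Clusters are calculated')
# clusters created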
####################
TypeError: A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array.
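
One possible alternative, sketched under assumptions and not a drop-in replacement: if single linkage is acceptable, cutting at a distance threshold gives (up to ties exactly at the threshold) the same groups as the connected components of the graph that links every pair within that threshold, and `scipy.sparse.csgraph.connected_components` accepts sparse input. The helper name `threshold_clusters` and its parameters are hypothetical. A normalised dot product stands in for `pairwise_distances(..., metric='cosine')`, and the sparse matrix stores similarities (1 - distance) instead of distances, so that close pairs, including exact duplicates, remain non-zero entries. This does not reproduce average or complete linkage.

```python
import numpy as np
from scipy import sparse
from scipy.sparse.csgraph import connected_components


def threshold_clusters(embeddings, max_distance=0.35, n_chunks=10):
    """Hypothetical helper: label rows so that rows joined by a chain of
    pairs with cosine distance <= max_distance share a label.

    Assumes 0 <= max_distance < 1, so every kept similarity is positive.
    """
    X = np.asarray(embeddings, dtype=np.float64)
    # L2-normalise rows so a dot product equals cosine similarity.
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    X = X / np.where(norms == 0, 1.0, norms)

    n = X.shape[0]
    chunk_size = n // n_chunks + 1
    parts = []
    for start in range(0, n, chunk_size):
        # Dense block: similarities of this chunk against all rows.
        S = X[start:start + chunk_size] @ X.T
        # Drop far pairs; near pairs stay as non-zero graph edges.
        S[S < 1.0 - max_distance] = 0.0
        parts.append(sparse.csr_matrix(S))

    graph = sparse.vstack(parts, format="csr")
    # Each connected component of the thresholded graph is one cluster.
    n_clusters, labels = connected_components(graph, directed=False)
    return n_clusters, labels
```

With `max_distance=0.35` this mirrors the 0.35 cutoff used above. Each dense block still holds `chunk_size * n` floats, so a larger `n_chunks` lowers the peak memory, and the stacked graph grows with the number of pairs kept.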