
I was getting a MemoryError when I passed 100,000 documents to the pairwise_distances function. To work around it, I computed the distance matrix piece by piece and combined the pieces into a sparse matrix at the end. But AgglomerativeClustering does not accept sparse matrix input. What can I do as an alternative?

    ####################
    # SPARSE DISTANCE MATRIX (computed chunk by chunk)
    from scipy import sparse
    from sklearn.metrics import pairwise_distances

    parts = []
    chunk_size = int(len(embeddings) // 10) + 1
    for i in range(10):
        print(i)
        # cosine distances between one chunk of rows and all embeddings
        M = pairwise_distances(embeddings[i*chunk_size : (i+1)*chunk_size], embeddings, metric='cosine', n_jobs=-1)
        # drop entries above the threshold so the chunk becomes mostly zeros
        M[M > 0.35] = 0
        M = sparse.csr_array(M)
        print(M.data.nbytes)
        parts.append(M)
        print('--------')

    sm_matrix = sparse.vstack(parts)
    del parts
    print(sm_matrix.data.nbytes)
    ####################
    
    print(sm_matrix)
    
    ####################
    # Agglomerative Clustering
    from sklearn.cluster import AgglomerativeClustering

    clustering = AgglomerativeClustering(n_clusters=None,
                                         distance_threshold=1 - similarity_threshold,
                                         affinity='precomputed',
                                         linkage=linkage)
    clustering.fit(sm_matrix)
    if verbose:
        print('Clusters are calculated')
    # clusters created
    ####################

The fit call fails with:

    TypeError: A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array.
  • Have you tried adding `.toarray()` at the right place in the line where you get the error? – Caridorc Feb 05 '23 at 14:43
  • @Caridorc When I tried that, I got "MemoryError: Unable to allocate 85.7 GiB for an array with shape (107278, 107278) and data type float64". – Salihcan Feb 05 '23 at 15:07
  • 1
    Ok, so one chance is making the array float32 to halve the memory requirements, but you will still need around 50 GB of RAM, another chance is just using a smaller amount of data. – Caridorc Feb 05 '23 at 20:11

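A minimal sketch of the float32 workaround suggested in the last comment, assuming `sm_matrix`, `similarity_threshold`, and `linkage` are defined as in the question. Densifying at float32 still needs 107278 × 107278 × 4 bytes ≈ 43 GiB (half of the 85.7 GiB float64 array from the MemoryError), so it only helps on a machine with enough RAM:

    import numpy as np
    from sklearn.cluster import AgglomerativeClustering

    # Cast the sparse matrix to float32 first (cheap: only the stored values
    # are converted), then densify; the large allocation happens in toarray().
    dense = sm_matrix.astype(np.float32).toarray()

    clustering = AgglomerativeClustering(n_clusters=None,
                                         distance_threshold=1 - similarity_threshold,
                                         affinity='precomputed',  # metric='precomputed' in scikit-learn >= 1.4
                                         linkage=linkage)
    clustering.fit(dense)

Note that scikit-learn may upcast the input back to float64 internally depending on the version, so the savings from float32 are not guaranteed to survive the whole fit.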