
I am trying to cluster sentence embeddings based on the GloVe model from text2vec. I generated the embeddings with the GloVe model as follows (I create the iterator, vocabulary, etc. in the standard way):

# create the document-term matrix
dtm <- create_dtm(it, vectorizer)

# keep only the terms that actually have a word vector
common_terms <- intersect(colnames(dtm), rownames(word_vectors))

# L1-normalise rows so each document's term weights sum to 1
dtm_averaged <- text2vec::normalize(dtm[, common_terms], "l1")

# weighted average of word vectors = sentence embeddings
sentence_vectors <- dtm_averaged %*% word_vectors[common_terms, ]

The resulting object is of class dgeMatrix which, as I understand it, is essentially equivalent to a base matrix. However, dgeMatrix isn't accepted by many downstream tools, so I would like to convert it. The object is about 6 GB, though, and I run into problems converting it to a data frame or even writing it to a text file for further processing.
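One workaround I've been considering for the file route is converting and writing in row chunks, so the full base-matrix copy never has to exist in memory at once. A rough sketch (data.table::fwrite and the chunk size are my own assumptions, not something I've benchmarked):

# write the dgeMatrix to CSV in row chunks so only one chunk
# is ever materialised as a base matrix in memory
library(data.table)

chunk_size <- 100000L                          # arbitrary; tune to RAM
n <- nrow(sentence_vectors)

for (s in seq(1L, n, by = chunk_size)) {
  e <- min(s + chunk_size - 1L, n)
  # as.matrix() on a row slice copies only that slice
  chunk <- as.matrix(sentence_vectors[s:e, , drop = FALSE])
  fwrite(as.data.table(chunk), "sentence_vectors.csv",
         append = s > 1L, col.names = s == 1L)
}

Would something along these lines be sensible, or is there a better-supported path?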

Ideally, I'd use this matrix in Spark for further analysis, such as k-means clustering (see the sparklyr sketch after the options below). My question is: what would be the best strategy for using the matrix in downstream tasks?

a) convert it to a base matrix or data frame,
b) write the matrix to a file, or
c) something completely different?
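To make the Spark goal concrete, this is roughly what I'd hope to end up with (a sketch, assuming sparklyr and the chunked CSV from above; k = 10 is a placeholder):

# read the exported vectors into Spark and cluster there
library(sparklyr)

sc <- spark_connect(master = "local")          # or the cluster master URL

sv_tbl <- spark_read_csv(sc, name = "sentence_vectors",
                         path = "sentence_vectors.csv")

# formula interface: use every column as a feature
kmeans_model <- ml_kmeans(sv_tbl, ~ ., k = 10)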

I run the models on Google Cloud on a machine with 32 GB of RAM and 28 CPUs.

Thanks for your help.
