
I am trying to cluster sentence embeddings based on the GloVe model from text2vec. I generated the embeddings with the GloVe model as follows (I create the iterator, vocabulary, etc. in the standard way):

# create the document-term matrix
dtm <- create_dtm(it, vectorizer)

# keep only the terms that actually have a word vector
common_terms <- intersect(colnames(dtm), rownames(word_vectors))

# L1-normalise rows so each document's term weights sum to 1
dtm_averaged <- text2vec::normalize(dtm[, common_terms], "l1")

# weighted average of word vectors = sentence embeddings
sentence_vectors <- dtm_averaged %*% word_vectors[common_terms, ]

The resulting object is of class dgeMatrix which, as I understand it, is essentially equivalent to a base matrix. However, dgeMatrix isn't accepted by many downstream tools, so I would like to convert it. The object is about 6 GB, though, and I run into problems converting it to a data frame or even writing it to a text file for further processing.
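One workaround I've been considering for the file route is converting and writing in row chunks, so the full base-matrix copy never has to exist in memory at once. A rough sketch (data.table::fwrite and the chunk size are my own assumptions, not something I've benchmarked):

# write the dgeMatrix to CSV in row chunks so only one chunk
# is ever materialised as a base matrix in memory
library(data.table)

chunk_size <- 100000L                          # arbitrary; tune to RAM
n <- nrow(sentence_vectors)

for (s in seq(1L, n, by = chunk_size)) {
  e <- min(s + chunk_size - 1L, n)
  # as.matrix() on a row slice copies only that slice
  chunk <- as.matrix(sentence_vectors[s:e, , drop = FALSE])
  fwrite(as.data.table(chunk), "sentence_vectors.csv",
         append = s > 1L, col.names = s == 1L)
}

Would something along these lines be sensible, or is there a better-supported path?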

Ideally, I'd use this matrix in Spark for further analysis, such as k-means clustering (see the sparklyr sketch after the options below). My question is: what would be the best strategy for using the matrix in downstream tasks?

a) convert it to a base matrix or data frame,
b) write the matrix to a file, or
c) something completely different?
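To make the Spark goal concrete, this is roughly what I'd hope to end up with (a sketch, assuming sparklyr and the chunked CSV from above; k = 10 is a placeholder):

# read the exported vectors into Spark and cluster there
library(sparklyr)

sc <- spark_connect(master = "local")          # or the cluster master URL

sv_tbl <- spark_read_csv(sc, name = "sentence_vectors",
                         path = "sentence_vectors.csv")

# formula interface: use every column as a feature
kmeans_model <- ml_kmeans(sv_tbl, ~ ., k = 10)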

I run the models on Google Cloud on a machine with 32 GB of RAM and 28 CPUs.

Thanks for your help.
