I am trying to create two Document Term Matrices like so:
title_train <- DocumentTermMatrix(title_corpus_train, control = list(dictionary = title_dict))
title_test <- DocumentTermMatrix(title_corpus_test, control = list(dictionary = title_dict))
The first has 75k rows and the second has 25k rows. Since creating them, my memory usage is nearly maxed out at 7 GB.
I would like to work with these matrices in a faster and more memory-efficient way...
I have considered two possibilities, but I am not sure how to implement either of them:
- Convert the DocumentTermMatrix to a data.table (see the first sketch below)
- Use the ff package to store them as an ffdf (see the second sketch below)
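For what it's worth, below is roughly what I had in mind for each option. Both are untested sketches, and dtm_to_dt and dtm_to_ffdf are just helper names I made up.

The data.table idea assumes the DocumentTermMatrix still exposes its slam::simple_triplet_matrix slots (i, j, v), so the non-zero counts can be pulled into a long-format table without ever densifying the matrix:

library(tm)
library(data.table)

dtm_to_dt <- function(dtm) {
  # i/j/v are the row index, column index and count of each non-zero entry
  data.table(
    doc   = Docs(dtm)[dtm$i],
    term  = Terms(dtm)[dtm$j],
    count = dtm$v
  )
}

title_train_dt <- dtm_to_dt(title_train)
setkey(title_train_dt, doc, term)  # index for fast lookups by document/term

The ff idea densifies the matrix one chunk of documents at a time and appends each chunk to an on-disk ffdf, so the full dense matrix never has to sit in RAM at once (I believe ffdfappend comes from the ffbase package):

library(ff)
library(ffbase)  # for ffdfappend()

dtm_to_ffdf <- function(dtm, chunk_size = 5000) {
  out <- NULL
  for (start in seq(1, nrow(dtm), by = chunk_size)) {
    end   <- min(start + chunk_size - 1, nrow(dtm))
    # densify only this slice of documents before writing it to disk
    chunk <- as.data.frame(as.matrix(dtm[start:end, ]))
    out   <- if (is.null(out)) as.ffdf(chunk) else ffdfappend(out, chunk)
  }
  out
}

title_train_ff <- dtm_to_ffdf(title_train)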
Can anyone provide any guidance or examples of how I can speed up working with a large DocumentTermMatrix?
Ultimately, I would like to be able to support over 3 million rows (I am currently only using a subset of 100k).