
I am trying to create two Document Term Matrices like so:

title_train <- DocumentTermMatrix(title_corpus_train, control = list(dictionary = title_dict))
title_test <- DocumentTermMatrix(title_corpus_test, control = list(dictionary = title_dict))

The first one has 75k rows and the second has 25k rows. Since creating these, my memory usage has been nearly maxed out at 7 GB.

I would like to work with these matrices in a faster, more memory-efficient way...

I have considered two possibilities, but I am not sure how to implement either of them:

  • Convert the DocumentTermMatrix to a data.table (see the sketch after this list)
  • Use the ff package to store them as ffdf objects
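
For the first option, here is a minimal sketch, assuming the DTM is stored in tm's usual slam::simple_triplet_matrix form (the helper name `dtm_to_dt` is illustrative):

library(tm)
library(data.table)

# A DocumentTermMatrix is a simple_triplet_matrix under the hood, so its
# non-zero entries are already available as (i, j, v) triplets; this keeps
# only the non-zero cells instead of densifying the whole matrix.
dtm_to_dt <- function(dtm) {
  data.table(
    doc  = dtm$dimnames$Docs[dtm$i],   # document identifier
    term = dtm$dimnames$Terms[dtm$j],  # term string
    freq = dtm$v                       # frequency of the term in that document
  )
}

title_train_dt <- dtm_to_dt(title_train)

Note that this produces a long (doc, term, freq) table, which is compact but not the wide document-by-term shape that `naiveBayes()` expects; if even this conversion is tight on memory, it can be run over a block of documents at a time and the pieces stacked with `rbindlist()`.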

Can anyone provide any guidance or examples of how I can speed up working with a large DocumentTermMatrix?

Ultimately, I would like to be able to support over 3 million rows (I am currently only using a 100k-row subset).

user1477388
  • Do you still need to use any functions that require DocumentTermMatrix objects? What else do you plan to do with the data? Normally DocumentTermMatrices are already sparse matrices, so they shouldn't be unnecessarily large. Is there a reasonable way to reduce the number of words you are tracking? Have you stemmed the content or required a minimum/maximum number of characters? – MrFlick Jul 10 '14 at 17:38
  • @MrFlick I believe I have already reduced it as much as possible. I plan to pass the DTM into the following functions from the e1071 package: `naiveBayes()` and `predict()`. – user1477388 Jul 10 '14 at 17:44
  • @MrFlick Is it possible to convert the DTM to a data.table and then pass that to my `naiveBayes()` and `predict()` methods? – user1477388 Jul 10 '14 at 18:55
  • According to the documentation `naiveBayes()` takes a data.frame, so it will be compatible with a data.table as well. The same should go for `predict()` when used on the result. I don't know whether `ff` inherits from data.frame or not. – MrFlick Jul 10 '14 at 19:00
  • @MrFlick I see, that sounds promising. So, the only piece left then is to convert the DTM to a data.table. Any idea if that's possible/ how to do it? – user1477388 Jul 10 '14 at 19:01
  • @MrFlick I have found this http://stackoverflow.com/questions/12029177/sparse-matrix-to-a-data-frame-in-r which shows me how to convert a sparse matrix (which is what a DTM is, I believe) to a data.frame. Is this what I should do in this case? Just looking for a way to stop my RAM from getting maxed out. – user1477388 Jul 10 '14 at 20:12
  • It depends on the shape you want your data to be in. You will take a RAM hit on conversion because all the data in the DTM needs to be copied into the data.table so it will exist in both places. If you just want tuples of (doc, word, freq) as would be produced by that method, you can do fewer documents at a time since the procedure is essentially cumulative. – MrFlick Jul 10 '14 at 20:32
  • @MrFlick I am not sure what the answer to this question is. Is there no way to increase the efficiency of my application, which **1)** receives CSV data into a data.table, **2)** converts the data into a corpus and DTM, and **3)** runs naiveBayes() and predict() and returns the results? With over 100k rows my machine runs out of memory. I am looking for a way to apply either data.table or ff or some other method to resolve that. – user1477388 Jul 11 '14 at 13:14
  • @MrFlick Thank you. You were correct. I had set the threshold in `title_dict <- c(findFreqTerms(title_dtm_train, 5))` far too low. I have set it higher and it seems to have improved performance (both memory use and the model). I still need to support a dataset several times larger than 100k rows, so I am still not sure how to do that... – user1477388 Jul 11 '14 at 16:55
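
A minimal sketch of the pipeline discussed in the comments above, using a higher `findFreqTerms()` threshold; the value 50 and the object `title_labels` (the factor of training classes) are illustrative assumptions:

library(tm)
library(e1071)

# Keep only terms that occur reasonably often in the training data;
# raising the threshold shrinks the dictionary and both matrices.
title_dict <- findFreqTerms(title_dtm_train, lowfreq = 50)
# (rebuild title_train and title_test with this dictionary, as in the question)

# Densify the dictionary-restricted DTMs; this is only workable once the
# dictionary, and therefore the number of columns, is small.
train_df <- as.data.frame(as.matrix(title_train))
test_df  <- as.data.frame(as.matrix(title_test))

# naiveBayes() accepts a feature data.frame plus a class factor,
# and predict() scores the held-out documents.
model <- naiveBayes(train_df, title_labels)
preds <- predict(model, test_df)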

0 Answers