How to remove the empty documents from the Document Term Matrix in R

Question

I have empty documents in my document term matrix and need to remove them. This is the code that I used to build the DocumentTermMatrix:

 tweets_dtm_tfidf <- DocumentTermMatrix(tweet_corpus, control = list(weighting = weightTfIdf))

And this is the warning message that I am getting:

Warning message:
In weighting(x) :
  empty document(s): 823 3795 4265 7252 7295 7425 8240 8433 9303 12160 12278 14465 15166 15485 15933 20775 21666 21807 26131 27039 34035 34050 34101

I tried removing these empty documents using this code:

rowTotals <- apply(tweets_dtm_tfidf, 1, sum)
dtm_tfidf <- tweets_dtm_tfidf[rowTotals > 0, ]

Here is the error that I get when trying to remove them:

> rowTotals <- apply(tweets_dtm_tfidf , 1, sum)

Error: cannot allocate vector of size 6.8 Gb

Any idea on how to go about this? Thanks for any suggestions in advance.

score 0 · Accepted Answer · answered May 06 '18 at 11:59

Calling apply with sum converts your sparse matrix into a dense matrix, and that eats up a lot of memory when the sparse matrix is big.

The apply function is not needed here; there are dedicated functions for sparse matrices. Since the dtm is a simple_triplet_matrix, you can use row_sums from slam.

The following should work.

rowTotals <- slam::row_sums(tweets_dtm_tfidf)
dtm_tfidf <- tweets_dtm_tfidf[rowTotals > 0, ]

But remember that anything you do to get your data out of the sparse matrix format might result in a very large, memory-hungry object if you have a lot of words. You might want to use removeSparseTerms before moving on.
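
As a minimal, self-contained sketch of the same idea on a toy corpus: the three example strings and the names docs, toy_dtm, row_totals and toy_nonempty are made up for illustration, and it assumes the tm and slam packages are installed.

```r
library(tm)

# Toy corpus (illustrative data only); the second document is empty.
docs <- c("sparse matrices save memory", "", "weighting terms with idf")
corpus <- VCorpus(VectorSource(docs))

# Same construction as in the question; tm warns about the empty document here.
toy_dtm <- DocumentTermMatrix(corpus, control = list(weighting = weightTfIdf))

# row_sums works on the sparse triplet representation,
# so no dense matrix is allocated.
row_totals <- slam::row_sums(toy_dtm)

# Keep only the documents whose weights sum to more than zero.
toy_nonempty <- toy_dtm[row_totals > 0, ]

nDocs(toy_dtm)       # number of documents before filtering
nDocs(toy_nonempty)  # number of documents after filtering
```

The same two lines (row_sums, then the logical row subset) should apply unchanged to the full tweets_dtm_tfidf object.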

phiver

Just one more question: I tried removing the sparse terms using removeSparseTerms as you mentioned. Here is my code: tweets_dtm_tfidf = removeSparseTerms(tweets_dtm_tfidf, 0.99). How do I determine the value? Here I have used 0.99. I understand what that value stands for: it removes the terms that don't appear more than a certain fraction of the time across all the documents combined. I am just not sure how to determine the right value. – AdeeThyag May 06 '18 at 12:45
You are welcome. And please accept answers on questions you asked. That helps in building your reputation. – phiver May 06 '18 at 12:50
Ah, the correct value is a bit of a guess. It depends on how many words you want to keep. The higher the number, the more words you keep. So 0.9 keeps every term that occurs in more than 10% of the documents. – phiver May 06 '18 at 12:50
Thanks so much for that. Just accepted the answer. Didn't know about it. :) – AdeeThyag May 06 '18 at 13:01
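
On the threshold question in the comments, one rough way to pick a value is to try a few candidates and see how many terms survive each one. The sketch below does that on a made-up five-document corpus; the strings, the candidate thresholds and the names toy_dtm, thresholds and terms_kept are illustrative assumptions, not part of the original question.

```r
library(tm)

# Toy corpus (illustrative data only): "apple" occurs in all 5 documents,
# "banana" in 2, and each remaining term in 1.
docs <- c("apple banana cherry", "apple banana", "apple date",
          "apple", "apple elderberry")
toy_dtm <- DocumentTermMatrix(VCorpus(VectorSource(docs)))

# Candidate thresholds to compare (arbitrary example values).
thresholds <- c(0.5, 0.7, 0.9)

# Count the terms that survive removeSparseTerms at each threshold.
terms_kept <- sapply(thresholds, function(s) {
  nTerms(removeSparseTerms(toy_dtm, s))
})

# Higher thresholds keep more terms.
data.frame(sparse = thresholds, terms_kept = terms_kept)
```

On a real corpus the same loop gives a quick feel for how fast the vocabulary shrinks as the threshold is lowered, which can help settle on a value that keeps the matrix manageable.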
