This is my first post on Stack Overflow, and I'm a student, so I hope someone will be able to help me. I am trying to do sentiment analysis in RStudio and am running into a vector size error.

When I try to convert a DocumentTermMatrix to a regular matrix using this code:

dtm2 <- as.matrix(dtm)

I get the error "Error: cannot allocate vector of size 38.3 Gb".

The dtm is the DocumentTermMatrix of a corpus that has 178884 elements and takes up 26.7 Mb; the text consists of reviews.

I have read the other answers on Stack Overflow, but I do not understand them, and they probably don't apply to my issue. How can I increase the maximum size of a vector in RStudio? I am using RStudio version 1.2.5001 on a 64-bit Windows machine.

Is there any other information I should provide?

Phil
Magnon

1 Answer

A document-term matrix (the generic object, not the class in the tm package) is typically a count of the number of times a word (in a column) occurs in a document (in a row). The columns are the vocabulary for the entire corpus. In the OP's case, the corpus has 178,884 elements; if those are the documents, the matrix has 178,884 rows and one column for every unique term in the corpus. Since any single review uses only a small fraction of that vocabulary, each row contains lots of zeros: the matrix is very sparse.

In most text analysis packages, the DTM is represented using a special kind of matrix that does not allocate memory for those zeros. For example, in the tm package the DTM is actually a simple_triplet_matrix from the slam package and in quanteda it is a sparseMatrix from the Matrix package.

Now, if we take either special class of matrix and convert it to a base R matrix using as.matrix(), R has to allocate memory for every cell in the matrix, including those zeros. This is why a DTM that is only a few dozen MB as a simple_triplet_matrix or a sparseMatrix will be a few dozen GB as a base R "dense" matrix.
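To make the size jump concrete, here is a rough sketch of the arithmetic. It assumes the 178,884 corpus elements from the question are documents (rows) and that each cell is stored as an 8-byte double. The vocabulary size is not given in the question, so `n_terms` below is a hypothetical value picked only because it is consistent with the 38.3 Gb in the error message; `dense_size_gib` is a helper name made up for this example.

```r
# Estimate the RAM a dense (base R) matrix would need, in units of 2^30 bytes.
# bytes_per_cell: 8 for double storage, 4 for integer storage.
dense_size_gib <- function(n_rows, n_cols, bytes_per_cell = 8) {
  as.numeric(n_rows) * as.numeric(n_cols) * bytes_per_cell / 2^30
}

n_docs  <- 178884  # corpus elements reported in the question, read as documents
n_terms <- 28735   # HYPOTHETICAL vocabulary size, not stated in the question

estimate <- dense_size_gib(n_docs, n_terms)
print(round(estimate, 1))
```

With a real DTM you could plug in dim(dtm)[1] and dim(dtm)[2] instead of the placeholder numbers to see what as.matrix() would ask for before calling it.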

The solution, then, is to use tools from the slam or the Matrix package on the DTM instead of calling as.matrix(). These packages provide all the typical matrix operations, but in versions specific to sparse matrix classes. For example, the equivalent of base R's rowSums() is row_sums() in the slam package or rowSums() in the Matrix package.
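As a minimal sketch of working on the sparse form directly, the example below builds a tiny sparse matrix with the Matrix package and takes row and column sums without ever creating a dense copy. The toy vectors `i`, `j` and `v` stand in for the `dtm$i`, `dtm$j` and `dtm$v` components of a tm DocumentTermMatrix; the document and term names are invented for illustration.

```r
library(Matrix)  # ships with R as a recommended package

# Toy triplets standing in for dtm$i, dtm$j, dtm$v (illustrative data only)
i <- c(1, 1, 2, 3, 3)  # document (row) index of each non-zero cell
j <- c(1, 3, 2, 1, 4)  # term (column) index of each non-zero cell
v <- c(2, 1, 1, 3, 1)  # count stored in that cell

sparse_dtm <- sparseMatrix(
  i = i, j = j, x = v,
  dims = c(3, 4),
  dimnames = list(paste0("doc", 1:3), c("good", "bad", "great", "poor"))
)

# Sparse-aware sums: only the non-zero cells are touched
doc_lengths <- rowSums(sparse_dtm)  # total term count per document
term_totals <- colSums(sparse_dtm)  # corpus-wide count per term

print(doc_lengths)
print(term_totals)
```

If you stay with the tm object itself, slam::row_sums(dtm) and slam::col_sums(dtm) should give the same kind of totals directly on the simple_triplet_matrix.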

Dustin Stoltz