I get an error when I run this chunk of code:

text_lda <- LDA(text_dtm, k = 2, method = "VEM", control = NULL)

The error is "Each row of the input matrix needs to contain at least one non-zero entry".

I then tried to find the empty rows with these lines:

rowTotals <- apply(text_dtm, 1, sum)
empty.rows <- text_dtm[rowTotals == 0, ]$dimnames[1][[1]]

But that gave me a different error:

cannot allocate vector of size 3890.8 GB

This is the size of my DTM:

<<DocumentTermMatrix (documents: 1968850, terms: 265238)>>
Non-/sparse entries: 29766814/522184069486
Sparsity           : 100%
Maximal term length: 4000
Weighting          : term frequency (tf)
user20650
coding
  • The problem is that `apply` converts your sparse matrix to a dense matrix, hence the memory error. Check whether there is a sparse-matrix `rowSums` method you can use instead of `apply`. – user20650 Jan 17 '20 at 22:13
  • How did you generate the dtm? What package did you use? – captcoma Jan 18 '20 at 17:06
  • text_dtm <- DocumentTermMatrix(text_corpus_clean,control = list(tolower=TRUE,removePunctuation = TRUE, removeNumbers= TRUE,stopwords = TRUE,sparse=TRUE)) – coding Jan 19 '20 at 23:49
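
The densification point in the comments can be checked with a little arithmetic, and the sparse alternative can be tried on a toy matrix. This is only a sketch: it assumes a dense R matrix stores 8 bytes per cell and uses `slam::row_sums()` as one possible sparse row-sum method (slam is the package tm builds its DTM on; `toy` is a made-up matrix, not the data from the question).

```r
library(slam)

# A dense double matrix stores 8 bytes per cell, zeros included
n_docs  <- 1968850
n_terms <- 265238
dense_gib <- n_docs * n_terms * 8 / 2^30
round(dense_gib, 1)  # compare with the size in the allocation error

# Toy sparse matrix in triplet form; row 2 has no entries
toy <- simple_triplet_matrix(i = c(1, 1, 3),
                             j = c(1, 2, 2),
                             v = c(2, 1, 4),
                             nrow = 3, ncol = 2)

# Row totals computed on the triplets, without building a dense matrix
row_totals <- row_sums(toy)
empty_rows <- which(row_totals == 0)
```

On a tm `DocumentTermMatrix` the same call would be `slam::row_sums(text_dtm)`.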

2 Answers

Try this:

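# slam::row_sums() (an assumption here) sums the sparse matrix without densifying it
rowTotals <- slam::row_sums(text_dtm)
empty.rows <- text_dtm[rowTotals == 0, ]$dimnames[1][[1]]
corpus_new <- corpus[-as.numeric(empty.rows)]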

Or, if the DTM was generated with tm, keep only the rows that have at least one non-zero entry:

ui = unique(text_dtm$i)
text_dtm.new = text_dtm[ui,]
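
Why the `unique(text_dtm$i)` trick works: a tm DTM is stored in triplet form, and `$i` holds the row index of every non-zero entry, so any row that never appears in `$i` is empty. A minimal sketch on a made-up `slam` triplet matrix (`toy` is hypothetical; sorting the indices is an extra step added here so the original row order is kept):

```r
library(slam)

# Toy triplet matrix with 3 rows; row 2 never appears in i, so it is empty
toy <- simple_triplet_matrix(i = c(3, 1, 1),
                             j = c(2, 1, 2),
                             v = c(4, 2, 1),
                             nrow = 3, ncol = 2)

# Rows that hold at least one non-zero entry, in their original order
non_empty <- sort(unique(toy$i))

# Subsetting stays sparse; only the empty row is dropped
toy_new <- toy[non_empty, ]
dim(toy_new)
```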
captcoma
I’d recommend using the dgCMatrix class for your DTM. It ships with R as part of the widely used Matrix package, works with topicmodels::LDA and many other NLP packages (textmineR, text2vec, tidytext, etc.), and has methods that let you work with it as if it were a dense matrix.

library(tm)
library(topicmodels)
library(Matrix)

# grab a character vector of text. Your source may be different
text <- textmineR::nih_sample$ABSTRACT_TEXT

text_corpus <- SimpleCorpus(VectorSource(text))

text_dtm <- DocumentTermMatrix(text_corpus,
                               control = list(tolower=TRUE,
                                              removePunctuation = TRUE, 
                                              removeNumbers= TRUE,
                                              stopwords = TRUE,
                                              sparse=TRUE))

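# Convert tm's simple_triplet_matrix to a dgCMatrix by hand.
# (tidytext::cast_sparse() does not accept a DocumentTermMatrix;
# it only works on an object that follows from unnest_tokens().)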
text_dtm2 <- Matrix::sparseMatrix(i=text_dtm$i, 
                                  j=text_dtm$j,
                                  x=text_dtm$v, 
                                  dims=c(text_dtm$nrow, text_dtm$ncol), 
                                  dimnames = text_dtm$dimnames)

doc_lengths <- Matrix::rowSums(text_dtm2)

text_dtm3 <- text_dtm2[doc_lengths > 0, ]

text_lda <- LDA(text_dtm3,  k = 2, method = "VEM", control = NULL)
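
To see the filtering step in isolation, here is a small sketch on a made-up dgCMatrix (`toy` and its document and term names are hypothetical). It suggests that `Matrix::rowSums()` and row subsetting both work on the sparse representation and that the result keeps its sparse class:

```r
library(Matrix)

# Toy document-term matrix; "doc2" has no terms at all
toy <- sparseMatrix(i = c(1, 1, 3),
                    j = c(1, 2, 2),
                    x = c(2, 1, 4),
                    dims = c(3, 2),
                    dimnames = list(c("doc1", "doc2", "doc3"),
                                    c("alpha", "beta")))

# Document lengths via the sparse rowSums method
doc_lengths <- Matrix::rowSums(toy)

# Drop empty documents; drop = FALSE keeps a matrix even if one row remains
keep <- unname(doc_lengths > 0)
toy_nonempty <- toy[keep, , drop = FALSE]

class(toy_nonempty)
rownames(toy_nonempty)
```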
Tommy Jones
  • I get the following error from `text_dtm2 <- cast_sparse(text_dtm)`: "no applicable method for 'ungroup' applied to an object of class "c('DocumentTermMatrix', 'simple_triplet_matrix')" – coding Jan 21 '20 at 18:41
  • I ran `m <- Matrix::sparseMatrix(i=text_dtm$i, j=text_dtm$j, x=text_dtm$v, dims=c(text_dtm$nrow, text_dtm$ncol), dimnames = text_dtm$dimnames)` and then tried `text_dtm2 <- cast_sparse(text_dtm)`, but it is not working – coding Jan 21 '20 at 19:07
  • That of course should work. But can I ask you to share your code for making text_dtm? I'd like to see for myself why cast_sparse wouldn't work. Thanks. – Tommy Jones Jan 21 '20 at 19:23
  • Yes of course! text_dtm <- DocumentTermMatrix(text_corpus_clean,control = list(tolower=TRUE,removePunctuation = TRUE, removeNumbers= TRUE,stopwords = TRUE,sparse=TRUE)) – coding Jan 21 '20 at 19:40
  • Thank you. I’ll look tonight then update my answer so that it works. – Tommy Jones Jan 22 '20 at 01:23
  • Code above should work in its entirety now. Turns out cast_sparse only works for an object that follows from unnest_tokens. – Tommy Jones Jan 22 '20 at 09:43