Each row of the input matrix needs to contain at least one non-zero entry

Question

I get an error when I run this line of code:

text_lda <- LDA(text_dtm, k = 2, method = "VEM", control = NULL)

The error is: "Each row of the input matrix needs to contain at least one non-zero entry"

Then I tried to solve this with these lines

rowTotals <- apply(text_dtm, 1, sum)
empty.rows <- text_dtm[rowTotals == 0, ]$dimnames[1][[1]]

But then I got a different error:

cannot allocate vector of size 3890.8 GB

This is the size of my DTM:

<<DocumentTermMatrix (documents: 1968850, terms: 265238)>>
Non-/sparse entries: 29766814/522184069486
Sparsity           : 100%
Maximal term length: 4000
Weighting          : term frequency (tf)

The problem is that `apply` converts your sparse matrix to a dense matrix, hence the memory error. You could check whether there is a sparse-matrix `rowSums` method to use instead of `apply`. – user20650, Jan 17 '20 at 22:13
text_dtm <- DocumentTermMatrix(text_corpus_clean, control = list(tolower=TRUE, removePunctuation = TRUE, removeNumbers= TRUE, stopwords = TRUE, sparse=TRUE)) – coding, Jan 19 '20 at 23:49
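
The 3890.8 GB in the error message is consistent with the dense conversion described in the comment above. A rough check, assuming 8 bytes per cell (a double-precision matrix) and that R reports sizes in units of 1024^3 bytes; the variable names are only for illustration:

```r
# Dimensions and non-zero count taken from the DTM summary in the question
n_docs    <- 1968850
n_terms   <- 265238
n_nonzero <- 29766814

# A dense double matrix needs 8 bytes for every cell, zero or not
dense_bytes <- n_docs * n_terms * 8
dense_gb    <- dense_bytes / 1024^3
round(dense_gb, 1)

# Fraction of cells that actually hold a value in the sparse form
n_nonzero / (n_docs * n_terms)
```

Only a tiny fraction of the cells are non-zero, which is why any approach that stays on the sparse representation avoids the allocation.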

captcoma · Answer 1 · 2020-01-18T17:24:44.767

2

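Try this:

# rowTotals has to be computed without apply(); slam::row_sums() works on the sparse triplet form
rowTotals <- slam::row_sums(text_dtm)
empty.rows <- text_dtm[rowTotals == 0, ]$dimnames[1][[1]]
corpus_new <- corpus[-as.numeric(empty.rows)]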

Or use tm to generate the dtm and then:

ui = unique(text_dtm$i)
text_dtm.new = text_dtm[ui,]
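
Why `unique(text_dtm$i)` works: a tm DocumentTermMatrix is stored as a slam `simple_triplet_matrix`, which keeps only the stored entries as (i, j, v) triplets, so a document with no terms never appears in `i`. A minimal sketch on a toy matrix, assuming no explicit zeros are stored in `v`; `toy`, `keep`, and `toy_new` are illustrative names, and the `sort()` is an added precaution to keep documents in their original order:

```r
library(slam)

# Toy 4 x 3 matrix in triplet form; rows "d2" and "d4" have no entries
toy <- simple_triplet_matrix(
  i = c(3, 1, 1),
  j = c(2, 1, 3),
  v = c(5, 1, 2),
  nrow = 4, ncol = 3,
  dimnames = list(Docs = paste0("d", 1:4), Terms = c("a", "b", "c"))
)

# Row indices that occur among the stored entries, in document order
keep    <- sort(unique(toy$i))
toy_new <- toy[keep, ]

toy_new$dimnames[[1]]   # names of the documents that remain
row_sums(toy_new)       # per-document totals, computed on the sparse form
```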

edited Jan 18 '20 at 17:24

answered Jan 18 '20 at 17:15

captcoma


Tommy Jones · Answer 2 · 2020-01-22T09:42:39.333

0

I’d recommend using the dgCMatrix class for your DTM. It ships with R as part of the widely used Matrix package, works with topicmodels::LDA and many other NLP packages (textmineR, text2vec, tidytext, etc.), and has methods that let you work with it as if it were a dense matrix.

library(tm)
library(topicmodels)
library(Matrix)

# grab a character vector of text. Your source may be different
text <- textmineR::nih_sample$ABSTRACT_TEXT

text_corpus <- SimpleCorpus(VectorSource(text))

text_dtm <- DocumentTermMatrix(text_corpus,
                               control = list(tolower=TRUE,
                                              removePunctuation = TRUE, 
                                              removeNumbers= TRUE,
                                              stopwords = TRUE,
                                              sparse=TRUE))

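# Convert tm's simple_triplet_matrix to a Matrix sparse matrix.
# (tidytext::cast_sparse() fails on a DocumentTermMatrix; it only works on
# objects that follow from unnest_tokens().)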

text_dtm2 <- Matrix::sparseMatrix(i=text_dtm$i, 
                                  j=text_dtm$j,
                                  x=text_dtm$v, 
                                  dims=c(text_dtm$nrow, text_dtm$ncol), 
                                  dimnames = text_dtm$dimnames)

doc_lengths <- Matrix::rowSums(text_dtm2)

text_dtm3 <- text_dtm2[doc_lengths > 0, ]

text_lda <- LDA(text_dtm3,  k = 2, method = "VEM", control = NULL)
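
If you later need to map the topic model back to the original documents, it can help to record which rows were dropped by the `doc_lengths > 0` filter. A small sketch on a toy matrix, using only the Matrix package; `m`, `lens`, `dropped`, and `m_kept` are illustrative names, not part of the code above:

```r
library(Matrix)

# Toy 4 x 3 sparse matrix; documents "d2" and "d4" contain no terms
m <- sparseMatrix(
  i = c(1, 1, 3), j = c(1, 3, 2), x = c(1, 2, 5),
  dims = c(4, 3),
  dimnames = list(paste0("d", 1:4), c("a", "b", "c"))
)

lens    <- Matrix::rowSums(m)           # per-document totals, no dense copy
dropped <- rownames(m)[lens == 0]       # documents that will be filtered out
m_kept  <- m[lens > 0, , drop = FALSE]  # rows that are safe to pass to LDA

dropped
rownames(m_kept)
```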

edited Jan 22 '20 at 09:42

answered Jan 20 '20 at 15:58

Tommy Jones


I get the following error with text_dtm2 <- cast_sparse(text_dtm): "no applicable method for 'ungroup' applied to an object of class "c('DocumentTermMatrix', 'simple_triplet_matrix')" – coding Jan 21 '20 at 18:41
I ran m <- Matrix::sparseMatrix(i=text_dtm$i, j=text_dtm$j, x=text_dtm$v, dims=c(text_dtm$nrow, text_dtm$ncol), dimnames = text_dtm$dimnames) and then tried text_dtm2 <- cast_sparse(text_dtm) again, but it is still not working – coding Jan 21 '20 at 19:07
That of course should work. But can I ask you to share your code for making text_dtm? I'd like to see for myself why cast_sparse wouldn't work. Thanks. – Tommy Jones Jan 21 '20 at 19:23
Yes of course! text_dtm <- DocumentTermMatrix(text_corpus_clean,control = list(tolower=TRUE,removePunctuation = TRUE, removeNumbers= TRUE,stopwords = TRUE,sparse=TRUE)) – coding Jan 21 '20 at 19:40
Thank you. I’ll look tonight then update my answer so that it works. – Tommy Jones Jan 22 '20 at 01:23
Code above should work in its entirety now. Turns out cast_sparse only works for an object that follows from unnest_tokens. – Tommy Jones Jan 22 '20 at 09:43