Error in LDA(cdes, k = K, method = "Gibbs", control = list(verbose = 25L, : Each row of the input matrix needs to contain at least one non-zero entry

Question

I have a big dataset with almost 90 columns and about 200k observations. One of the columns contains descriptions, so it is text only. About 100 of those descriptions are NA.

I need to fit a topic model, so I tried the topic-model code by Pablo Barbera from GitHub.

OUTPUT

library(topicmodels)
library(quanteda)

des <- subset(finalMSI, !is.na(description), select=c(description))
corpus_des <- corpus(des$description)
df_des <- dfm(corpus_des, remove=stopwords("spanish"), verbose=TRUE,
              remove_punct=TRUE, remove_numbers=TRUE)
cdes <- dfm_trim(df_des, min_docfreq = 2)

# estimate LDA with K topics
K <- 20
lda <- LDA(cdes, k = K, method = "Gibbs", 
           control = list(verbose=25L, seed = 123, burnin = 100, iter = 500))

Error in LDA(cdes, k = K, method = "Gibbs", control = list(verbose = 25L, : Each row of the input matrix needs to contain at least one non-zero entry

Since my subset doesn't contain any NA, I don't understand this error message (it's my first time using this package).

Can you provide a sample of the dataset with `dput(DATA)`? If it is a matrix, you might be subsetting elements and not rows. — ktiu, Jun 03 '21 at 17:01
Try `na.omit` to remove the rows that after `subset`ting are NA. — Rui Barradas, Jun 03 '21 at 17:04
@ktiu It gives me class = "data.frame". Does that mean it's not a matrix? — katdataecon, Jun 03 '21 at 17:06
@theeconomista Please provide a reproducible example so we can better understand the issue. — ktiu, Jun 03 '21 at 17:11
@ktiu Can I do a dput(df$description), or do I have to do dput(df)? I mean, are the other columns also important? I have to ask my management first because the data is confidential. — katdataecon, Jun 03 '21 at 17:18
@theeconomista Whatever you can share is helpful, especially if it reproduces the error. Also note that the error is about non-zero entries, and removing NA values may leave zeroes. — ktiu, Jun 03 '21 at 17:23

score 1 · Accepted Answer · answered Jun 04 '21 at 06:53

It looks like some of your documents are empty, in the sense that they contain no counts of any feature.

You can remove them with:

# trim rare features first, then drop documents left with no tokens
cdes <- dfm_trim(df_des, min_docfreq = 2)
cdes <- dfm_subset(cdes, ntoken(cdes) > 0)
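To see why a document can end up empty even when its description is not NA, here is a minimal sketch on made-up data (the texts and document names are invented for illustration, and it assumes a recent quanteda where a dfm is built from tokens). A description that is an empty string, or that contains only terms removed by dfm_trim(), has zero tokens left, which is what LDA() complains about.

```r
library(quanteda)

# Toy documents (invented): d3 has only a rare term, d4 is "" rather than NA
txt <- c(d1 = "gato come pescado",
         d2 = "perro come carne",
         d3 = "zzz",
         d4 = "")

toks <- tokens(txt, remove_punct = TRUE, remove_numbers = TRUE)
x <- dfm(toks)

# Keep only features that occur in at least 2 documents (here only "come")
x <- dfm_trim(x, min_docfreq = 2)

# Token counts per document after trimming: d3 and d4 are now 0
print(ntoken(x))

# Drop the documents that have no tokens left
x_nonempty <- dfm_subset(x, ntoken(x) > 0)
print(docnames(x_nonempty))
```

Running sum(ntoken(cdes) == 0) on your own trimmed dfm should tell you how many documents are affected.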

Ken Benoit

Hi, do you know if there is something like dfm_subset(ntoken(cdes) > 0) for my core dataset? When I do try <- subset(finalMSI, !is.na(description)) ; try <- try %>% subset(ntoken(try) > 0), it doesn't work. Without it, I can't bind the columns of my df to the topics df I get from the code, because the row counts differ. – katdataecon Jun 08 '21 at 16:39
I mean, I merged by row number, but I'm not sure the result is accurate, so I'm also looking for other solutions in case the merge is wrong. – katdataecon Jun 08 '21 at 16:42
I'm glad I could guess the solution to your original question in the absence of a reproducible example. But to answer the above would definitely require more. Suggest you create a new question and supply enough data to reproduce (and solve) the problem. – Ken Benoit Jun 08 '21 at 16:43
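One possible way to avoid merging by row number, sketched on made-up data since no reproducible example was posted: give every document a name taken from a row identifier, and after dropping the empty documents use docnames() to pick the matching rows of the data frame. The id column and the toy descriptions below are assumptions, not part of the original data.

```r
library(quanteda)

# Toy stand-in for the real data frame; the id column is hypothetical
df <- data.frame(id = 1:5,
                 description = c("gato come pescado", NA, "perro come carne",
                                 "zzz", "gato come carne"),
                 stringsAsFactors = FALSE)

des <- subset(df, !is.na(description))

# Name each document after its row id so the link to the data frame survives
txt <- setNames(des$description, des$id)
x <- dfm(tokens(txt, remove_punct = TRUE, remove_numbers = TRUE))
x <- dfm_trim(x, min_docfreq = 2)
x <- dfm_subset(x, ntoken(x) > 0)

# Rows of the data frame that correspond to the documents kept in x, same order
kept <- des[match(docnames(x), as.character(des$id)), ]
print(kept)

# Assumption: a model fit on x returns one result per row of x, in this order,
# so something like the following could then be attached safely:
# kept$topic <- topics(lda)
```

Here kept has one row per document that survived trimming, so anything computed per document on x lines up with it by position.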
