
I have a big dataset of almost 90 columns and about 200k observations. One of the columns contains descriptions, so it is text only. However, about 100 of the descriptions are NA.

I tried Pablo Barbera's topic-model code from GitHub because I need to fit a topic model on this column.

OUTPUT

library(topicmodels)
library(quanteda)

des <- subset(finalMSI, !is.na(description), select=c(description))
corpus_des <- corpus(des$description)
df_des <- dfm(corpus_des, remove=stopwords("spanish"), verbose=TRUE,
              remove_punct=TRUE, remove_numbers=TRUE)
cdes <- dfm_trim(df_des, min_docfreq = 2)

# estimate LDA with K topics
K <- 20
lda <- LDA(cdes, k = K, method = "Gibbs", 
           control = list(verbose=25L, seed = 123, burnin = 100, iter = 500))

Error in LDA(cdes, k = K, method = "Gibbs", control = list(verbose = 25L, : Each row of the input matrix needs to contain at least one non-zero entry

As I don't have any NAs in my subset, I don't understand this error message (it's my first time using this package).

katdataecon
  • Can you provide a sample of the dataset with `dput(DATA)`? If it is a matrix, you might be subsetting elements and not rows. – ktiu Jun 03 '21 at 17:01
  • Try `na.omit` to remove the rows that after `subset`ting are NA. – Rui Barradas Jun 03 '21 at 17:04
  • @ktiu It gives me class = "data.frame". Does that mean it's not a matrix? – katdataecon Jun 03 '21 at 17:06
  • @RuiBarradas I have the same error message with na.omit – katdataecon Jun 03 '21 at 17:08
  • @theeconomista Please provide a reproducible example so we can better understand the issue. – ktiu Jun 03 '21 at 17:11
  • @ktiu Can I do `dput(df$description)`, or do I have to do `dput(df)`? I mean, are the other columns also important? I have to ask my managers first because of professional confidentiality. – katdataecon Jun 03 '21 at 17:18
  • @theeconomista Whatever you can share is helpful, especially if it reproduces the error. Also note that the error is about non-zero entries, and removing NA values may leave zeroes. – ktiu Jun 03 '21 at 17:23

1 Answer


It looks like some of your documents are empty, in the sense that they contain no counts of any feature.

You can remove them with:

# trim first, then drop documents left with no tokens
cdes <- dfm_trim(df_des, min_docfreq = 2)
cdes <- dfm_subset(cdes, ntoken(cdes) > 0)
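To see why a document can end up empty even though its description is not NA, here is a minimal sketch on made-up data. The texts and the object names `txt`, `toks`, `dfmat`, `trimmed` and `nonempty` are hypothetical, and the sketch assumes a quanteda version in which `dfm()` takes a tokens object. A description made up only of features that get removed (stopwords, punctuation, numbers, or words occurring in fewer than two documents) has a row of zeros after trimming.

```r
library(quanteda)

# Toy data (hypothetical): the second text consists of a single rare word
txt <- c(d1 = "casa grande casa", d2 = "xyz", d3 = "casa grande jardin")
toks <- tokens(txt, remove_punct = TRUE, remove_numbers = TRUE)
dfmat <- dfm(toks)

# Trimming drops every feature that appears in fewer than 2 documents ...
trimmed <- dfm_trim(dfmat, min_docfreq = 2)

# ... so a document made up only of such features now has zero tokens
ntoken(trimmed)

# Keep only the non-empty documents before passing the matrix to LDA()
nonempty <- dfm_subset(trimmed, ntoken(trimmed) > 0)
docnames(nonempty)
```

Running `ntoken()` on the trimmed dfm is also a quick way to check how many documents would be dropped.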
Ken Benoit
  • Hi, do you know if there is something like `dfm_subset(ntoken(cdes) > 0)` for my core dataset? When I do `try <- subset(finalMSI, !is.na(description)); try <- try %>% subset(ntoken(try) > 0)` it doesn't work, so I can't bind the columns of my df to the topics df I get from the code. – katdataecon Jun 08 '21 at 16:39
  • I mean, I merged by row number, but I'm not sure the result is accurate, so I'm also looking for other solutions in case the merge is wrong. – katdataecon Jun 08 '21 at 16:42
  • I'm glad I could guess the solution to your original question in the absence of a reproducible example. But to answer the above would definitely require more. Suggest you create a new question and supply enough data to reproduce (and solve) the problem. – Ken Benoit Jun 08 '21 at 16:43
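On the follow-up about lining the original rows up with the filtered documents, one possible approach (a sketch on made-up data, not a tested answer for the real dataset) is to give each document a stable identifier as its document name and then match on that name instead of on row position. The data frame `df`, its `id` column and the object `kept` are hypothetical stand-ins; `finalMSI` would need an equivalent unique key.

```r
library(quanteda)

# Hypothetical stand-in for finalMSI, with a unique id column
df <- data.frame(
  id = 1:4,
  description = c("casa grande", NA, "xyz", "casa grande jardin"),
  stringsAsFactors = FALSE
)

# Drop NA descriptions, then use the id as the document name
des <- subset(df, !is.na(description))
corp <- corpus(des$description, docnames = as.character(des$id))

# Same pipeline as above: trim, then drop documents with no tokens
dfmat <- dfm_trim(dfm(tokens(corp)), min_docfreq = 2)
dfmat <- dfm_subset(dfmat, ntoken(dfmat) > 0)

# Rows of the original data that survived, in the same order as the dfm
kept <- des[match(docnames(dfmat), as.character(des$id)), ]
kept
```

Because `kept` is built from `docnames(dfmat)`, its rows follow the order of the documents in the filtered dfm, so per-document results can be attached by name rather than by row number.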