
I am trying to run an LDA.

I have to convert the dfm to an appropriate format using this:

dtm <- convert(myDfm, to = "topicmodels")

However, after this conversion I lose 2-3 documents from my initial input, and I don't know why.

As a result, I cannot merge the topics with the initial data frame.

I thought I could use the dfm directly, but it is not an accepted format in lda(). This is how the dfm is built:

toks <- toks %>% tokens_wordstem()
myDfm <- dfm(toks, ngrams = 1)

Unfortunately I can't provide an example input, as it is around 30,000 rows. If I test it on a small example of five rows, the solution works fine.

Any suggestions?

Nathalie

1 Answer


The converted dfm is dropping the "documents" that are empty, which probably happened because of feature removal via frequency trimming or pattern matching (such as removing stopwords). LDA cannot handle an empty document, so by default the empty documents are removed from the LDA formats ("topicmodels", "stm", etc.).

As of v1.5, convert() has an argument called omit_empty, which defaults to TRUE. Set it to FALSE if you want to keep zero-feature documents.

library("quanteda")
## Package version: 1.5.1

txt <- c("one two three", "and or but", "four five")

dfmat <- tokens(txt) %>%
  tokens_remove(stopwords("en")) %>%
  dfm()

dfmat
## Document-feature matrix of: 3 documents, 5 features (66.7% sparse).
## 3 x 5 sparse Matrix of class "dfm"
##        features
## docs    one two three four five
##   text1   1   1     1    0    0
##   text2   0   0     0    0    0
##   text3   0   0     0    1    1

This is the difference that setting omit_empty = FALSE creates:

# with and without the empty documents
convert(dfmat, to = "topicmodels")
## <<DocumentTermMatrix (documents: 2, terms: 5)>>
## Non-/sparse entries: 5/5
## Sparsity           : 50%
## Maximal term length: 5
## Weighting          : term frequency (tf)
convert(dfmat, to = "topicmodels", omit_empty = FALSE)
## <<DocumentTermMatrix (documents: 3, terms: 5)>>
## Non-/sparse entries: 5/10
## Sparsity           : 67%
## Maximal term length: 5
## Weighting          : term frequency (tf)

Finally, if you want to subset the dfm to remove empty documents, simply use dfm_subset(). The second argument is coerced to a logical that will take the value of TRUE when ntoken(dfmat) > 0 and FALSE when 0.

# subset dfm to remove the empty documents
dfm_subset(dfmat, ntoken(dfmat))
## Document-feature matrix of: 2 documents, 5 features (50.0% sparse).
## 2 x 5 sparse Matrix of class "dfm"
##        features
## docs    one two three four five
##   text1   1   1     1    0    0
##   text3   0   0     0    1    1
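
To get the topics back into the original data frame, it may help to find out which documents ended up empty and then merge on document names rather than on row position. The following is only a sketch under stated assumptions: the data frame dat, its doc_id column, and the doc_topics table are hypothetical names, and the topic values are placeholders rather than output from a fitted model (with topicmodels, the named vector returned by topics() on the fitted model could take their place).

```r
library("quanteda")

# same toy corpus as above, with explicit document names
txt <- c(text1 = "one two three", text2 = "and or but", text3 = "four five")
dfmat <- dfm(tokens_remove(tokens(txt), stopwords("en")))

# names of the documents left with zero features
empty_docs <- docnames(dfmat)[ntoken(dfmat) == 0]
empty_docs

# hypothetical original data frame, keyed by document name
dat <- data.frame(doc_id = names(txt), text = unname(txt),
                  stringsAsFactors = FALSE)

# placeholder topic assignments for the non-empty documents only
# (assumption: in real use these would come from the fitted LDA model)
kept <- docnames(dfm_subset(dfmat, ntoken(dfmat) > 0))
doc_topics <- data.frame(doc_id = kept, topic = seq_along(kept),
                         stringsAsFactors = FALSE)

# left join: documents that were dropped before LDA get NA as their topic
merged <- merge(dat, doc_topics, by = "doc_id", all.x = TRUE)
merged
```

Merging by name keeps every row of the original data, so the 2-3 dropped documents show up with a missing topic instead of shifting the alignment.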
Ken Benoit