The converted dfm drops documents that have become empty, most likely because feature removal via frequency trimming or pattern matching (such as removing stopwords) stripped out all of their features. LDA cannot handle an empty document, so by default empty documents are removed when converting to the LDA formats ("topicmodels", "stm", etc.), as demonstrated below.
As of v1.5, convert() has an omit_empty = TRUE argument, which can be set to FALSE if you want to keep zero-feature documents.
library("quanteda")
## Package version: 1.5.1
txt <- c("one two three", "and or but", "four five")
# removing English stopwords empties the second document entirely
dfmat <- tokens(txt) %>%
  tokens_remove(stopwords("en")) %>%
  dfm()
dfmat
## Document-feature matrix of: 3 documents, 5 features (66.7% sparse).
## 3 x 5 sparse Matrix of class "dfm"
## features
## docs one two three four five
## text1 1 1 1 0 0
## text2 0 0 0 0 0
## text3 0 0 0 1 1
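You can confirm which document went empty by counting the features remaining in each one:

ntoken(dfmat)
## text1 text2 text3
##     3     0     2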
This is the difference that setting omit_empty = FALSE makes:
# with and without the empty documents
convert(dfmat, to = "topicmodels")
## <<DocumentTermMatrix (documents: 2, terms: 5)>>
## Non-/sparse entries: 5/5
## Sparsity : 50%
## Maximal term length: 5
## Weighting : term frequency (tf)
convert(dfmat, to = "topicmodels", omit_empty = FALSE)
## <<DocumentTermMatrix (documents: 3, terms: 5)>>
## Non-/sparse entries: 5/10
## Sparsity : 67%
## Maximal term length: 5
## Weighting : term frequency (tf)
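This default exists because an LDA fit will fail on the all-zero row. A minimal sketch, assuming the topicmodels package is installed; the call is wrapped in try() so the error is displayed rather than halting the session:

library("topicmodels")

# keep the empty document, then attempt a two-topic fit:
# LDA() should stop with an error, since text2 has no entries
dtm <- convert(dfmat, to = "topicmodels", omit_empty = FALSE)
try(LDA(dtm, k = 2))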
Finally, if you want to subset the dfm itself to remove the empty documents, simply use dfm_subset(). Its second argument is coerced to logical, taking the value TRUE when ntoken(dfmat) > 0 and FALSE when it is 0.
# subset dfm to remove the empty documents
dfm_subset(dfmat, ntoken(dfmat))
## Document-feature matrix of: 2 documents, 5 features (50.0% sparse).
## 2 x 5 sparse Matrix of class "dfm"
## features
## docs one two three four five
## text1 1 1 1 0 0
## text3 0 0 0 1 1
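That call relies on R coercing the token counts to logical; if you prefer to make the condition explicit, the comparison form is equivalent:

# the explicit logical condition gives the same result
dfm_subset(dfmat, ntoken(dfmat) > 0)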