I'm using the awesome quanteda package to convert my dfm to the topicmodels format. In the process, however, I lose my docvars, which I need to identify which topics are most prevalent in which documents. This is especially a problem because the topicmodels package (like STM) only keeps documents with non-zero counts, so the number of documents in the original dfm and in the model output differ. Is there any way to correctly identify the documents in this case?
-
Could you create a minimal reproducible example for us? Same process you are using but maybe with some dummy documents. – phiver May 29 '18 at 17:48
-
I cannot think of example data that lose documents when transforming a dfm to a topicmodels object, unfortunately. – fritsvegters May 29 '18 at 18:04
-
You are losing them now. So if you take one of the quanteda example datasets like data_char_ukimmig2010 or something and use this in your code, for example `my_corpus <- corpus(data_char_ukimmig2010, docvars = data.frame(party = names(data_char_ukimmig2010)))`. Then follow your code and see where docvars are lost and add that to your post. – phiver May 29 '18 at 18:16
3 Answers
I checked your outcome. Because of your select statement, you have no features left in dfm_speeches. Convert that to the "dtm" format used by topicmodels and you indeed get a document-term matrix that has no documents and no terms.
But if your selection with dfm_select results in a dfm with features and you then convert it into the dtm format, you will see the docvars appearing.
dfm_speeches <- dfm(data_corpus_irishbudget2010,
                    remove_punct = TRUE, remove_numbers = TRUE, remove = stopwords("english")) %>%
  dfm_trim(min_termfreq = 4, max_docfreq = 10)
dfm_speeches <- dfm_select(dfm_speeches, c("Bruton", "Cowen"))
docvars(dfm_speeches)
dfmlda <- convert(dfm_speeches, to = "topicmodels")
This will then work further with topicmodels. I will admit that if you convert to a dtm for tm and you have no features, you will still see the documents appearing in the dtm. I'm not sure whether there is an unintended side effect in the conversion to topicmodels when there are no features.

-
I'm sorry, I picked the wrong words. You're right. In your solution the no. of documents also changed from 14 to 4, right? Do you mean that the dimnames appear and not the docvars? – fritsvegters May 29 '18 at 19:38
-
Note: `dfm_select()` here results in a dfm with just one feature, "cowen". The conversion to the topicmodels format drops documents with zero feature counts, since these cause a problem for `topicmodels::LDA()`. As @phiver points out, conversion `to = "tm"` will not drop the empty documents. (But if you are having problems with empty documents, you have deeper problems if you are trying to fit topic models!). – Ken Benoit May 30 '18 at 07:18
-
Thanks for responding. I did not use dfm_select in my own code, but I was trying to come up with a reproducible example that drops documents when converted to topicmodels. Could you explain what you mean by deeper problems? The point is that with my data I have empty documents. This wasn't a problem for the LDA package, whereas with topicmodels it is. However, with Kohei's solution I can now figure out which document is which. – fritsvegters May 30 '18 at 13:07
-
By deeper problems I simply meant that you are going to have trouble fitting any model to a matrix where some “documents” have zero features. – Ken Benoit May 31 '18 at 07:57
I don't think the problem is described clearly, but I believe I understand what it is.
The document-feature matrix passed to a topic model cannot contain empty documents, so the model returns a named vector of topics without them. You can still work with that if you match the topics to the document names:
# mx is a quanteda dfm
# topic is a named vector for topics from LDA
docvars(mx, "topic") <- topic[match(docnames(mx), names(topic))]
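To make the matching step concrete, here is a minimal base-R sketch. It assumes a toy count matrix standing in for the dfm and made-up topic assignments; the names `counts`, `kept`, and `aligned` are hypothetical and used only for illustration. Documents that were dropped for having zero counts end up with `NA`.

```r
# Toy document-feature counts standing in for a quanteda dfm (hypothetical data).
counts <- matrix(
  c(2, 0,
    0, 0,
    1, 3,
    0, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("doc1", "doc2", "doc3", "doc4"), c("cowen", "budget"))
)

# Documents with zero counts are the ones that would be dropped on conversion.
kept <- rownames(counts)[rowSums(counts) > 0]

# Hypothetical topic assignments, named by the surviving documents only.
topic <- setNames(c(1L, 2L), kept)

# Align the topics to the full set of documents; dropped documents get NA.
aligned <- topic[match(rownames(counts), names(topic))]
names(aligned) <- rownames(counts)
aligned
```

The same indexing expression is what gets assigned to `docvars(mx, "topic")` above, so the result has one entry per document in the original dfm.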

Sorry, here's an example.
dfm_speeches <- dfm(data_corpus_irishbudget2010,
                    remove_punct = TRUE, remove_numbers = TRUE, remove = stopwords("english")) %>%
  dfm_trim(min_termfreq = 4, max_docfreq = 10)
dfm_speeches <- dfm_select(dfm_speeches, c("corbyn", "hillary"))
library(topicmodels)
dfmlda <- convert(dfm_speeches, to = "topicmodels")
dfmlda
As you can see, the dfmlda object is empty because I modified my dfm by removing specific words.
