
For several dfms, I have no problem converting them to stm/lda/topicmodels format. However, if I weight the dfms with dfm_tfidf() before converting, I get the following error:

Error in convert.dfm(users_dfm, to = "stm") : cannot convert a non-count dfm to a topic model format

Any idea why this might be? I've tried different weighting schemes for both term and document frequency (to try and make the weighted dfm a 'count' dfm), but I keep getting the error.

So, this works:

users_dfm <- dfm(users_tokens) 
users_stm <- convert(users_dfm, to = "stm")

But this doesn't:

users_dfm <- dfm(users_tokens)
weighted_dfm <- dfm_tfidf(users_dfm)
users_stm <- convert(weighted_dfm, to = "stm")

Thanks!

glts

1 Answer


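Topic models require counts as inputs, because the statistical distribution assumed by the latent Dirichlet allocation model is a distribution over word counts. tf-idf weighting turns the cells of the dfm into non-integer values, which are not valid inputs for stm (or any other topic model). Changing the term-frequency or document-frequency scheme does not help: once a dfm has been weighted, convert() treats it as a non-count dfm and refuses to convert it.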

So in short, don't weight your dfm before using it with a topic model.
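To see why the weighted dfm is rejected, it can help to inspect the cell values before and after weighting. The following is a minimal sketch on toy documents invented for illustration; `is_count_matrix()` is a hypothetical helper written here, not a quanteda function, and it is not the check that convert() itself performs.

```r
library(quanteda)

# Toy documents, invented purely for illustration.
txt <- c(d1 = "apple banana apple",
         d2 = "banana cherry",
         d3 = "cherry apple cherry")

toy_dfm   <- dfm(tokens(txt))    # raw counts
toy_tfidf <- dfm_tfidf(toy_dfm)  # default tf-idf weighting

# Hypothetical helper: TRUE if every cell is a whole number.
is_count_matrix <- function(x) {
  m <- as.matrix(x)
  all(m == round(m))
}

is_count_matrix(toy_dfm)    # TRUE: plain counts
is_count_matrix(toy_tfidf)  # FALSE: weighted cells are fractional
```

Each term here occurs in two of the three documents, so every non-zero cell is multiplied by log10(3/2), which is about 0.176, and the result is no longer a count.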

You should also note that conversion of a dfm to the stm format is not strictly required, since stm::stm() can take a dfm object directly as an input.
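As a rough sketch of skipping the conversion step, the example below fits a model on an unweighted dfm passed straight to stm(). It uses quanteda's built-in `data_corpus_inaugural` as a stand-in for your own tokens; the stopword removal, the trimming threshold, `K = 5`, and `max.em.its = 10` are arbitrary choices made only to keep the example quick, not recommendations.

```r
library(quanteda)
library(stm)

# Stand-in data: the inaugural address corpus that ships with quanteda.
toks <- tokens(data_corpus_inaugural, remove_punct = TRUE)

# Build an unweighted (count) dfm; the trimming is only there for speed.
counts_dfm <- dfm(toks)
counts_dfm <- dfm_remove(counts_dfm, pattern = stopwords("en"))
counts_dfm <- dfm_trim(counts_dfm, min_termfreq = 5)

# Pass the dfm directly to stm(); no convert() call is needed.
# K and max.em.its are arbitrary values for this illustration.
fit <- stm(counts_dfm, K = 5, max.em.its = 10, verbose = FALSE)

# Show the highest-probability words for each topic.
labelTopics(fit, n = 5)
```

With your own data, `counts_dfm` would simply be `users_dfm` from the question, left unweighted.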

Ken Benoit

Thanks, Ken. I asked because a recent paper in PolComm (link below, see page 25 under 'model estimation') explicitly mentions weighting a dfm by tf-idf in quanteda before running stm. I assumed the authors used convert(), but now I'm curious as to how this was done. Tf-idf is often mentioned as a best practice for topic models, but I personally haven't seen much improved performance. So I tried to add the weighting to some ongoing work, and came across the error. What you wrote makes sense though. https://www.tandfonline.com/doi/abs/10.1080/10584609.2020.1785067?journalCode=upcp20 – Michael Bossetta Aug 29 '20 at 00:45