
I have a vector of strings (each representing a preprocessed document) on which I want to estimate an LDA model in R, using functions from the topicmodels package.

To make the problem easy to reproduce, I create a vector of three documents and fit an LDA model with 5 topics. The full code is as follows:


#install.packages("tm")
library("tm")
#install.packages("topicmodels")
library("topicmodels")

vector_of_speeches<- c("feder reserv commit use full rang tool support us economi challeng time therebi promot maxemploy pricest goal", "progress strong polici support indic economicact employ continu strengthen sector advers affect pandem improv recent month continu affect covid job gain solid recent month unemploymentr declin substanti suppli demand imbal relat pandem economi continu contribut elev level inflat overal financialcondit remain accommod part reflect polici measur support economi flow credit us household busi","path economi continu depend cours viru progress eas suppli expect support continu gain economicact employ reduct inflat risk economicoutlook remain includ new viru")

df <- as.data.frame(vector_of_speeches)

myCorpus <- Corpus(VectorSource(df$vector_of_speeches))
dtm <- TermDocumentMatrix(myCorpus)
inspect(dtm) # 3 documents and 68 different words




#LDA prep
burnin <- 4000
iter <- 4000
keep <- 50
k<-5
delta_gibbs <- 0.025
alpha_gibbs <- 50/k
seed=0

fomc_LDA <- LDA(dtm, k=k, method = "Gibbs", control = list(seed=seed, burnin = burnin, iter = iter, keep = keep))


str(as.matrix(posterior(fomc_LDA)$terms)) # dimension is 5 x 3, so the number of topics is being related to the number of documents

str(as.matrix(posterior(fomc_LDA)$topics)) # dimension is 68 x 5, so the number of unique words is being related to the number of topics

The element that holds the topic distribution per document is $topics, and the one that holds the vocabulary distribution per topic is $terms. However, they are clearly inverted in the code above (the topic distribution per document is actually returned by $terms). Why is this occurring, and is it safe to use the per-document topic distributions returned by $terms?

When I use the full vector of documents (almost 2000), I tried transposing the term-document matrix with dtm <- t(dtm), but then running the LDA model yields the following error:

Error in LDA(dtm, k = k, method = "Gibbs", control = list(seed = seed,  : 
  Each row of the input matrix needs to contain at least one non-zero entry

Why does this occur? It is strange that $topics and $terms seem inverted in the output they deliver, so I am not sure whether I can rely on $terms to obtain the correct topic distributions per document (which is what I need).

Thomas GF
  • Are 20+ packages really necessary for a minimal reproducible example? Please remove anything that is unnecessary to help you with your issue. – MrFlick Aug 25 '22 at 18:42
  • Thank you for the feedback, I reduced it only to the necessary ones. – Thomas GF Aug 27 '22 at 01:48
  • I figured it out, and am posting here in case anybody faces a similar problem. It turns out the correct line is "dtm <- DocumentTermMatrix(myCorpus)" instead of "dtm <- TermDocumentMatrix(myCorpus)". I didn't know both of these functions existed, but they exchange rows with columns (documents with words). Moreover, every document has to contain at least one word. – Thomas GF Aug 27 '22 at 19:26
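
For reference, here is a minimal sketch of a setup that avoids both problems. LDA() in topicmodels treats the rows of its input as documents, so the matrix is built with DocumentTermMatrix(), and rows with no terms are dropped before fitting. The documents are shortened versions of the ones in the question, the empty third document is a hypothetical addition to reproduce the error condition, and k = 2 with short burn-in and iteration counts are assumptions chosen only to keep the run fast.

```r
library(tm)
library(topicmodels)

# Shortened example documents; the empty string is a hypothetical
# document that would trigger the "at least one non-zero entry" error.
docs <- c("feder reserv commit support economi",
          "polici support economi inflat employ",
          "",
          "viru risk economi inflat employ")

corpus <- Corpus(VectorSource(docs))

# Rows are documents and columns are terms: the orientation LDA() expects.
dtm <- DocumentTermMatrix(corpus)

# Drop documents that contain no terms before fitting.
row_totals <- slam::row_sums(dtm)
dtm <- dtm[row_totals > 0, ]

# Small k and few iterations are illustrative assumptions, not recommendations.
fit <- LDA(dtm, k = 2, method = "Gibbs",
           control = list(seed = 0, burnin = 100, iter = 100))

post <- posterior(fit)
dim(post$topics)  # documents x topics
dim(post$terms)   # topics x terms
```

With this orientation, $topics has one row per retained document and $terms has one row per topic, which matches the roles the two elements are meant to play.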
