I have a vector of strings (each representing a preprocessed document) on which I want to estimate an LDA model in R, using functions from the topicmodels package.
To make the problem easy to reproduce, I create a vector with three documents and impose 5 topics on the LDA model. The full code is as follows:
#install.packages("tm")
library("tm")
#install.packages("topicmodels")
library("topicmodels")
vector_of_speeches<- c("feder reserv commit use full rang tool support us economi challeng time therebi promot maxemploy pricest goal", "progress strong polici support indic economicact employ continu strengthen sector advers affect pandem improv recent month continu affect covid job gain solid recent month unemploymentr declin substanti suppli demand imbal relat pandem economi continu contribut elev level inflat overal financialcondit remain accommod part reflect polici measur support economi flow credit us household busi","path economi continu depend cours viru progress eas suppli expect support continu gain economicact employ reduct inflat risk economicoutlook remain includ new viru")
df <- as.data.frame(vector_of_speeches)
myCorpus <- Corpus(VectorSource(df$vector_of_speeches))
dtm <- TermDocumentMatrix(myCorpus)
inspect(dtm) # 3 documents and 68 different words
#LDA prep
burnin <- 4000
iter <- 4000
keep <- 50
k <- 5
delta_gibbs <- 0.025
alpha_gibbs <- 50/k
seed <- 0
fomc_LDA <- LDA(dtm, k=k, method = "Gibbs", control = list(seed=seed, burnin = burnin, iter = iter, keep = keep))
str(as.matrix(posterior(fomc_LDA)$terms)) # dimension is 5 x 3, so the number of topics is being related with the number of documents
str(as.matrix(posterior(fomc_LDA)$topics)) # dimension is 68 x 5, so the number of unique words is being related with the number of topics
The element that holds the topic distribution per document is $topics, and the one that holds the vocabulary distribution per topic is $terms. However, they are clearly inverted in the code above (the topic distribution per document is actually returned in $terms). Why is this occurring, and is it safe to use the topic distributions per document that $terms returns?
With the full vector of documents (almost 2000), I tried transposing the term-document matrix by writing dtm <- t(dtm), but running the LDA model then yields the following error:
Error in LDA(dtm, k = k, method = "Gibbs", control = list(seed = seed, :
Each row of the input matrix needs to contain at least one non-zero entry
Why does this occur? It seems odd that the output of $topics and $terms appears swapped, and I am not sure whether I can rely on $terms to obtain the correct topic distributions per document (which is what I need).
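For reference, here is a minimal sketch of the check that seems relevant to the error above. It assumes, as far as I understand from the topicmodels documentation, that LDA() treats each row of its input as one document. The names toy_docs, toy_corpus, toy_dtm, empty_docs and toy_dtm_clean are hypothetical and not part of my real code. The toy input deliberately includes one empty document to show how a row without any non-zero entry could arise once documents sit in rows.

```r
library(tm)    # Corpus(), VectorSource(), DocumentTermMatrix()
library(slam)  # row_sums() for sparse matrices; installed as a dependency of tm

# Hypothetical toy input: the second document is empty after preprocessing.
toy_docs <- c("feder reserv commit", "", "path economi continu")
toy_corpus <- Corpus(VectorSource(toy_docs))

# DocumentTermMatrix() puts documents in rows and terms in columns,
# the opposite orientation of TermDocumentMatrix().
toy_dtm <- DocumentTermMatrix(toy_corpus)
dim(toy_dtm)

# Flag rows (documents) that contain no terms at all.
empty_docs <- row_sums(toy_dtm) == 0
which(empty_docs)

# Keep only the non-empty documents; this is the matrix one would pass on.
toy_dtm_clean <- toy_dtm[!empty_docs, ]
dim(toy_dtm_clean)
```

If this reasoning applies to my data, running the same row_sums() check on the transposed matrix should show which of the ~2000 documents are empty. Note that tm drops words shorter than three characters by default, so a very short document can end up empty even if its string is not.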