I was wondering if anyone knows of a way to extract term weights / probabilities out of a topic model constructed in R, using the topicmodels package.

Following the linked example, I created a topic model like so:

Gibbs = LDA(JSS_dtm, k = 4, 
            method = "Gibbs",
            control = list(seed = 1, burnin = 1000, thin = 100, iter = 1000))

We can then get the most likely topic for each document using topics(Gibbs, 1), the top terms for each topic using terms(Gibbs, 10), and even the per-document topic probabilities using Gibbs@gamma, but after looking at str(Gibbs) it appears that there is no way to get the term probabilities within each topic. This would be useful because topic 1 could be 50% term A and 50% term B, while topic 2 could be 90% term C and 10% term D. I'm aware that tools like MALLET and Python's NLTK module offer this capability, but I was hoping that a similar solution might exist in R.

If anyone knows how this can be achieved, please let us know.

Many thanks!

EDIT:

For the benefit of others, I thought I'd share my current workaround. If I knew the term probabilities, I could visualise them and give the viewer a better understanding of what each topic means. Without the probabilities, I'm simply splitting my data by topic and creating a word cloud for each topic using binary weights. While these values are not probabilities, they give an indication of what each topic focuses on.

See the below code:

JSS_text   <- sapply(1:length(JSS_papers[,"description"]), function(x) unlist(JSS_papers[x,"description"]))
jss_df     <- data.frame(text=JSS_text,topic=topics(Gibbs, 1))
jss_dec_df <- data.frame()

for(i in unique(topics(Gibbs, 1))){
  jss_dec_df <- rbind(jss_dec_df,data.frame(topic = i, 
                                            text = paste(jss_df[jss_df$topic==i,"text"],collapse=" ")))
}

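library(tm)

corpus <- Corpus(VectorSource(jss_dec_df$text))
# Despite the name, this is a term-document matrix with one "document" per topic
JSS_dtm <- TermDocumentMatrix(corpus,control = list(stemming = TRUE, 
                                                    stopwords = TRUE, 
                                                    minWordLength = 3, # newer tm versions use wordLengths = c(3, Inf)
                                                    removeNumbers = TRUE, 
                                                    removePunctuation = TRUE,
                                                    # binary term weights, cosine-normalised
                                                    weighting = function(x)weightSMART(x,spec="bnc")))

(JSS_dtm  = removeSparseTerms(JSS_dtm,0.1)) # note the sparsity parameter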

library(wordcloud)
comparison.cloud(as.matrix(JSS_dtm),random.order=F,max.words=100, 
                 scale=c(6,0.6),colours=4,title.size=2)
IVR

1 Answer

Figured it out: to get the term weights, use posterior(lda_object)$terms. It turned out to be much easier than I thought!
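
For anyone who wants to try this end to end, here is a minimal sketch. It does not use the JSS data from the question; as an assumption for illustration, it fits a small model on a subset of the AssociatedPress document-term matrix that ships with topicmodels, with shortened burn-in and iteration counts so it runs quickly. The names dtm, fit, post, term_probs and doc_probs are made up for this example.

```r
library(topicmodels)

# Stand-in data (an assumption, not the JSS corpus from the question):
# the AssociatedPress document-term matrix bundled with topicmodels.
data("AssociatedPress", package = "topicmodels")
dtm <- AssociatedPress[1:50, ]  # small subset so the fit is quick

# Same kind of model as in the question, with fewer iterations
fit <- LDA(dtm, k = 4, method = "Gibbs",
           control = list(seed = 1, burnin = 100, thin = 10, iter = 100))

post <- posterior(fit)
term_probs <- post$terms   # topics x terms matrix; each row sums to 1
doc_probs  <- post$topics  # documents x topics matrix

# Ten most probable terms in topic 1, with their probabilities
head(sort(term_probs[1, ], decreasing = TRUE), 10)
```

Each row of term_probs is one topic's distribution over the vocabulary, so it can be fed straight into a word cloud or bar chart instead of the binary weights used in the workaround above.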

IVR