
I use the following function to convert the topicmodels output to JSON for use in LDAvis.

topicmodels_json_ldavis <- function(fitted, corpus, doc_term){
     ## Required packages
     library(topicmodels)
     library(dplyr)
     library(stringi)
     library(tm)
     library(LDAvis)

     ## Find required quantities
     phi <- posterior(fitted)$terms %>% as.matrix
     theta <- posterior(fitted)$topics %>% as.matrix
     vocab <- colnames(phi)
     doc_length <- vector()
     for (i in 1:length(corpus)) {
          temp <- paste(corpus[[i]]$content, collapse = ' ')
          doc_length <- c(doc_length, stri_count(temp, regex = '\\S+'))
     }
     temp_frequency <- inspect(doc_term)
     freq_matrix <- data.frame(ST = colnames(temp_frequency),
                               Freq = colSums(temp_frequency))
     rm(temp_frequency)

     ## Convert to json
     json_lda <- LDAvis::createJSON(phi = phi, theta = theta,
                                    vocab = vocab,
                                    doc.length = doc_length,
                                    term.frequency = freq_matrix$Freq)

     return(json_lda)
}

but I receive the following error:

Error in LDAvis::createJSON(phi = phi, theta = theta, vocab = vocab, doc.length = doc_length, : Length of doc.length not equal to the number of rows in theta; both should be equal to the number of documents in the data.
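For reference, the two counts the error compares can be inspected directly (result55 and corpus come from my code below):

theta <- as.matrix(topicmodels::posterior(result55)$topics)
nrow(theta)     # one row per document the model was fitted on
length(corpus)  # one entry per document in the corpus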

Here is my complete code:

library(tm)
library(topicmodels)

data <- read.csv("textmining.csv")


corpus <- Corpus(DataframeSource(data.frame(data$reasonforleaving))) 

# Remove punctuations and numbers because they are generally uninformative.
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
# Convert all words to lowercase.
corpus <- tm_map(corpus, content_transformer(tolower))
# Remove stopwords such as "a", "the", etc.
corpus <- tm_map(corpus, removeWords, stopwords("english"))
# Use the SnowballC package to do stemming.
library(SnowballC)
corpus <- tm_map(corpus, stemDocument)


# remove extra words
toSpace <- content_transformer(function (x , pattern ) gsub(pattern, " ", x))
corpus <- tm_map(corpus, toSpace, "still")
corpus <- tm_map(corpus, toSpace, "also")

# Remove excess whitespace between words.
corpus <- tm_map(corpus, stripWhitespace)
# Inspect the first document to see what it looks like.
corpus[[1]]$content 

dtm <- DocumentTermMatrix(corpus)

# remove empty documents
library(slam)
dtm = dtm[row_sums(dtm)>0,]

# Use topicmodels package to conduct LDA analysis.

burnin <- 500
iter <- 1000
keep <- 30
k <- 5

result55 <- LDA(dtm, 5)
ldaoutput = topicmodels_json_ldavis(result55,corpus, dtm)

Do you know why I receive the error?

Thanks

2 Answers


I had the same issue with the same code, and found this function here:

topicmodels2LDAvis <- function(x, ...){
    post <- topicmodels::posterior(x)
    if (ncol(post[["topics"]]) < 3) stop("The model must contain > 2 topics")
    mat <- x@wordassignments
    LDAvis::createJSON(
        phi = post[["terms"]], 
        theta = post[["topics"]],
        vocab = colnames(post[["terms"]]),
        doc.length = slam::row_sums(mat, na.rm = TRUE),
        term.frequency = slam::col_sums(mat, na.rm = TRUE)
    )
}

Much simpler to use: just pass your fitted LDA model as the argument. It sidesteps the length problem because x@wordassignments is derived from the document-term matrix the model was actually fitted on, so doc.length, theta, and term.frequency all have one entry (or row) per fitted document:

result55 <- LDA(dtm, 5)
serVis(topicmodels2LDAvis(result55))
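If you want to keep the generated files rather than serve them from a temporary directory, out.dir and open.browser are standard arguments of LDAvis::serVis:

serVis(topicmodels2LDAvis(result55), out.dir = "ldavis_out", open.browser = TRUE)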
Léo Joubert

Problem

Your problem is the for (i in 1:length(corpus)) loop in

doc_length <- vector()
for (i in 1:length(corpus)) {
     temp <- paste(corpus[[i]]$content, collapse = ' ')
     doc_length <- c(doc_length, stri_count(temp, regex = '\\S+'))
}

Remember, you removed the "empty" documents from your DocumentTermMatrix with dtm = dtm[row_sums(dtm)>0,], but the loop above still iterates over the full corpus, so the doc_length vector it builds ends up longer than the number of rows in theta.
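You can see the mismatch directly with the objects from the question:

length(corpus)  # still counts the empty documents
nrow(dtm)       # empty rows were dropped, so this is smaller

theta has nrow(dtm) rows, while the doc_length vector built by the loop has length(corpus) entries, hence the error.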

Suggestion

You may want to keep a vector flagging the empty docs around, as it helps not only to generate the JSON but also to move back and forth between your empty and full document sets:

doc.length = colSums(as.matrix(tdm))[!empty.docs]

Note that createJSON() expects the total token count per document, not the number of distinct terms. This assumes you still have the full tdm, with the empty docs in place.
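A minimal sketch of that bookkeeping (tdm and empty.docs are illustrative names; the matrix is built before any documents are dropped):

library(tm)
library(slam)

# Build the term-document matrix BEFORE filtering, so its columns
# still cover every document in the corpus.
tdm <- TermDocumentMatrix(corpus)

# Flag the documents that contain no terms after preprocessing.
empty.docs <- col_sums(tdm) == 0

# Token counts for the non-empty documents only -- one entry per
# row of theta, which is what createJSON() checks for.
doc.length <- col_sums(tdm)[!empty.docs]

The same empty.docs vector lets you subset the corpus (corpus[!empty.docs]) whenever you need to map model output back to the original documents.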

Eugene