
The LDA topic modeling in the text2vec package is amazing. It is indeed much faster than topicmodels.

However, I don't know how to get the probability that each document belongs to each topic, as in the example below:

        V1           V2           V3           V4
    1   0.001025237  7.89E-05     7.89E-05     7.89E-05
    2   0.002906977  0.002906977  0.014534884  0.002906977
    3   0.003164557  0.003164557  0.003164557  0.003164557
    4   7.21E-05     7.21E-05     0.000360334  7.21E-05
    5   0.000804433  8.94E-05     8.94E-05     8.94E-05
    6   5.63E-05     5.63E-05     5.63E-05     5.63E-05
    7   0.001984127  0.001984127  0.001984127  0.001984127
    8   0.003515625  0.000390625  0.000390625  0.000390625
    9   0.000748503  0.000748503  0.003742515  0.003742515
    10  0.000141723  0.00297619   0.000141723  0.000708617

This is my code for the text2vec LDA:

library(text2vec)

# ss2: the raw weibo text; mmseg4j() does the Chinese word segmentation
ss2 <- as.character(stressor5$weibo)
seg2 <- mmseg4j(ss2)

# Create vocabulary. Terms will be unigrams (simple words).
it_test = itoken(seg2, progressbar = FALSE)
vocab2 <- create_vocabulary(it_test)

pruned_vocab2 = prune_vocabulary(vocab2,
                                 term_count_min = 10,
                                 doc_proportion_max = 0.5,
                                 doc_proportion_min = 0.001)

vectorizer2 <- vocab_vectorizer(pruned_vocab2)

dtm_test = create_dtm(it_test, vectorizer2)

lda_model = LDA$new(n_topics = 1000, vocabulary = vocab2, doc_topic_prior = 0.1, topic_word_prior = 0.01)

doc_topic_distr = lda_model$fit_transform(dtm_test, n_iter = 1000, convergence_tol = 0.01, check_convergence_every_n = 10)

1 Answer

doc_topic_distr is a matrix that contains the number of times words from each document were assigned to each particular topic. So you just need to normalize each row by the number of words in the document (you can also add doc_topic_prior before normalization).

library(text2vec)
data("movie_review")
tokens = movie_review$review %>% 
  tolower %>% 
  word_tokenizer
# turn off progressbar because it won't look nice in rmd
it = itoken(tokens, ids = movie_review$id, progressbar = FALSE)
v = create_vocabulary(it) %>% 
  prune_vocabulary(term_count_min = 10, doc_proportion_max = 0.2)
vectorizer = vocab_vectorizer(v)
dtm = create_dtm(it, vectorizer, type = "lda_c")
doc_topic_prior = 0.1
lda_model = 
  LDA$new(n_topics = 10, vocabulary = v, 
          doc_topic_prior = doc_topic_prior, topic_word_prior = 0.01)

doc_topic_distr = 
  lda_model$fit_transform(dtm, n_iter = 1000, convergence_tol = 0.01, 
                          check_convergence_every_n = 10)
head(doc_topic_distr)
#       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
#5814_8   16   18    0   34    0   16   49    0   20    23
#2381_9    4    0    6   20    0    0    6    6    0    28
#7759_3   21   39    7    0    3   47    0   25   21    17
#3630_4   18    7   22   14   19    0   18    0    2    35
#9495_8    4    0   13   17   13   78    3    2   28    25
#8196_8    0    0    0   11    0    8    0    8    8     0
doc_topic_prob = normalize(doc_topic_distr, norm = "l1")
# or add doc_topic_prior first and then normalize:
# doc_topic_prob = normalize(doc_topic_distr + doc_topic_prior, norm = "l1")
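To make the normalization explicit: each entry of doc_topic_prob is just the topic count divided by the row total (the number of assigned words in that document, plus the prior if you add it). A minimal base-R sketch of the same operation, reusing doc_topic_distr and doc_topic_prior from the code above:

smoothed = doc_topic_distr + doc_topic_prior
# divide every row by its total so each row becomes a probability distribution
doc_topic_prob = sweep(smoothed, 1, rowSums(smoothed), "/")
summary(rowSums(doc_topic_prob))  # sanity check: every row should sum to 1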
  • Thanks for developing and maintaining this terrific package! One question: is it also possible to get a matrix of top words by topic? Similar to what they show on page 15 of this paper [ https://cran.r-project.org/web/packages/topicmodels/vignettes/topicmodels.pdf ]? – sriramn Mar 02 '17 at 18:37
  • Yes, I have it locally. Will push it to GitHub soon along with a new LDA algorithm - WarpLDA. – Dmitriy Selivanov Mar 02 '17 at 19:18
  • As a quick workaround, are the values in `LDA$get_word_vectors()` the term frequencies by topic - and can I just sort them to get the frequent words? I am still trying to wrap my head around the LDA class. – sriramn Mar 02 '17 at 19:23
  • Something like `word_topic_counts = lda_model$get_word_vectors(); topic_word_distr = t(word_topic_counts + topic_word_prior) %>% normalize('l1')` should work. Just sort each row - you will get the top words for each topic. – Dmitriy Selivanov Mar 02 '17 at 20:13
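Building on the last comment, here is a rough sketch of how the per-topic top words could be extracted. It assumes (as the comment suggests) that get_word_vectors() returns the words-by-topics count matrix with the vocabulary terms as row names; if the row names are missing in your version, you would need to attach the terms from the vocabulary yourself.

word_topic_counts = lda_model$get_word_vectors()   # assumed: words x topics counts
topic_word_distr = t(word_topic_counts + 0.01) %>%  # 0.01 = topic_word_prior used above
  normalize("l1")                                    # topics x words probabilities
# for each topic (row), take the indices of the 10 largest probabilities
top_words = apply(topic_word_distr, 1, function(p)
  rownames(word_topic_counts)[order(p, decreasing = TRUE)[1:10]])
top_words[, 1:3]                                     # top terms of the first three topics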