library(text2vec)
library(parallel)
library(doParallel)

N <- parallel::detectCores()
cl <- makeCluster(N)
registerDoParallel(cl)
Ky_young <- read.csv("./Ky_young.csv")

IT <- itoken_parallel(Ky_young$TEXTInfo,
                      ids         = Ky_young$ID,
                      tokenizer   = word_tokenizer,
                      progressbar = FALSE)

## stopwords
stop_words <- readLines("./stopwrd1.txt", encoding = "UTF-8")

VOCAB <- create_vocabulary(
        IT, stopwords = stop_words,
        ngram = c(1, 1)) %>%
        prune_vocabulary(term_count_min = 5)


VOCAB.order <- VOCAB[order(VOCAB$term_count, decreasing = TRUE), ]

VECTORIZER <- vocab_vectorizer(VOCAB)

DTM <- create_dtm(IT, VECTORIZER, distributed = FALSE)


LDA_MODEL <-
      LatentDirichletAllocation$new(n_topics         = 200,
                                    #vocabulary      = VOCAB, <= ERROR
                                    doc_topic_prior  = 0.1,
                                    topic_word_prior = 0.01)


## document-topic distribution
LDA_FIT <- LDA_MODEL$fit_transform(
        x = DTM,
        n_iter = 50,
        convergence_tol = -1,
        n_check_convergence = 10)

# topic-word distribution (renamed so it does not shadow the
# topic_word_prior hyperparameter used above)
TOPIC_WORD_DIST <- LDA_MODEL$topic_word_distribution
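Since the topic-word distribution is just a topics × vocabulary probability matrix, the top terms per topic can be inspected with base R alone. A minimal sketch using a toy stand-in matrix (the term names here are hypothetical, not from the data above):

```r
# Toy stand-in for LDA_MODEL$topic_word_distribution:
# 2 topics over a 4-term vocabulary, each row sums to 1
topic_word <- matrix(c(0.5, 0.30, 0.10, 0.10,
                       0.1, 0.05, 0.45, 0.40),
                     nrow = 2, byrow = TRUE,
                     dimnames = list(NULL, c("king", "queen", "data", "model")))

# Top-2 terms per topic, ranked by probability
# (result: one column per topic)
top_terms <- apply(topic_word, 1,
                   function(p) names(sort(p, decreasing = TRUE))[1:2])
```

With the fitted model, `LDA_MODEL$get_top_words()` gives the same kind of ranking directly.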

I created this test LDA code with text2vec, and I can get the word-topic distribution and the document-topic distribution (and it was crazy fast).

By the way, I am wondering: is it possible to get the topic distribution for each token in a document from text2vec's LDA model?

I understand that in the LDA analysis each token in a document is assigned to a specific topic, so each document has a topic distribution.

If I could get each token's topic distribution, I would like to check how each topic's top words change across classified documents (e.g. by period). Is that possible?

If there is another way, I would be very grateful to hear it.

유승환
  • Topic-word assignments are in `LDA_MODEL$components`. Is that what you are looking for? – Dmitriy Selivanov Sep 11 '17 at 11:06
  • If I could match the `LDA_MODEL$components` result with the raw document set, I could find out each token's topic within a document. I saw the field you mentioned when I tested your package, but I failed to match it with the raw document set. For example, I tried to see the words belonging to the first 100 documents in the `LDA_MODEL$components` result. Is that possible? – 유승환 Sep 11 '17 at 11:59
  • Not sure I understand what you are trying to achieve. Could you provide an example (update the question)? Not code, just describe your use case. – Dmitriy Selivanov Sep 11 '17 at 12:21
  • As I understand it, the distribution of topics comes from the terms in the document being assigned to particular topics, so the distribution over all topics is the sum of the terms assigned to each topic. (Is that correct?) – 유승환 Sep 11 '17 at 12:50
  • And the LDA model produced by the topic-modeling analysis covers the entire text used for the analysis. Suppose it is a diary: I split the diary data by year and record the year in the document title. I want to see the topic distribution by period, but I also want to see the changes in the terms that make up each topic. – 유승환 Sep 11 '17 at 12:51
  • If the distribution of a topic is the sum of the terms assigned to it, then since every term belongs to a document, it should be possible to calculate the topic distribution from the sum over the terms belonging to the documents matching the period I want to examine. I think this would show the change in the terms that make up each topic (like the top-term changes per topic from `LDA_MODEL$get_top_words`). – 유승환 Sep 11 '17 at 12:51
  • Sorry for my terrible English. Thank you for your attention. – 유승환 Sep 11 '17 at 12:51

1 Answer


Unfortunately it is impossible to get the distribution of topics for each token in a given document. Document-topic counts are calculated/aggregated "on the fly", so the document-token-topic distribution is not stored anywhere.
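One workaround for the per-period question above: since `fit_transform` returns a documents × topics matrix, topic mass can be aggregated over the documents of each period instead of over individual tokens. A minimal base-R sketch with a toy document-topic matrix (the period labels are hypothetical stand-ins for years parsed from the document titles):

```r
# Toy stand-in for the documents-x-topics matrix returned by fit_transform()
doc_topic <- matrix(c(0.9, 0.1,
                      0.2, 0.8,
                      0.5, 0.5),
                    nrow = 3, byrow = TRUE)

# One period label per document
period <- c("2016", "2016", "2017")

# Sum topic mass within each period, then renormalise each row
# back into a probability distribution over topics
period_topic <- rowsum(doc_topic, group = period)
period_topic <- period_topic / rowSums(period_topic)
```

To see how the *terms* of a topic shift by period, one option (under the same limitation) is to refit the model on each period's subset and compare the `get_top_words` output across fits.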

Dmitriy Selivanov
  • Thank you for your answer. It is very reassuring to hear it from the package author himself. Thank you for saving my time :) – 유승환 Sep 12 '17 at 04:26