I have a document-term matrix and I plan to perform NLP analysis on it. In my original code, I defined thresholds for TF (term frequency) and DF (document frequency) to:
- remove some unnecessary words
- increase computational speed
So I defined something like this:
library(textmineR)   # CreateDtm, TermDocFreq
library(dplyr)       # %>%, select

# create the DTM
dtm <- CreateDtm(tokens$clean_remark,
                 doc_names = tokens$ML..,
                 ngram_window = c(1, 2))

# explore the basic frequencies
tf <- TermDocFreq(dtm = dtm)
original_tf <- tf %>% select(term, term_freq, doc_freq)
rownames(original_tf) <- 1:nrow(original_tf)
# Eliminate words appearing fewer than 350 times or in more than a quarter of
# the documents
inds_vocabs <- which(tf$term_freq > 350 & tf$doc_freq < nrow(dtm) / 4)
vocabulary <- tf$term[inds_vocabs]
dtm <- dtm[, inds_vocabs]
As you can see, this eliminates words that appear fewer than 350 times or that appear in more than a quarter of the documents.
I thought that what I am actually doing here is considering both term and document frequency to find the more important words, so I was wondering whether I could use TF-IDF instead.
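What I was imagining is something along these lines (an untested sketch of a TF-IDF-based vocabulary filter; it relies on the idf column that TermDocFreq() returns alongside term_freq and doc_freq, and the cutoff of 500 terms is just an arbitrary placeholder):
# rough sketch: score each term by its total tf-idf weight and keep the top terms
tfidf <- tf$term_freq * tf$idf                             # idf column from TermDocFreq()
names(tfidf) <- tf$term
top_terms <- names(sort(tfidf, decreasing = TRUE))[1:500]  # 500 is an arbitrary cutoff
dtm_tfidf <- dtm[, colnames(dtm) %in% top_terms]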
What I actually tried, though, was the following approach:
library(tidytext)      # cast_dtm
library(topicmodels)   # LDA
text_matrix <- text_cleaning_tokens %>% count(ML.., word) %>%
  cast_dtm(document = ML.., term = word, value = n, weighting = tm::weightTfIdf)
# removeSparseTerms(text_matrix, sparse = 0.999)
lda_model <- LDA(text_matrix, k = 5, method = 'Gibbs', control = list(seed = 12345))
When I ran this code, I got this error message:
Error in LDA(text_matrix, k = 5, method = "Gibbs", control = list(seed = 12345)) : The DocumentTermMatrix needs to have a term frequency weighting
I searched and found that the LDA() function doesn't work with a TF-IDF-weighted matrix. Here is the link: tf-idf document term matrix and LDA: Error messages in R
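For what it's worth, I'd expect the same pipeline to run if I switch back to plain term-frequency (count) weighting, which suggests the problem really is the TF-IDF weighting. A minimal sketch, assuming the same objects and libraries as above and that no documents end up empty after cleaning:
text_matrix_tf <- text_cleaning_tokens %>% count(ML.., word) %>%
  cast_dtm(document = ML.., term = word, value = n, weighting = tm::weightTf)
lda_model_tf <- LDA(text_matrix_tf, k = 5, method = 'Gibbs', control = list(seed = 12345))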
I understand that I can't use LDA on a TF-IDF-weighted matrix, but then how can I use TF-IDF for topic modelling? Is there any alternative solution?