I have a document-term matrix and I plan to perform NLP analysis on it. In my original code, I defined thresholds for TF (term frequency) and DF (document frequency) to:
- remove some unnecessary words
- increase computational speed
So I defined something like this:
library(textmineR)   # CreateDtm, TermDocFreq
library(dplyr)       # %>%, select

# create the DTM
dtm <- CreateDtm(tokens$clean_remark,
                 doc_names = tokens$ML..,
                 ngram_window = c(1, 2))

# explore the basic frequencies
tf <- TermDocFreq(dtm = dtm)
original_tf <- tf %>% select(term, term_freq, doc_freq)
rownames(original_tf) <- 1:nrow(original_tf)
# Eliminate words appearing fewer than 350 times or in more than a quarter of
# the documents
inds_vocabs <- which(tf$term_freq > 350 & tf$doc_freq < nrow(dtm) / 4)
vocabulary <- tf$term[inds_vocabs]
dtm <- dtm[, inds_vocabs]
As you can see, this eliminates words that appear fewer than 350 times or that appear in more than a quarter of the documents.
I thought that what I am actually doing here is considering both term and document frequency to find the more important words, so I was wondering whether I could use TF-IDF instead.
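What I was imagining is something along these lines (an untested sketch of a TF-IDF-based vocabulary filter; it relies on the idf column that TermDocFreq() returns alongside term_freq and doc_freq, and the cutoff of 500 terms is just an arbitrary placeholder):
# rough sketch: score each term by its total tf-idf weight and keep the top terms
tfidf <- tf$term_freq * tf$idf                             # idf column from TermDocFreq()
names(tfidf) <- tf$term
top_terms <- names(sort(tfidf, decreasing = TRUE))[1:500]  # 500 is an arbitrary cutoff
dtm_tfidf <- dtm[, colnames(dtm) %in% top_terms]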
What I actually tried, though, was the following approach:
library(tidytext)      # cast_dtm
library(topicmodels)   # LDA
text_matrix <- text_cleaning_tokens %>% count(ML.., word) %>%
  cast_dtm(document = ML.., term = word, value = n, weighting = tm::weightTfIdf)
# removeSparseTerms(text_matrix, sparse = 0.999)
lda_model <- LDA(text_matrix, k = 5, method = 'Gibbs', control = list(seed = 12345))
When I ran this code, I got this error message:
Error in LDA(text_matrix, k = 5, method = "Gibbs", control = list(seed = 12345)) : The DocumentTermMatrix needs to have a term frequency weighting
I searched and found that the LDA() function doesn't work with a TF-IDF-weighted matrix. Here is the link: tf-idf document term matrix and LDA: Error messages in R
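For what it's worth, I'd expect the same pipeline to run if I switch back to plain term-frequency (count) weighting, which suggests the problem really is the TF-IDF weighting. A minimal sketch, assuming the same objects and libraries as above and that no documents end up empty after cleaning:
text_matrix_tf <- text_cleaning_tokens %>% count(ML.., word) %>%
  cast_dtm(document = ML.., term = word, value = n, weighting = tm::weightTf)
lda_model_tf <- LDA(text_matrix_tf, k = 5, method = 'Gibbs', control = list(seed = 12345))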
I understand that I can't use LDA on a TF-IDF-weighted matrix, but then how can I use TF-IDF for topic modelling? Is there any alternative solution?