0

I am working on Topic Modeling where the given text corpus have lots of noise in form of supporting words after removal of stop words. These words have high term frequency but does not help in forming topic terms by using LDA along with other words with high frequency that are useful . How can this noise be removed?

2 Answers2

1

LDA algorithms don't take tf-idf weights in input, but bag of words, however you could first filter words from your corpus based on their tf-idf score, and then feed the new texts to your LDA program.

bendaizer
  • 1,235
  • 9
  • 18
1

Basic thing is that you do a TF-IDF and clean on scores, if that still doesnt help then you can create domain specific custom stopwords list. Suppose if I'm in a jobs domain, the word "job" is not a regular stopword but in jobs domain it is or the company name is a stopword since it repeats across many documents. So, building custom stopwords list is another way to go with.

Praveen
  • 338
  • 2
  • 11