How can Topic Modeling noise be removed?

Question

I am working on topic modeling where the given text corpus has a lot of noise in the form of supporting words that remain after stop-word removal. These words have a high term frequency but do not help LDA form topic terms, unlike other high-frequency words that are useful. How can this noise be removed?

Does filtering by tf-idf score not work well? – greeness, Apr 21 '15 at 06:29
Or just use some common-words dictionary. – Vihari Piratla, Apr 21 '15 at 10:03

score 1 · Answer 1 · answered Apr 21 '15 at 08:07

LDA algorithms take a bag of words as input, not tf-idf weights. However, you could first filter words out of your corpus based on their tf-idf score, and then feed the filtered texts to your LDA program.
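A minimal sketch of that filtering step, using only the Python standard library. The function name `tfidf_filter`, the `min_score` threshold, and the toy corpus are all illustrative assumptions, not part of any LDA package; the tf-idf variant used here (relative term frequency times `log(N / df)`) is one common choice among several.

```python
import math
from collections import Counter

def tfidf_filter(docs, min_score=0.1):
    """Drop tokens whose best tf-idf score in the corpus is below min_score.

    docs: list of token lists (already tokenized, stop words removed).
    Returns new token lists, which are still plain bags of words for LDA.
    The threshold is a hypothetical default and needs tuning per corpus.
    """
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))

    # Track the highest tf-idf score each term reaches in any document.
    best = {}
    for doc in docs:
        counts = Counter(doc)
        for term, count in counts.items():
            tf = count / len(doc)
            idf = math.log(n_docs / df[term])
            best[term] = max(best.get(term, 0.0), tf * idf)

    keep = {term for term, score in best.items() if score >= min_score}
    return [[t for t in doc if t in keep] for doc in docs]

# Toy corpus: "job" is frequent everywhere, so its idf (and tf-idf) is 0.
docs = [
    ["job", "python", "developer", "job"],
    ["job", "nurse", "hospital", "job"],
    ["job", "chef", "kitchen", "job"],
]
filtered = tfidf_filter(docs)
```

The filtered token lists can then be converted to counts and passed to whichever LDA implementation you use.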

bendaizer

score 1 · Answer 2 · answered Apr 24 '15 at 19:11

The basic approach is to compute TF-IDF and filter on the scores. If that still doesn't help, you can create a domain-specific custom stop-word list. For example, in a jobs domain the word "job" is not a regular stop word, but there it effectively is one; likewise a company name is a stop word when it repeats across many documents. So building a custom stop-word list is another way to go.
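One way to sketch a custom stop-word list, assuming the corpus is already tokenized: treat any term that appears in more than some share of documents as a domain stop word, and add hand-picked terms such as a company name. The helper names, the `max_doc_ratio` cutoff of 0.8, and the company name "acme" are hypothetical choices for illustration.

```python
from collections import Counter

def build_domain_stopwords(docs, max_doc_ratio=0.8, extra=()):
    """Collect terms that occur in more than max_doc_ratio of the documents,
    plus any manually supplied terms in `extra`."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    frequent = {term for term, count in df.items() if count / n_docs > max_doc_ratio}
    return frequent | set(extra)

def remove_stopwords(docs, stopwords):
    """Return the documents with every stop word dropped."""
    return [[t for t in doc if t not in stopwords] for doc in docs]

# Toy job-ads corpus: "job" appears in every document, while "acme"
# (a made-up company name) is added by hand.
job_ads = [
    ["job", "acme", "python", "developer"],
    ["job", "acme", "nurse", "hospital"],
    ["job", "chef", "kitchen"],
]
stopwords = build_domain_stopwords(job_ads, extra={"acme"})
cleaned = remove_stopwords(job_ads, stopwords)
```

The cutoff is worth inspecting by eye before trusting it, since a term that appears in most documents can still be topical in a narrow corpus.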
