
There are standard stop lists, listing words like "a", "the", "of", "not" to be removed from a corpus. However, I'm wondering: should the stop list change case by case?

For example, I have 10K articles from a journal. Because of the standard structure of an article, words like "introduction", "review", "conclusion", "page" appear in essentially every document. My concern is: should we remove these words from our corpus, i.e. the words that every document contains? Thanks for any comments and suggestions.

Ruby

2 Answers


I am working on a similar problem, but for text categorization. From my experience, it is good to have a domain-specific stop word list alongside the standard list. Otherwise, words like "introduction", "review", etc. will show up in the term-frequency matrix, as you will see if you inspect it. They can mislead your models by giving extra weight to these domain-specific keywords.
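As a minimal sketch of what that looks like in practice (assuming scikit-learn; the extra terms here are just the ones from the question), you can merge the built-in English stop list with your own domain terms and pass the union to the vectorizer:

```python
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

# Domain-specific terms (taken from the question) merged with the standard list.
domain_stop_words = {"introduction", "review", "conclusion", "page"}
stop_words = list(ENGLISH_STOP_WORDS.union(domain_stop_words))

docs = [
    "Introduction: this review covers topic modeling.",
    "Conclusion: see page 12 of the review.",
]

# The combined list is applied during tokenization, so these words never
# enter the term-frequency matrix at all.
vectorizer = CountVectorizer(stop_words=stop_words)
X = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())  # no "introduction", "review", etc.
```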

pnv

It is worth considering that the stop words might not affect your model as much as you fear. Have you tried not removing them and comparing the results?

See also this 2017 paper: "Pulling Out the Stops: Rethinking Stopword Removal for Topic Models." http://www.cs.cornell.edu/~xanda/stopwords2017.pdf

In conclusion they say (paraphrasing) that leaving stopwords in during training had no real negative effect on the quality of the LDA model, and if needed the stopwords could still be removed from the trained model afterwards without degrading it.

Alternatively, you can always remove high-document-frequency words automatically: set a threshold on the fraction of documents a word may appear in (e.g. 50%) and discard every word above it as a stopword, as in the sketch below. I don't think this will meaningfully change the model itself, but it should speed up its computation, since there are fewer words to process.
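For instance (a sketch assuming scikit-learn, whose `max_df` parameter implements exactly this cutoff):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "introduction topic models",
    "introduction stop words",
    "introduction corpus statistics",
    "topic models in practice",
]

# max_df=0.5 ignores any term whose document frequency is strictly above 50%,
# treating overly common, corpus-specific words as stopwords automatically.
vectorizer = CountVectorizer(max_df=0.5)
X = vectorizer.fit_transform(docs)
# "introduction" (in 3 of 4 documents, df = 0.75) is dropped; rarer terms survive.
print(vectorizer.get_feature_names_out())
```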

Joran Dox