
In NLP, stop-word removal is a typical pre-processing step, and it is usually done empirically, based on what we think stop-words should be.

But in my opinion, we should generalize the concept of stop-words, since the stop-words can vary across corpora from different domains. I am wondering whether we can define stop-words mathematically, for example by their statistical characteristics, and then automatically extract the stop-words from a corpus for a specific domain.

Is there any similar thought and progress on this? Could anyone shed some light?

smwikipedia
    short answer: depending on your corpus and task, you can set up different stop-word lists. Getting the cut-off term frequency value is magic. – amirouche Apr 04 '18 at 19:09

5 Answers


Stop words are ubiquitous. They will appear in every (or almost every) document. A good way to mathematically define stop words for corpora from different domains is to compute the inverse document frequency (IDF) of a word.

IDF is a better basis than raw frequency for defining stop words, because simple frequency counts are skewed by a few specialized documents that contain a particular word many times. This method has been used to automatically learn stop words in foreign languages (ref. Machine Learning with SVM and Other Kernel Methods).
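A minimal sketch of that idea in Python (my own illustration, not taken from the referenced book); the IDF cut-off of 1.0 is just a placeholder you would tune per corpus:

```python
import math
from collections import Counter

def idf_stopwords(documents, max_idf=1.0):
    """Flag words with low IDF as stop-word candidates.

    documents: a list of token lists. max_idf is an assumed, tunable cut-off.
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))          # count each word once per document
    # IDF = log(N / df); words appearing in (almost) every document get an IDF near 0
    return {w for w, df in doc_freq.items() if math.log(n_docs / df) <= max_idf}

docs = [["the", "cat", "sat"], ["the", "dog", "ran"], ["the", "abstract", "model"]]
print(idf_stopwords(docs))   # {'the'}
```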

Chthonic Project

Yes, stop-words can be detected automatically.

Word frequencies as a whole

One way is to look at word frequencies as a whole.

Calculate the frequency of all words in the combined texts, sort them in descending order, and remove the top 20% or so.

You may also wish to remove the bottom 5%. These are not stop-words, but for a lot of machine learning they are inconsequential, and some may even be misspellings.
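A rough sketch of that cut in Python (the 20% and 5% figures are just the ball-park values above, not magic numbers):

```python
from collections import Counter

def frequency_cutoff(all_tokens, top_fraction=0.20, bottom_fraction=0.05):
    """Split the vocabulary into the most frequent ~20% (stop-word candidates)
    and the rarest ~5% (likely noise or misspellings)."""
    counts = Counter(all_tokens)
    vocab = [w for w, _ in counts.most_common()]       # descending frequency
    top_n = int(len(vocab) * top_fraction)
    bottom_n = int(len(vocab) * bottom_fraction)
    frequent = set(vocab[:top_n])
    rare = set(vocab[len(vocab) - bottom_n:]) if bottom_n else set()
    return frequent, rare
```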

Words per "document"

Another way is to analyze words per "document."

In a set of documents, stop-words can be detected by finding words that exist in a large number of documents. They would be useless for categorizing or clustering documents in this particular set.

E.g. a machine learning system categorizing scientific papers might, after analysis, mark the word "abstract" as a stop-word: even though it may occur only once per document, it will in all likelihood appear in almost all of them.

The same would be true for words that are only found in a very limited number of documents. They are likely misspelled or so unique they might never be seen again.

However, in this case it's important that the distribution across document groups in the learning set is even; otherwise a set divided into one large and one small group might lose all of its significant words (since they may exist in too many documents, or in too few).

Another way to avoid problems with unevenly distributed groups in the training set is to only remove words that exist in all or almost all documents (i.e. our favorite stop-words like "a", "it", "the", "an", etc. will exist in all English texts).
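A minimal sketch of that per-document check (the 90% threshold is my own stand-in for "all or almost all"):

```python
from collections import Counter

def docfreq_stopwords(documents, min_doc_fraction=0.9):
    """Return words that occur in at least `min_doc_fraction` of the documents."""
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))                  # each word counted once per document
    return {w for w, df in doc_freq.items() if df / n_docs >= min_doc_fraction}
```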

Zipf's Law

When I studied Machine Learning and the discussion of stop-words came up, Zipf's Law was mentioned. However, today I couldn't tell you how or why, but maybe it's a general principle or mathematical foundation you'd want to look into...

I googled "Zipf's Law automatic stop word detection" and a quick search turned up two PDFs that may be of interest...
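For what it's worth, Zipf's Law says the frequency of the n-th most common word is roughly proportional to 1/n, so rank times frequency stays roughly constant, and the very flat, high-frequency head of that curve is where the stop-words live. A throwaway sketch (mine, not from those PDFs) to eyeball this on your own corpus:

```python
from collections import Counter

def zipf_table(all_tokens, top_n=50):
    """Print rank, frequency and rank*frequency for the most common words;
    under Zipf's Law the last column stays roughly constant."""
    counts = Counter(all_tokens)
    for rank, (word, freq) in enumerate(counts.most_common(top_n), start=1):
        print(f"{rank:4d}  {word:15s}  {freq:8d}  {rank * freq:10d}")
```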

Erk

Usually, stop-words occur much more frequently than the other, semantic words, so while building my application I used a combination of both: a fixed list and a statistical method.

I was using NLTK, which already has a list of some common stop words, so I first removed the words appearing in this list. Of course this didn't remove all the stop-words; as you already mentioned, stop words differ from corpus to corpus.

Then I evaluated the frequency of each word appearing in the corpus and removed the words with a frequency above a "certain limit". This limit was a value I fixed after observing the frequencies of all the words, so it again depends on the corpus, but you can easily determine it once you carefully look at the list of all the words ordered by frequency. This statistical method ensures that you also remove stop-words that do not appear in the list of common stop-words.

After that, to refine the data, I also used POS tagging and removed the proper nouns that still remained after the first two steps.
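A minimal sketch of that three-step pipeline with NLTK (the 1000-occurrence limit is just a placeholder for the "certain limit" above):

```python
import nltk
from collections import Counter
from nltk.corpus import stopwords

# One-time downloads, if you don't already have these NLTK resources:
# nltk.download("stopwords"); nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")

def clean_tokens(texts, freq_limit=1000):
    """Step 1: drop NLTK's fixed stop-word list; step 2: drop words above a
    corpus-specific frequency limit; step 3: drop proper nouns via POS tags."""
    fixed = set(stopwords.words("english"))
    tokenized = [nltk.word_tokenize(t) for t in texts]
    counts = Counter(w.lower() for toks in tokenized for w in toks)

    cleaned = []
    for toks in tokenized:
        toks = [w for w in toks if w.lower() not in fixed]            # fixed list
        toks = [w for w in toks if counts[w.lower()] <= freq_limit]   # frequency limit
        toks = [w for w, tag in nltk.pos_tag(toks) if tag not in ("NNP", "NNPS")]  # proper nouns
        cleaned.append(toks)
    return cleaned
```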

sumitb.mdi

I am not an expert, but hope my answer makes sense.

Statistically extracting stop words from a corpus sounds interesting! I would consider calculating inverse document frequency, as mentioned in the other answers, in addition to using regular stop words from a common stop-word list, like the one in NLTK. Stop words not only vary from corpus to corpus, they may also vary from problem to problem. For example, in one of the problems I was working on, I was using a corpus of news articles, where you find a lot of time-sensitive and location-sensitive words. These were crucial information, and statistically removing words like "today", "here", etc. would have hurt my results badly, because news articles talk about not just one particular event, but also similar events that happened in the past or in another location.

My point, in short, is that you would need to consider the problem being addressed as well, and not just the corpus.

Thanks, Ramya

Ramya

Actually, the common approach to building a stop-word list is to just use the most common words (common across documents, i.e. ranked by document frequency, DF). Build a list of the top 100, 200, or 1000 words, and review it. Just browse the list until you find a word that, in your opinion, should not be a stop-word; then consider either skipping that word or cutting the list off at that point.

In many data sets, you will have domain-specific stopwords. If you use StackOverflow, for example, "java" and "c#" could well be stopwords (and this actually won't do much harm, in particular if you also still use the tags). Other domain-specific stop words could be "code", "implement", "program".
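A minimal sketch of that review workflow, assuming your documents are already tokenized:

```python
from collections import Counter

def df_review_list(documents, top_n=200):
    """Print the top-N words by document frequency for manual review;
    cut the stop-word list off where real content words start to appear."""
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))
    for word, df in doc_freq.most_common(top_n):
        print(f"{df:6d}  {word}")
```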

Has QUIT--Anony-Mousse