Questions tagged [stop-words]

Stop words are words that are filtered out before (or after) the processing of natural language data.

In computing, stop words are words that are filtered out before or after the processing of natural language data (text). There is no single definitive list of stop words that all tools use, and some tools do not use one at all. Some tools specifically avoid removing stop words in order to support phrase search.

Any group of words can be chosen as the stop words for a given purpose. For some search engines, these are some of the most common, short function words, such as "the", "is", "at", "which", and "on". In this case, stop words can cause problems when searching for phrases that include them, particularly in names such as "The Who", "The The", or "Take That". Other search engines remove some of the most common words, including lexical words such as "want", from a query in order to improve performance.
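
As a rough illustration of the phrase-search problem described above, here is a minimal Python sketch. The stop list `STOP_WORDS` and the helper `remove_stop_words` are hypothetical names chosen for this example; they are not any particular tool's list or API.

```python
# Hypothetical, deliberately tiny stop list; real tools ship their own lists.
STOP_WORDS = {"the", "is", "at", "which", "on"}


def remove_stop_words(text, stop_words=STOP_WORDS):
    """Return the whitespace-separated tokens of `text` that are not stop words."""
    return [tok for tok in text.split() if tok.lower() not in stop_words]


print(remove_stop_words("The cat is on the mat"))  # ['cat', 'mat']
print(remove_stop_words("The Who"))                # ['Who']
print(remove_stop_words("The The"))                # []
```

With this list, a query for "The The" is filtered down to nothing, which is why some tools keep stop words when phrase search matters.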

See also: Stop words - Wikipedia

671 questions
139
votes
13 answers

How to remove stop words using nltk or python

I have a dataset from which I would like to remove stop words. I used NLTK to get a list of stop words: from nltk.corpus import stopwords; stopwords.words('english'). Exactly how do I compare the data to the list of stop words, and thus remove the…
Alex
  • 1,853
  • 5
  • 16
  • 15
79
votes
6 answers

Stopword removal with NLTK

I am trying to process user-entered text by removing stopwords using the NLTK toolkit, but with stopword removal, words like 'and', 'or', and 'not' get removed. I want these words to be present after the stopword removal process, as they are operators…
Grahesh Parkar
  • 1,017
  • 1
  • 13
  • 16
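
One possible approach to the situation in this excerpt is to subtract the words that must be kept from the stop list before filtering. The sketch below assumes a small hand-written set, `base_stop_words`, as a stand-in for a library list; all names in it are hypothetical.

```python
# Hypothetical stand-in for a library stop list (for example, the list
# returned by NLTK's stopwords.words('english')); kept tiny so the
# example is self-contained.
base_stop_words = {"a", "an", "the", "and", "or", "not", "is", "of"}

# Words that act as operators in the query and must survive filtering.
operators = {"and", "or", "not"}

# Set difference drops the operators from the stop list.
stop_words = base_stop_words - operators

query = "the price of apples and not the price of pears"
kept = [w for w in query.split() if w not in stop_words]
print(kept)  # ['price', 'apples', 'and', 'not', 'price', 'pears']
```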
68
votes
7 answers

NLTK and Stopwords Fail #lookuperror

I am trying to start a sentiment analysis project and I will use the stop words method. I did some research and found that NLTK has stopwords, but when I execute the command there is an error. What I do is the following, in order to know which…
Facundo
  • 729
  • 2
  • 6
  • 7
66
votes
8 answers

Add/remove custom stop words with spacy

What is the best way to add/remove stop words with spaCy? I am using token.is_stop and would like to make some custom changes to the set. I was looking at the documentation but could not find anything regarding stop words. Thanks!
E.K.
  • 4,179
  • 8
  • 30
  • 50
60
votes
6 answers

Faster way to remove stop words in Python

I am trying to remove stopwords from a string of text: from nltk.corpus import stopwords; text = 'hello bye the the hi'; text = ' '.join([word for word in text.split() if word not in (stopwords.words('english'))]). I am processing 6 mil of such…
mchangun
  • 9,814
  • 18
  • 71
  • 101
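
The snippet in this excerpt evaluates stopwords.words('english') for every word and tests membership against a list. One commonly suggested speed-up is to build the stop list once and store it in a set, so each membership test takes roughly constant time. The sketch below uses a small hypothetical set, `STOP_WORDS`, instead of the NLTK corpus so that it runs without any downloads; `strip_stop_words` is likewise a made-up helper name.

```python
# Hypothetical stand-in for a library stop list; with NLTK one would build
# the set once from stopwords.words('english') instead.
STOP_WORDS = frozenset({"the", "a", "an", "is", "of", "and"})


def strip_stop_words(text, stop_words=STOP_WORDS):
    """Drop stop words from a whitespace-tokenized string."""
    return " ".join(w for w in text.split() if w not in stop_words)


print(strip_stop_words("hello bye the the hi"))  # hello bye hi
```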
37
votes
1 answer

Adding words to scikit-learn's CountVectorizer's stop list

Scikit-learn's CountVectorizer class lets you pass a string 'english' to the argument stop_words. I want to add some things to this predefined list. Can anyone tell me how to do this?
statsNoob
  • 1,325
  • 5
  • 18
  • 36
29
votes
3 answers

SQL 2008: Turn off Stop Words for Full Text Search Query

I'm having quite a bit of difficulty finding a good solution for this: Let's say I have a table of "Company", with a column called "Name". I have a full-text catalog on this column. If a user searched for "Very Good Company", my query would…
John
  • 17,163
  • 16
  • 65
  • 83
28
votes
3 answers

adding words to stop_words list in TfidfVectorizer in sklearn

I want to add a few more words to stop_words in TfidfVectorizer. I followed the solution in "Adding words to scikit-learn's CountVectorizer's stop list". My stop word list now contains both 'english' stop words and the stop words I specified. But…
ac11
  • 927
  • 2
  • 11
  • 18
25
votes
1 answer

What is the default list of stopwords used in Lucene's StopFilter?

Lucene has a default StopFilter (http://lucene.apache.org/core/4_0_0/analyzers-common/org/apache/lucene/analysis/core/StopFilter.html). Does anyone know which words are in the list?
alvas
  • 115,346
  • 109
  • 446
  • 738
23
votes
10 answers

Adding words to nltk stoplist

I have some code that removes stop words from my data set. As the stop list doesn't seem to remove a majority of the words I would like it to, I'm looking to add words to this stop list so that it will remove them for this case. The code I'm using…
Alex
  • 1,853
  • 5
  • 16
  • 15
21
votes
6 answers

"Stop words" list for English?

I'm generating some statistics for some English-language text and I would like to skip uninteresting words such as "a" and "the". Where can I find some lists of these uninteresting words? Is a list of these words the same as a list of the most…
Mark Harrison
  • 297,451
  • 125
  • 333
  • 465
20
votes
4 answers

User Warning: Your stop_words may be inconsistent with your preprocessing

I am following this document clustering tutorial. As input I give a txt file which can be downloaded here. It is a combination of 3 other txt files, separated by \n. After creating a tf-idf matrix I received this warning: "UserWarning:…
20
votes
3 answers

How to remove stopwords efficiently from a list of ngram tokens in R

Here's an appeal for a better way to do something that I can already do inefficiently: filter a series of n-gram tokens using "stop words" so that the occurrence of any stop word term in an n-gram triggers removal. I'd very much like to have one…
Ken Benoit
  • 14,454
  • 27
  • 50
20
votes
4 answers

Tokenizer, Stop Word Removal, Stemming in Java

I am looking for a class or method that takes a long string of many hundreds of words, tokenizes it, removes the stop words, and stems the result for use in an IR system. For example: "The big fat cat, said 'your funniest guy i know' to the kangaroo..." the…
Phil
  • 665
  • 5
  • 9
  • 14
18
votes
3 answers

Full text search does not work if stop word is included even though stop word list is empty

I would like to be able to search for every word, so I have cleared the stop word list. Then I rebuilt the index. But unfortunately, if I type in a search expression with a stop word in it, it still returns no rows. If I leave out just the stop word I do…
apolka
  • 1,711
  • 4
  • 16
  • 23