Stop words in spaCy are just a set of strings that set a flag on the lexemes, the context-independent entries in the vocabulary (see here for the English stop list). The flag simply checks whether the text is in STOP_WORDS, which is why "something" returns True for is_stop and "somethings" doesn't.
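To make the exact-match behaviour concrete, here is a minimal sketch in plain Python. It does not call spaCy at all: TOY_STOP_WORDS and exact_match_is_stop are hypothetical names, and the toy set only stands in for the real (much longer) stop list.

```python
# Hypothetical toy stop list; spaCy's real English list is much longer.
TOY_STOP_WORDS = {"something", "the", "and"}


def exact_match_is_stop(text):
    """Mimic the flag as described above: a plain set-membership test."""
    return text in TOY_STOP_WORDS


for word in ("something", "somethings"):
    # Only the exact string is found; the plural form is not in the set.
    print(word, exact_match_is_stop(word))
```

Because the lookup is a plain membership test on the exact string, any variation that isn't itself in the set is missed.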
However, what you can do is check if the token's lemma or lowercase form is part of the stop list, which is available via nlp.Defaults.stop_words (i.e. the defaults of the language you're using):
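# assumes nlp is an already loaded pipeline, e.g. nlp = spacy.load('en')
def extended_is_stop(token):
    stop_words = nlp.Defaults.stop_words
    # fall back to the lowercase form and the lemma if the exact text isn't a stop word
    return token.is_stop or token.lower_ in stop_words or token.lemma_ in stop_words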
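If you want to see how the three checks combine without loading a model, the following sketch runs the same logic against stand-in objects. Everything here is an assumption made for illustration: fake_token is a hypothetical helper, the stop list is a one-word toy set, the lemmas are assigned by hand (a real pipeline may lemmatize differently), and this variant of extended_is_stop takes the stop list as a second argument.

```python
from types import SimpleNamespace

# Hypothetical one-word stop list standing in for nlp.Defaults.stop_words.
stop_words = {"something"}


def extended_is_stop(token, stop_words):
    # Same three checks as the function above, with the stop list passed in.
    return token.is_stop or token.lower_ in stop_words or token.lemma_ in stop_words


def fake_token(text, lemma):
    # Stand-in exposing only the three attributes the function reads.
    # is_stop mimics the exact-match flag; the lemma is assigned by hand.
    return SimpleNamespace(
        is_stop=text in stop_words,
        lower_=text.lower(),
        lemma_=lemma,
    )


tokens = [
    fake_token("something", "something"),   # exact match
    fake_token("Something", "something"),   # caught by the lowercase check
    fake_token("somethings", "something"),  # caught by the lemma check
    fake_token("nothing", "nothing"),       # not in the toy list at all
]
results = [extended_is_stop(t, stop_words) for t in tokens]
print(results)
```

With real spaCy tokens you would pass the token itself, and whether the lemma check fires depends on what the pipeline's lemmatizer produces.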
If you're using spaCy v2.0 and want to solve this even more elegantly, you could also implement your own is_stop function via a custom Token attribute extension. You can choose any name for your attribute and it will become available under token._ (for example, token._.is_stop):
import spacy
from spacy.tokens import Token
from spacy.lang.en.stop_words import STOP_WORDS  # import stop words from language data

# a token counts as a stop word if its text, lowercase form or lemma is in the stop list
stop_words_getter = lambda token: token.is_stop or token.lower_ in STOP_WORDS or token.lemma_ in STOP_WORDS
Token.set_extension('is_stop', getter=stop_words_getter)  # set attribute with getter

nlp = spacy.load('en')
doc = nlp("something Something somethings")
assert doc[0]._.is_stop  # this was a stop word before, and still is
assert doc[1]._.is_stop  # this is now also a stop word, because its lowercase form is
assert doc[2]._.is_stop  # this is now also a stop word, because its lemma is