
How to detect if word is a stopword after stemming and lemmatization in spaCy?

Assume the sentence

s = "something good\nSomething 2 somethings"

In this case "something" is a stopword. Obviously (to me?) "Something" and "somethings" are also stopwords, but they need to be stemmed first. The following script says that the first is true, but the latter two aren't.

import spacy
from spacy.tokenizer import Tokenizer

nlp = spacy.load('en')
tokenizer = Tokenizer(nlp.vocab)  # bare tokenizer: no tagger, so no context-aware lemmas

s = "something good\nSomething 2 somethings"
tokens = tokenizer(s)

for token in tokens:
    print(token.lemma_, token.is_stop)

Returns:

something True
good False
"\n" False
Something False
2 False
somethings False

Is there a way to detect that through the spaCy API?

Dawid Laszuk

1 Answer


Stop words in spaCy are just a set of strings which set a flag on the lexemes, the context-independent entries in the vocabulary (see here for the English stop list). The flag simply checks whether the token's exact text is in STOP_WORDS, which is why "something" returns True for is_stop and "somethings" doesn't.
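For illustration, here's a minimal sketch of checking that flag on the lexemes directly (assuming the same 'en' model as in the question):

import spacy

nlp = spacy.load('en')

# The flag lives on the context-independent lexeme, keyed by the exact string:
print(nlp.vocab['something'].is_stop)   # True: "something" is in STOP_WORDS
print(nlp.vocab['somethings'].is_stop)  # False: "somethings" is not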

However, what you can do is check if the token's lemma or lowercase form is part of the stop list, which is available via nlp.Defaults.stop_words (i.e. the defaults of the language you're using):

def extended_is_stop(token):
    # also treat a token as a stop word if its lowercase form or lemma is one
    stop_words = nlp.Defaults.stop_words
    return token.is_stop or token.lower_ in stop_words or token.lemma_ in stop_words
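For example, run on a doc from the full pipeline (whose tagger lets the lemmatizer map "somethings" to "something", as the asserts further down rely on), usage might look like this:

doc = nlp("something Something somethings")
for token in doc:
    print(token.text, extended_is_stop(token))
# something True   (exact match in the stop list)
# Something True   (its lowercase form is a stop word)
# somethings True  (its lemma "something" is a stop word)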

If you're using spaCy v2.0 and want to solve this even more elegantly, you could also implement your own is_stop function via a custom Token attribute extension. You can choose any name for your attribute and it will become available via token._., for example token._.is_stop:

import spacy
from spacy.tokens import Token
from spacy.lang.en.stop_words import STOP_WORDS  # import stop words from language data

stop_words_getter = lambda token: token.is_stop or token.lower_ in STOP_WORDS or token.lemma_ in STOP_WORDS
Token.set_extension('is_stop', getter=stop_words_getter)  # set attribute with getter

nlp = spacy.load('en')
doc = nlp("something Something somethings")
assert doc[0]._.is_stop  # this was a stop word before, and still is
assert doc[1]._.is_stop  # this is now also a stop word, because its lowercase form is
assert doc[2]._.is_stop  # this is now also a stop word, because its lemma is
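Note that the extension lives in its own `token._.` namespace, so the custom `is_stop` doesn't clash with the built-in `token.is_stop`; both remain available side by side.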
Ines Montani
  • Thanks, this is good, but I was hoping I wouldn't have to use a Python-level method. It'd be nice to have something lower level. Do you think this is the only option? – Dawid Laszuk Nov 28 '17 at 17:19
  • The Cython API is actually quite nice to work with if you want very low level access. You can call `Lexeme.set_struct_attr(lex_ptr, attr_id, attr_value)` to set a value onto the lexeme. You can get a pointer to the lexeme by looking it up with `vocab.get()` or from `doc.c[i].lex`. With the latter, you'll need to cast through const to modify, because all tokens of the same type point to the same lexeme struct. The fact that the lexeme is shared explains why the stop words behave like they do: the flag is set on the lexeme, which is keyed by an exact string. – syllogism_ Nov 28 '17 at 18:51
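As a lighter-weight follow-up to the Cython route above, here's a sketch of flipping the same flag at the Python level instead, assuming spaCy v2 (where lexeme attributes are writable) and the `nlp` object from the answer:

# Setting the flag on the shared lexeme affects every token with that exact text:
nlp.vocab['somethings'].is_stop = True

doc = nlp("somethings are odd")
assert doc[0].is_stop  # the built-in flag now reports True, no custom getter needed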