
I'm trying to figure out how to remove stop words from a spaCy Doc object while retaining the original parent object with all its attributes.

import en_core_web_md
nlp = en_core_web_md.load()

sentence = "The frigate was decommissioned following Britain's declaration of peace with France in 1763, but returned to service in 1766 for patrol duties in the Caribbean"

tokens = nlp(sentence)
print("Parent type:", type(tokens))
print("Token type:", type(tokens[0]))
print("Sentence vector:", tokens.vector)
print("Word vector:", tokens[0].vector)

returns:

Parent type: <class 'spacy.tokens.doc.Doc'>
Token type: <class 'spacy.tokens.token.Token'>
Sentence vector: [ 8.35970342e-02  1.38482109e-01  7.71872401e-02 -7.14236796e-02
...]
Word vector: [ 2.7204e-01 -6.2030e-02 -1.8840e-01  2.3225e-02 -1.8158e-02  6.7192e-03
...]

Typical solutions for removing stop words use a list comprehension:

noStopWords = [t for t in tokens if not t.is_stop]
print("Parent type:", type(noStopWords))
print("Token type:", type(noStopWords[0]))
try:
    print("Sentence vector:", noStopWords.vector)
except AttributeError as e:
    print(e)
try:
    print("Word vector:", noStopWords[0].vector)
except AttributeError as e:
    print(e)

Since the parent object is now a list of Token objects rather than a Doc, it no longer has the original attributes, so the code returns:

Parent type: <class 'list'>
Token type: <class 'spacy.tokens.token.Token'>
'list' object has no attribute 'vector'
Word vector: [ 9.4139e-01 -5.9546e-01  5.5007e-01  3.7544e-01  2.3021e-02 -4.4260e-01
...]

The only workaround I could find is rather terrible: rebuild a string from the remaining tokens and reprocess it. That's double work, and the nlp call is already slow.

noStopWordsDoc = nlp(' '.join([t.text for t in noStopWords]))
print("Parent type:", type(noStopWordsDoc))
print("Token type:", type(noStopWordsDoc[0]))
try:
    print("Sentence vector:", noStopWordsDoc.vector)
except AttributeError as e:
    print(e)
try:
    print("Word vector:", noStopWordsDoc[0].vector)
except AttributeError as e:
    print(e)
Parent type: <class 'spacy.tokens.doc.Doc'>
Token type: <class 'spacy.tokens.token.Token'>
Sentence vector: [ 9.78216752e-02  1.06186338e-01  1.66255698e-01 -9.38376933e-02
...]

Now, there must be a better way, right?

mrgou

1 Answer


Directly quoting one of the developers of spaCy, Ines Montani:

One of the core principles of spaCy's Doc is that it should always represent the original input:

spaCy's tokenization is non-destructive, so it always represents the original input text and never adds or deletes anything. This is kind of a core principle of the Doc object: you should always be able to reconstruct and reproduce the original input text.

Refer to this answer: Can a token be removed from a spaCy document during pipeline processing?

thorntonc