I'm trying to figure out how to remove stop words from a spaCy Doc
object while retaining the original parent object with all its attributes.
import en_core_web_md
nlp = en_core_web_md.load()
sentence = "The frigate was decommissioned following Britain's declaration of peace with France in 1763, but returned to service in 1766 for patrol duties in the Caribbean"
tokens = nlp(sentence)
print("Parent type:", type(tokens))
print("Token type:", type(tokens[0]))
print("Sentence vector:", tokens.vector)
print("Word vector:", tokens[0].vector)
This returns:
Parent type: <class 'spacy.tokens.doc.Doc'>
Token type: <class 'spacy.tokens.token.Token'>
Sentence vector: [ 8.35970342e-02 1.38482109e-01 7.71872401e-02 -7.14236796e-02
...]
Word vector: [ 2.7204e-01 -6.2030e-02 -1.8840e-01 2.3225e-02 -1.8158e-02 6.7192e-03
...]
The typical solution for removing stop words is a list comprehension:
noStopWords = [t for t in tokens if not t.is_stop]
print("Parent type:", type(noStopWords))
print("Token type:", type(noStopWords[0]))
try:
    print("Sentence vector:", noStopWords.vector)
except AttributeError as e:
    print(e)
try:
    print("Word vector:", noStopWords[0].vector)
except AttributeError as e:
    print(e)
Since the parent object is now a list of Token objects rather than a Doc, it no longer has the original attributes, and the code returns:
Parent type: <class 'list'>
Token type: <class 'spacy.tokens.token.Token'>
'list' object has no attribute 'vector'
Word vector: [ 9.4139e-01 -5.9546e-01 5.5007e-01 3.7544e-01 2.3021e-02 -4.4260e-01
...]
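As an aside, I believe Doc.vector is just the average of the token vectors for models with static vectors like en_core_web_md, so I could roughly reconstruct the filtered sentence vector myself (a sketch of that idea below), but that only gets me the vector back, not the rest of the Doc attributes:
import numpy as np
# Rough sketch: average the remaining tokens' vectors by hand.
# This only approximates Doc.vector; sents, noun_chunks and the
# other Doc attributes are still gone.
manualVector = np.mean([t.vector for t in noStopWords], axis=0)
print("Manual sentence vector:", manualVector)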
So the only (rather terrible) way I could find is to rebuild a string from the remaining tokens and run it through the pipeline again. This sucks: it's double the work, and the nlp call is already slow.
noStopWordsDoc = nlp(' '.join([t.text for t in noStopWords]))
print("Parent type:", type(noStopWordsDoc))
print("Token type:", type(noStopWordsDoc[0]))
try:
    print("Sentence vector:", noStopWordsDoc.vector)
except AttributeError as e:
    print(e)
try:
    print("Word vector:", noStopWordsDoc[0].vector)
except AttributeError as e:
    print(e)
Parent type: <class 'spacy.tokens.doc.Doc'>
Token type: <class 'spacy.tokens.token.Token'>
Sentence vector: [ 9.78216752e-02 1.06186338e-01 1.66255698e-01 -9.38376933e-02
...]
Now, there must be a better way, right?