
I'm using spaCy to pre-process some data, but I'm stuck on how to modify the content of the spacy.tokens.doc.Doc class.

For example, here:

import spacy

npc = spacy.load("pt")

def pre_process_text(doc) -> str:
    new_content = ""
    current_tkn = doc[0]
    for idx, next_tkn in enumerate(doc[1:], start=0):
        # Pre-process data here.
        # new_content is how I'm currently building the result,
        # by concatenating the modified token texts.
        pass
    return new_content

npc.add_pipe(pre_process_text, last=True)

In the commented part of the code above, there are some tokens that I would like to remove from the doc parameter, or whose text content I would like to change. In other words, I want to modify the content of spacy.tokens.doc.Doc by (1) removing tokens entirely, or (2) changing their text.

Is there a way to create another spacy.tokens.doc.Doc with those modified tokens, while keeping the Vocab from npc = spacy.load("pt")?

Currently, I'm generating the new content by returning a string, but is there a way to return the modified Doc instead?

Paulo Mann

1 Answer


One of the core principles of spaCy's Doc is that it should always represent the original input:

spaCy's tokenization is non-destructive, so it always represents the original input text and never adds or deletes anything. This is kind of a core principle of the Doc object: you should always be able to reconstruct and reproduce the original input text.

While you can work around that, there are usually better ways to achieve the same thing without breaking the input text ↔ Doc text consistency.

I've outlined some approaches for excluding tokens without destroying the original input in my comment here.
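
One such approach is to mark tokens for exclusion rather than deleting them, using a custom extension attribute, and then filter on that attribute wherever you consume the tokens. Here's a minimal sketch; the attribute name is_excluded is my own choice, not a spaCy built-in, and I'm building the Doc from a bare Vocab() purely for illustration:

```python
from spacy.tokens import Doc, Token
from spacy.vocab import Vocab

# Register a custom attribute on Token; "is_excluded" is a made-up name.
Token.set_extension("is_excluded", default=False, force=True)

doc = Doc(Vocab(), words=["Hello", ",", "world", "!"],
          spaces=[False, True, False, False])
doc[1]._.is_excluded = True  # mark the comma for exclusion
doc[3]._.is_excluded = True  # mark the exclamation mark for exclusion

# Filter at the point of use; the Doc itself is untouched.
kept = [t.text for t in doc if not t._.is_excluded]
# kept is ["Hello", "world"], while doc.text is still "Hello, world!"
```

This keeps the original text fully reconstructable from the Doc, which any later pipeline component can still rely on.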

Alternatively, if you really want to modify the Doc, your component can create a new Doc object and return that. The Doc constructor takes a vocab (e.g. the original doc's vocab), a list of word strings, and an optional list of spaces: booleans indicating whether the token at each position is followed by a space.

from spacy.tokens import Doc

def pre_process_text(doc):
    # Generate a new list of tokens here
    new_words = create_new_words_here(doc)
    new_doc = Doc(doc.vocab, words=new_words)
    return new_doc
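
For example, here's a minimal standalone sketch of that constructor (using a bare Vocab() for illustration; inside a component you'd pass doc.vocab so the vocabulary is shared with the pipeline):

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab

words = ["Hello", "world", "!"]
spaces = [True, False, False]  # is token i followed by a space?
new_doc = Doc(Vocab(), words=words, spaces=spaces)
print(new_doc.text)  # "Hello world!"
```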

Note that you probably want to add this component first in the pipeline, before other components run. Otherwise, you'd lose any linguistic features assigned by the previous components (like part-of-speech tags, dependencies, etc.).
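
To make that concrete, here's a hedged sketch of such a component that drops punctuation tokens and returns a rebuilt Doc. The is_punct filter is just an example condition, and the commented add_pipe call uses the v2-style API from the question (in spaCy v3 you'd register the function with @Language.component and add it by name):

```python
from spacy.tokens import Doc

def pre_process_text(doc):
    # Keep only non-punctuation tokens; swap in your own condition.
    kept = [t for t in doc if not t.is_punct]
    words = [t.text for t in kept]
    # Preserve each kept token's trailing-space information.
    spaces = [t.whitespace_ != "" for t in kept]
    return Doc(doc.vocab, words=words, spaces=spaces)

# v2-style registration, running before any other component:
# nlp.add_pipe(pre_process_text, first=True)
```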

Ines Montani