
I'm using spaCy to pre-process some data, but I'm stuck on how to modify the content of the spacy.tokens.doc.Doc class.

For example, here:

import spacy

npc = spacy.load("pt")

def pre_process_text(doc) -> str:
    new_content = ""
    current_tkn = doc[0]
    for idx, next_tkn in enumerate(doc[1:], start=0):
        # Pre-process data here.
        # new_content is how I'm currently building the result,
        # by concatenating the modified token texts.
        pass
    return new_content

npc.add_pipe(pre_process_text, last=True)

In the commented part of the code above, there are some tokens that I would like to remove from the doc parameter, or whose text content I would like to change. In other words, I want to modify the content of spacy.tokens.doc.Doc by (1) removing tokens entirely, or (2) changing their text.

Is there a way to create another spacy.tokens.doc.Doc with those modified tokens, while keeping the Vocab from npc = spacy.load("pt")?

Currently, I'm generating the new content by returning a string, but is there a way to return the modified Doc instead?

Paulo Mann

1 Answer


One of the core principles of spaCy's Doc is that it should always represent the original input:

spaCy's tokenization is non-destructive, so it always represents the original input text and never adds or deletes anything. This is kind of a core principle of the Doc object: you should always be able to reconstruct and reproduce the original input text.

While you can work around that, there are usually better ways to achieve the same thing without breaking the input text ↔ Doc text consistency.

I've outlined some approaches for excluding tokens without destroying the original input in my comment here.
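
One such approach is to mark tokens for exclusion rather than deleting them, using a custom extension attribute, and then filter on that attribute wherever you consume the tokens. Here's a minimal sketch; the attribute name is_excluded is my own choice, not a spaCy built-in, and I'm building the Doc from a bare Vocab() purely for illustration:

```python
from spacy.tokens import Doc, Token
from spacy.vocab import Vocab

# Register a custom attribute on Token; "is_excluded" is a made-up name.
Token.set_extension("is_excluded", default=False, force=True)

doc = Doc(Vocab(), words=["Hello", ",", "world", "!"],
          spaces=[False, True, False, False])
doc[1]._.is_excluded = True  # mark the comma for exclusion
doc[3]._.is_excluded = True  # mark the exclamation mark for exclusion

# Filter at the point of use; the Doc itself is untouched.
kept = [t.text for t in doc if not t._.is_excluded]
# kept is ["Hello", "world"], while doc.text is still "Hello, world!"
```

This keeps the original text fully reconstructable from the Doc, which any later pipeline component can still rely on.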

Alternatively, if you really want to modify the Doc, your component can create a new Doc object and return that. The Doc constructor takes a vocab (e.g. the original doc's vocab), a list of word strings, and an optional list of spaces: booleans indicating whether the token at each position is followed by a space.

from spacy.tokens import Doc

def pre_process_text(doc):
    # Generate a new list of tokens here
    new_words = create_new_words_here(doc)
    new_doc = Doc(doc.vocab, words=new_words)
    return new_doc
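
For example, here's a minimal standalone sketch of that constructor (using a bare Vocab() for illustration; inside a component you'd pass doc.vocab so the vocabulary is shared with the pipeline):

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab

words = ["Hello", "world", "!"]
spaces = [True, False, False]  # is token i followed by a space?
new_doc = Doc(Vocab(), words=words, spaces=spaces)
print(new_doc.text)  # "Hello world!"
```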

Note that you probably want to add this component first in the pipeline, before other components run. Otherwise, you'd lose any linguistic features assigned by the previous components (like part-of-speech tags, dependencies, etc.).
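
To make that concrete, here's a hedged sketch of such a component that drops punctuation tokens and returns a rebuilt Doc. The is_punct filter is just an example condition, and the commented add_pipe call uses the v2-style API from the question (in spaCy v3 you'd register the function with @Language.component and add it by name):

```python
from spacy.tokens import Doc

def pre_process_text(doc):
    # Keep only non-punctuation tokens; swap in your own condition.
    kept = [t for t in doc if not t.is_punct]
    words = [t.text for t in kept]
    # Preserve each kept token's trailing-space information.
    spaces = [t.whitespace_ != "" for t in kept]
    return Doc(doc.vocab, words=words, spaces=spaces)

# v2-style registration, running before any other component:
# nlp.add_pipe(pre_process_text, first=True)
```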

Ines Montani