I'm using spaCy to pre-process some data. However, I'm stuck on how to modify the content of a spacy.tokens.doc.Doc object.
For example, here:
import spacy

npc = spacy.load("pt")

def pre_process_text(doc) -> str:
    new_content = ""
    current_tkn = doc[0]
    for idx, next_tkn in enumerate(doc[1:], start=0):
        # Pre-process data
        # new_content -> currently, this is how I generate the
        # new content: by concatenating the modified tokens
        ...
    return new_content

npc.add_pipe(pre_process_text, last=True)
In the commented part of the code above, there are some tokens that I would like to remove from the doc
parameter, and others whose text I would like to change. In other words, I want to modify the content of the spacy.tokens.doc.Doc
by (1) removing tokens entirely, or (2) changing the text of tokens.
Is there a way to create another spacy.tokens.doc.Doc with those modified tokens while keeping the Vocab from npc = spacy.load("pt")?
Currently, I'm generating the new content by returning a string, but is there a way to return the modified Doc?
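To make the goal concrete, here is a rough sketch of the kind of result I have in mind. It assumes a new Doc can be built directly from an existing Vocab plus a list of words and a list of trailing-space flags. The names rebuild_doc, drop, and replace are made up for illustration, and the bare Vocab() is only a stand-in for the vocabulary of the loaded model, so the snippet runs without one.

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab


def rebuild_doc(doc, drop=frozenset(), replace=None):
    """Return a new Doc that shares doc.vocab, with tokens dropped or rewritten.

    `drop` (texts to remove) and `replace` (old text -> new text) are
    hypothetical parameters used only for this illustration.
    """
    replace = replace or {}
    words, spaces = [], []
    for token in doc:
        if token.text in drop:
            continue  # (1) remove the token entirely
        words.append(replace.get(token.text, token.text))  # (2) change its text
        spaces.append(bool(token.whitespace_))  # keep the original trailing space
    return Doc(doc.vocab, words=words, spaces=spaces)


# Stand-in for a Doc produced by the pipeline (assumption: no model loaded here).
original = Doc(
    Vocab(),
    words=["Olá", ",", "mundo", "cruel", "!"],
    spaces=[False, True, True, False, False],
)
modified = rebuild_doc(original, drop={"cruel"}, replace={"mundo": "Mundo"})
print([token.text for token in modified])
```

A Doc rebuilt this way carries only the words and spacing, so any annotations set on the original tokens would not be copied over; whether that matters depends on where the component sits in the pipeline.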