I would like to parse a document using spaCy and apply a token filter so that the final spaCy document does not include the filtered tokens. I know that I can keep the filtered sequence of tokens, but I am interested in having the actual Doc structure.
import spacy

# Assumption: any English pipeline with a POS tagger; en_core_web_sm is one example
nlp = spacy.load("en_core_web_sm")
text = u"This document is only an example. " \
    "I would like to create a custom pipeline that will remove specific tokens from the final document."
doc = nlp(text)

def keep_token(tok):
    # This is only an example rule
    return tok.pos_ not in {'PUNCT', 'NUM', 'SYM'}

final_tokens = list(filter(keep_token, doc))
# How to get a spacy.Doc from final_tokens?
I tried to reconstruct a new spaCy Doc from the list of tokens, but it is not clear from the API how to do this.
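
One direction that seems possible is to build a fresh Doc from the text and trailing whitespace of the kept tokens. The following is only a minimal sketch of that idea, not a confirmed solution: `filtered_doc` is a hypothetical helper name, `spacy.blank("en")` is assumed as a tokenizer-only stand-in so that no model download is needed, and the lexical flags `is_punct` and `like_num` stand in for the POS rule because a blank pipeline has no tagger.

```python
import spacy
from spacy.tokens import Doc


def filtered_doc(doc, keep):
    """Hypothetical helper: build a new Doc from the tokens for which keep(tok) is true."""
    kept = [tok for tok in doc if keep(tok)]
    words = [tok.text for tok in kept]
    # True where the original token was followed by whitespace
    spaces = [bool(tok.whitespace_) for tok in kept]
    # The new Doc shares the vocabulary of the original one
    return Doc(doc.vocab, words=words, spaces=spaces)


# Assumption: a blank English pipeline (tokenizer only)
nlp = spacy.blank("en")
doc = nlp("This document is only an example. It has 2 sentences.")

# Example rule: drop punctuation and number-like tokens
new_doc = filtered_doc(doc, lambda tok: not (tok.is_punct or tok.like_num))
print([tok.text for tok in new_doc])
```

A Doc built this way carries only the words and spacing. Annotations such as POS tags, dependencies, and entities from the original Doc are not copied over, and a token that preceded removed punctuation keeps its original (empty) trailing whitespace, so the text of the new Doc may run words together.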