
I would like to parse a document using spaCy and apply a token filter so that the final spaCy document does not include the filtered tokens. I know that I can take the filtered sequence of tokens, but I am interested in having the actual Doc structure.

text = u"This document is only an example. " \
    "I would like to create a custom pipeline that will remove specific tokesn from the final document."

doc = nlp(text)

def keep_token(tok):
    # This is only an example rule
    return tok.pos_ not in {'PUNCT', 'NUM', 'SYM'}

final_tokens = list(filter(keep_token, doc))

# How to get a spacy.Doc from final_tokens?

I tried to reconstruct a new spaCy Doc from the token list, but the API does not make it clear how to do it.

Kon Pal

2 Answers


I am pretty sure that you have found your solution by now, but since it is not posted here, I thought it might be useful to add it.

You can remove tokens by converting the doc to a numpy array, deleting the unwanted rows from the array, and then building a new doc from what remains.

Code:

import spacy
from spacy.attrs import LOWER, POS, ENT_TYPE, IS_ALPHA
from spacy.tokens import Doc
import numpy

def remove_tokens_on_match(doc):
    indexes = []
    for index, token in enumerate(doc):
        if token.pos_ in ('PUNCT', 'NUM', 'SYM'):
            indexes.append(index)
    # Export the selected attributes to a numpy array, one row per token.
    np_array = doc.to_array([LOWER, POS, ENT_TYPE, IS_ALPHA])
    # Delete the rows belonging to the tokens we want to remove.
    np_array = numpy.delete(np_array, indexes, axis=0)
    # Build a new Doc from the remaining words and restore the attributes.
    doc2 = Doc(doc.vocab, words=[t.text for i, t in enumerate(doc) if i not in indexes])
    doc2.from_array([LOWER, POS, ENT_TYPE, IS_ALPHA], np_array)
    return doc2

# load the English model
nlp = spacy.load('en_core_web_sm')
doc = nlp(u'This document is only an example. \
I would like to create a custom pipeline that will remove specific tokens from \
the final document.')
print(remove_tokens_on_match(doc))
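
One thing the rebuilt doc2 loses is the original whitespace, so its text prints with a space after every token. Here is a minimal sketch of a variant that keeps each surviving token's trailing whitespace, assuming spaCy v2+ (where the Doc constructor accepts a spaces argument) and reusing the imports above; remove_tokens_keep_spacing is a hypothetical name:

def remove_tokens_keep_spacing(doc):
    # Keep only tokens whose part of speech is outside the filter set.
    kept = [t for t in doc if t.pos_ not in ('PUNCT', 'NUM', 'SYM')]
    # token.whitespace_ is the trailing whitespace of a token ('' or ' '),
    # so the rebuilt text approximates the original spacing.
    return Doc(doc.vocab,
               words=[t.text for t in kept],
               spaces=[bool(t.whitespace_) for t in kept])

You could still copy the linguistic attributes over with to_array/from_array afterwards, as in the function above.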

You can look at a similar question that I answered here.

gdaras
  • Doesn't this approach defeat the purpose of using the numpy array? That is, if you have a large body of text, it's computationally inefficient to loop through all the tokens in a doc to check their tag, then convert to a numpy array and then back to a doc. Wouldn't it be faster to filter tags directly in a list comprehension? I am looking for a way to filter out tokens in a document in one shot, without having to loop through tokens, which is what I would hope the numpy trick would do. However, your code still iterates through each token in the doc, making this computationally slow. Any ideas? – Mark Clements Mar 04 '21 at 09:29

Depending on what you want to do, there are several approaches.

1. Get the original Document

Tokens in spaCy keep a reference to their document, so you can do this:

original_doc = final_tokens[0].doc

This way you can still get PoS, parse data etc. from the original sentence.
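
As a quick check (reusing doc and final_tokens from the question), the filtered tokens still carry their annotations from the original parse:

original_doc = final_tokens[0].doc
assert original_doc is doc  # same underlying Doc object
for tok in final_tokens[:3]:
    # PoS tags and dependency info come from the original parse
    print(tok.text, tok.pos_, tok.dep_, tok.head.text)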

2. Construct a new document without the removed tokens

You can concatenate the text of all the tokens, including their trailing whitespace, and create a new document from the result. See the token docs for information on text_with_ws.

doc = nlp(''.join(tok.text_with_ws for tok in final_tokens))

This is probably not going to give you what you want, though: the PoS tags will not necessarily be the same, and the resulting sentence may not make sense.
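
To make the caveat concrete, you can print the tags from the fresh parse next to those on the original tokens; the two lists may disagree because the filtered text no longer forms natural sentences:

filtered_doc = nlp(''.join(tok.text_with_ws for tok in final_tokens))
# Tags are reassigned by the new parse and may differ from the original doc.
print([(t.text, t.pos_) for t in filtered_doc])
print([(t.text, t.pos_) for t in final_tokens])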

If neither of those was what you had in mind, let me know and maybe I can help.

polm23
  • I am aware of the second solution, and it is basically the workaround we are using at the moment. But it has two problems: 1. the PoS tags may change, exactly as you pointed out; 2. you need to reparse the document, so performance drops. – Kon Pal Jul 31 '17 at 08:39
  • Can you explain what you want to do with the document after you get it? Why do you want to remove the tokens? – polm23 Jul 31 '17 at 08:54