I'm trying to write a custom sentence segmenter in spaCy that returns the whole document as a single sentence.
I wrote a custom pipeline component that does it using the code from here.
I can't get it to work, though: instead of changing the sentence boundaries so the whole document becomes a single sentence, it throws one of two different errors, depending on how the pipeline is set up.
If I create a blank language instance and only add my custom component to the pipeline, I get this error:

    ValueError: Sentence boundary detection requires the dependency parse, which requires a statistical model to be installed and loaded.
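For reference, a minimal sketch of this first case (blank pipeline plus only the custom component, spaCy v2 API; the example text is made up):

```python
import spacy

# Sketch of the failing setup: a blank pipeline with only the
# custom segmenter and no parser (spaCy v2 API).
nlp = spacy.blank('es')

def custom_sbd(doc):
    # Mark the whole document as a single sentence.
    doc[0].sent_start = True
    for i in range(1, len(doc)):
        doc[i].sent_start = False
    return doc

try:
    nlp.add_pipe(custom_sbd, first=True)
    doc = nlp("Una frase. Otra frase.")
    sentences = list(doc.sents)  # raises the ValueError quoted above
except ValueError as err:
    print(err)
```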
If I add the parser component to the pipeline:

    nlp = spacy.blank('es')
    parser = nlp.create_pipe('parser')
    nlp.add_pipe(parser, last=True)

    def custom_sbd(doc):
        print("EXECUTING SBD!!!!!!!!!!!!!!!!!!!!")
        doc[0].sent_start = True
        for i in range(1, len(doc)):
            doc[i].sent_start = False
        return doc

    nlp.begin_training()
    nlp.add_pipe(custom_sbd, first=True)
I get the same error.
If I change the order so the document is parsed first and the sentence boundaries are changed afterwards, the error changes to:

    Refusing to write to token.sent_start if its document is parsed, because this may cause inconsistent state.
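A sketch of this second ordering (parser first, custom component last; again spaCy v2 API, with made-up example text):

```python
import spacy

def custom_sbd(doc):
    # Runs after the parser in this ordering.
    doc[0].sent_start = True  # in spaCy v2, this write is refused on a parsed doc
    for i in range(1, len(doc)):
        doc[i].sent_start = False
    return doc

try:
    nlp = spacy.blank('es')
    parser = nlp.create_pipe('parser')
    nlp.add_pipe(parser, last=True)
    nlp.begin_training()                 # initialize the parser's weights
    nlp.add_pipe(custom_sbd, last=True)  # custom component now runs *after* the parser
    doc = nlp("Una frase. Otra frase.")  # raises inside custom_sbd
except ValueError as err:
    print(err)
```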
So it throws one error when the dependency parse is missing (or when the parser runs after the custom sentence boundary detection), and a different error when the dependency parse runs first. What's the appropriate way to do this?
Thank you!