I know similar questions have been asked:

Spacy custom sentence spliting

Custom sentence boundary detection in SpaCy

but my situation is a little different. I want to subclass spaCy's Sentencizer like this:

from spacy.pipeline import Sentencizer

class MySentencizer(Sentencizer):
    def __init__(self):
        self.tok = create_mySentencizer() # returning the sentences

    def __call__(self, *args, **kwargs):
        doc = args[0]
        for tok in doc:
            # set the sentence boundaries here via tok.is_sent_start
            pass
        return doc

Although splitting works fine if I call doc = nlp("Text and so on. Another sentence.") after updating the model:

nlp = spacy.load("some_model")
sentencizer = MySentencizer()
nlp.add_pipe(sentencizer, before="parser")
# update model

when I want to save the trained model with:

nlp.to_disk("path/to/my/model")

I get the following error:

AttributeError: 'MySentencizer' object has no attribute 'punct_chars'

By contrast, if I use nlp.add_pipe(nlp.create_pipe('sentencizer')), the error does not occur. I wonder at what point I should have set the punct_chars attribute. Shouldn't it have been inherited from the superclass?

If I drop Sentencizer as the base class and inherit from object instead, as in the first linked post, it works, but might I lose some valuable information along the way, e.g. punct_chars?

Thanks in advance for your help.

Chris

Sergey Bushmanov
ChrisDelClea
  • Be aware that you probably don't want to extend the `Sentencizer` like this. As it is, `nlp()` will call your custom method `__call__` but `nlp.pipe()` will call the `Sentencizer.pipe`, which will apply a completely different sentence segmentation. Instead, if you're concerned about serialization, you can implement dummy `to/from_bytes/disk` methods in your custom component that don't do anything. Alternatively, you can also implement `pipe` in your subclass, but if you're not using the sentencizer punctuation or methods, it'd be cleaner for your class to be separate. – aab Nov 30 '20 at 11:56
  • Thanks @aab for your comment. I am aware of these problems now. Can you give me an example of what a dummy to/from_bytes/disk implementation that does nothing would look like in my component? Or, more interestingly, how to write a `pipe` method in my subclass for my case? – ChrisDelClea Dec 03 '20 at 22:21
  • Look at the `DummyTokenizer` in `spacy.util` as an example. – aab Dec 04 '20 at 07:59
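
The approach suggested in the comments above (a separate component with do-nothing serialization methods plus its own `pipe`) could look roughly like the sketch below. The class name `StandaloneSentenceSplitter`, its `name` value, and the "new sentence after a period" rule are illustrative assumptions, not part of spaCy; the four no-op methods are modelled on `DummyTokenizer` in `spacy.util`. The stand-in tokens at the bottom are there only so the sketch runs without a model; in a pipeline the component would receive a real `Doc`.

```python
from types import SimpleNamespace


class StandaloneSentenceSplitter:
    """Hypothetical component that does not inherit from Sentencizer."""

    name = "standalone_sentence_splitter"  # illustrative name (assumption)

    def __call__(self, doc):
        # Toy rule (assumption): a token starts a sentence if it is the
        # first token or directly follows a period.
        prev_text = None
        for tok in doc:
            tok.is_sent_start = prev_text is None or prev_text == "."
            prev_text = tok.text
        return doc

    def pipe(self, stream, batch_size=128, **kwargs):
        # Apply the same rule to every doc in a stream, so that batch
        # processing and single calls segment identically.
        for doc in stream:
            yield self(doc)

    # No-op serialization: there is no state to save or restore.
    def to_bytes(self, **kwargs):
        return b""

    def from_bytes(self, _bytes_data, **kwargs):
        return self

    def to_disk(self, _path, **kwargs):
        return None

    def from_disk(self, _path, **kwargs):
        return self


# Stand-in tokens (not spaCy objects) to exercise the rule.
fake_doc = [SimpleNamespace(text=t, is_sent_start=None)
            for t in ["Hi", ".", "Bye", "."]]
splitter = StandaloneSentenceSplitter()
starts = [tok.is_sent_start for tok in splitter(fake_doc)]
```

Because the component keeps no state, saving it writes nothing; the trade-off is that anything configurable (such as a custom punctuation set) would have to be recreated in code when the pipeline is rebuilt.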

1 Answer

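The following should work (note the call to super(MySentencizer, self).__init__(), which sets the attributes, such as punct_chars, that the base class reads when the pipeline is saved):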

import spacy
from spacy.pipeline import Sentencizer

class MySentencizer(Sentencizer):
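    def __init__(self):
        # run Sentencizer.__init__ so that punct_chars is set
        super(MySentencizer, self).__init__()

    def __call__(self, *args, **kwargs):
        doc = args[0]
        for i, tok in enumerate(doc):
            # a token starts a sentence if it is first or follows a period
            # (compare tok.text, a string; tok.orth is an integer hash)
            tok.is_sent_start = i == 0 or doc[i - 1].text == "."
        return doc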

nlp = spacy.load("en_core_web_md")
sentencizer = MySentencizer()
nlp.add_pipe(sentencizer, before="parser")

nlp.to_disk("model")
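
The reason the original version fails can be shown without spaCy: attributes assigned in a base class's `__init__` exist on an instance only if that `__init__` actually runs. The sketch below uses hypothetical stand-in classes (`Base`, `Broken`, and `Fixed` are not spaCy classes, and the serialization format is made up) to reproduce the same kind of `AttributeError`.

```python
class Base:
    """Stand-in for a component whose serializer reads self.punct_chars."""

    def __init__(self):
        self.punct_chars = {".", "!", "?"}

    def to_bytes(self):
        # Made-up format: sorted punctuation joined by commas.
        return ",".join(sorted(self.punct_chars)).encode("utf8")


class Broken(Base):
    def __init__(self):
        # Base.__init__ never runs, so punct_chars is never set.
        self.tok = None


class Fixed(Base):
    def __init__(self):
        super().__init__()  # sets punct_chars before anything else
        self.tok = None


try:
    Broken().to_bytes()
    error = None
except AttributeError as exc:
    error = str(exc)

fixed_bytes = Fixed().to_bytes()
```

Methods are inherited through the class, but instance attributes are created only when the code that assigns them executes, which is why overriding `__init__` without calling the parent leaves `punct_chars` undefined.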
Sergey Bushmanov