Given text without any punctuation or capitalization, I am looking for a way to split it into sentences. I need to be able to handle ten thousand words per second.
I have tried the following:
- https://huggingface.co/oliverguhr/fullstop-punctuation-multilang-large
- https://pypi.org/project/punctuator/
- https://pypi.org/project/distilbert-punctuator/
The last option was the fastest in my experiments, but still about 10 times slower than my requirement.
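For reference, a minimal sketch of how the throughput figure could be measured in words per second. The names `words_per_second` and `dummy_segment` are hypothetical and not part of any of the tools above; `dummy_segment` is only a stand-in to be replaced by a call into the model under test.

```python
import time
from typing import Callable, Iterable, List


def words_per_second(
    segment: Callable[[str], List[str]],
    texts: Iterable[str],
    repeats: int = 3,
) -> float:
    """Return the best observed throughput of `segment` in words per second."""
    texts = list(texts)
    total_words = sum(len(t.split()) for t in texts)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for t in texts:
            segment(t)
        # Keep the fastest run to reduce noise from warm-up and scheduling.
        best = min(best, time.perf_counter() - start)
    return total_words / best if best > 0 else float("inf")


def dummy_segment(text: str) -> List[str]:
    # Hypothetical placeholder: cuts every 12 words. Not a real model.
    words = text.split()
    return [" ".join(words[i:i + 12]) for i in range(0, len(words), 12)]


if __name__ == "__main__":
    sample = ["this is some unpunctuated text without capital letters " * 50] * 20
    rate = words_per_second(dummy_segment, sample)
    print(f"{rate:,.0f} words/second (target: 10,000)")
```

Timing a batch of realistic inputs rather than a single short string gives a number that is easier to compare against the 10,000 words per second target.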
The tools above are general-purpose punctuators, but I am only interested in sentence boundaries. I can also sacrifice some accuracy for speed. Are there other tools, or parameters of the tools above, that I should look at?
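To illustrate what "only sentence boundaries" means here, this is a rough sketch of reducing a general punctuator's output to sentences. It assumes the punctuator returns a plain string with `.`, `?` and `!` as sentence terminators; the function name `sentences_from_punctuated` is hypothetical, and splitting on every period would also break on abbreviations or decimals if the model emits them.

```python
import re
from typing import List

# Assumption: these three characters mark sentence ends in the model output.
_SENTENCE_END = re.compile(r"[.?!]+")
# Punctuation that is not a sentence boundary and can be dropped.
_OTHER_PUNCT = re.compile(r"[,;:]")


def sentences_from_punctuated(text: str) -> List[str]:
    """Split punctuated text into sentences and discard all other punctuation."""
    sentences = []
    for part in _SENTENCE_END.split(text):
        cleaned = _OTHER_PUNCT.sub(" ", part)
        cleaned = " ".join(cleaned.split())  # normalise whitespace
        if cleaned:
            sentences.append(cleaned)
    return sentences


if __name__ == "__main__":
    print(sentences_from_punctuated("hello, world. how are you? fine"))
```

Everything except the terminators is wasted work for this use case, which is why a model that predicts only a boundary/no-boundary label per token would be preferable.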
UPDATE: This model is the new winner. It is the fastest so far.