
Given text without any punctuation or capitalization, I am looking for a way to split it into sentences. I need to be able to handle ten thousand words per second.

I have tried the following:

The last option was the fastest in my experiments, but still about 10 times slower than my requirement.

The above are general punctuators and I am interested only in sentence boundaries. In addition, I can sacrifice some accuracy for speed. Are there other tools or parameters in the above tools that I could look at?
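For comparing candidates against the 10,000 words-per-second target, a small timing harness along these lines may help. This is only a minimal sketch: `words_per_second` and `dummy_splitter` are hypothetical names introduced here for illustration, and `dummy_splitter` is a placeholder that cuts every few words, not one of the tools mentioned in this question. Any real model would be wrapped in a function with the same signature (a string in, a list of sentence strings out).

```python
import time
from typing import Callable, List


def words_per_second(split: Callable[[str], List[str]], text: str, repeats: int = 5) -> float:
    """Return the best observed throughput of `split` on `text`, in words per second."""
    n_words = len(text.split())
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        split(text)
        # Keep the fastest run to reduce noise from other processes.
        best = min(best, time.perf_counter() - start)
    return n_words / best if best > 0 else float("inf")


def dummy_splitter(text: str, every: int = 12) -> List[str]:
    """Hypothetical stand-in for a real model: cuts a 'sentence' every `every` words."""
    words = text.split()
    return [" ".join(words[i:i + every]) for i in range(0, len(words), every)]


if __name__ == "__main__":
    # Unpunctuated, lowercase input, as described in the question.
    sample = "this is some text without punctuation or capitalization " * 2000
    rate = words_per_second(dummy_splitter, sample)
    print(f"{rate:,.0f} words/s (target: 10,000)")
```

Taking the best of several runs, rather than the mean, is a design choice that assumes the interest is in the achievable speed of the splitter itself and not in system noise; model loading time should be kept outside the timed call.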

UPDATE: This model is the new winner. It is the fastest so far.

AlwaysLearning
  • To clarify, is your problem *only* one of splitting text into sentences? Or are you attempting to split text into sentences in order to then properly punctuate those sentences? I am trying to establish whether it's possible to do away with the punctuation task. If so, that would open up options for potentially faster systems/models that focus only on "sentence splitting". – Kyle F Hartzenberg Aug 27 '23 at 01:51
  • @KyleFHartzenberg That is correct. I am not interested in punctuation, but rather only in sentence boundaries. I edited both the title and the question to reflect this more clearly. – AlwaysLearning Aug 27 '23 at 06:44
  • I thought that might offer some options, but the number of systems designed specifically for this task when there is no punctuation *and* no capitalisation is slim. [The work](https://aclanthology.org/2023.acl-long.398/) by Minixhofer et al. (2023), available on GitHub [here](https://github.com/bminixhofer/wtpsplit), seems to be the closest recent attempt at this problem. However, the system was not trained on data with varied capitalisation (see [this discussion](https://github.com/bminixhofer/wtpsplit/issues/101) on an idea for retraining). – Kyle F Hartzenberg Aug 27 '23 at 07:48
  • Other keywords/terms to use in your search for existing models/systems that may be of use are "sentence boundary disambiguation" and "sentence segmentation". – Kyle F Hartzenberg Aug 27 '23 at 07:54
  • @KyleFHartzenberg Minixhofer et al. attempt to do this in an unsupervised manner. I don't have any problem if the model is trained using a fully punctuated dataset. Hence, I looked into their related work section. They imply that spaCy is capable of the task! Trying it: https://spacy.io/usage/linguistic-features#sbd-parser – AlwaysLearning Aug 27 '23 at 09:37
  • @KyleFHartzenberg spaCy's model seems to be trained on punctuated text! I don't get how they can write in the documentation: "The recall for the senter is typically slightly lower than for the parser, which is better at predicting sentence boundaries **when punctuation is not present**." Without punctuation, it does not work at all... – AlwaysLearning Aug 27 '23 at 09:54
