There seem to be lots and lots of libraries out there that can find sentence boundaries.
The reason I need to find these is to chunk up longer texts so I can send them to language models.
This means that once I have my chunks made up of complete sentences, I will need to apply further tokenization (namely BPE) before I actually send the tokens to the models.
Most of these libraries (NNSplit, pySBD, BlingFire, etc.) seem to support both tasks, but I haven't found a way to perform them at the same time, i.e. to (roughly as in the sketch below):
- Output BPE tokens
- Grouped by inferred sentence boundaries
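
To make it concrete, something like this is the kind of output I'm after. This is just a sketch assuming pySBD for the sentence boundaries and a HuggingFace tokenizer for the subword step; the model name is only an example, and the actual subword scheme (BPE vs. SentencePiece) depends on the model:

```python
import pysbd
from transformers import AutoTokenizer

text = "Dr. Smith went to Washington. He arrived on Tuesday."

# pySBD infers the sentence boundaries; the model's tokenizer does the subword step.
segmenter = pysbd.Segmenter(language="en", clean=False)
tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-de")  # example model only

# One list of subword tokens per inferred sentence.
tokens_by_sentence = [tokenizer.tokenize(s) for s in segmenter.segment(text)]
print(tokens_by_sentence)
```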
I want to use some translation models on HuggingFace, which means I can't break the texts up in the middle of sentences, as that results in garbled translations. How does Google Translate deal with this? It performs quite well on texts of arbitrary length, which suggests it chunks the text up somehow. It also handles erroneous punctuation quite well.
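
For reference, this is roughly how I'm chunking at the moment: whole sentences are packed into a chunk until the next one would push it over a token budget. Again just a sketch; pysbd, the opus-mt model and the 512-token budget are placeholders, not something I'm committed to:

```python
import pysbd
from transformers import AutoTokenizer

segmenter = pysbd.Segmenter(language="en", clean=False)
tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-de")  # example model only

def chunk_by_sentence(text, max_tokens=512):
    """Group whole sentences into chunks whose token count stays under max_tokens."""
    chunks, current, current_len = [], [], 0
    for sent in segmenter.segment(text):
        n = len(tokenizer.tokenize(sent))
        # Start a new chunk if adding this sentence would exceed the budget.
        # (A single sentence longer than the budget still ends up as its own chunk.)
        if current and current_len + n > max_tokens:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sent)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```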