Problem
Given the input sentence:
Hello, Mr. Anderson.
The default sentence tokenizer (Punkt, via the pre-trained pickle that ships with NLTK) turns this into:
Sentence 1: Hello, Mr
Sentence 2: Anderson
It should really remain a single sentence.
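For reference, here is a minimal reproduction (this is the behavior of the stock pickle described above; newer NLTK versions or models may already handle "Mr." correctly):

```python
import nltk

# nltk.download("punkt")  # fetch the default Punkt model if not already present

text = "Hello, Mr. Anderson."
print(nltk.sent_tokenize(text))
# With the stock pickle this yields:
# ['Hello, Mr.', 'Anderson.']
```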
Why is an existing solution so hard to find?
Is there a general solution to this? It seems to be a common problem; it is even mentioned in the NLTK book (Natural Language Processing with Python):
Sentence segmentation is difficult because a period is used to mark abbreviations, and some periods simultaneously mark an abbreviation and terminate a sentence, as often happens with acronyms like U.S.A.
However, all of the solutions I've seen are based either on manually entering abbreviations, such as this one, or on training a new pickle, because my searches turn up no models that others have trained (the first rule of sentence boundary disambiguation...).
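For concreteness, the manual-abbreviation workaround looks roughly like this (a sketch; the abbreviation set shown is just an illustrative sample, which is exactly the problem):

```python
from nltk.tokenize.punkt import PunktParameters, PunktSentenceTokenizer

# Seed Punkt with known abbreviations (stored lowercase, no trailing period).
punkt_params = PunktParameters()
punkt_params.abbrev_types = {"mr", "mrs", "dr", "prof", "inc", "i.e", "e.g"}

tokenizer = PunktSentenceTokenizer(punkt_params)
print(tokenizer.tokenize("Hello, Mr. Anderson."))
# ['Hello, Mr. Anderson.']
```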
Manually building a list of English abbreviations is a bit of a daunting task, and I haven't found any clear documentation of such a list in NLTK.
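The other route, training a new pickle, is at least mechanical, since Punkt learns abbreviations unsupervised from raw text. A sketch, using a Gutenberg text bundled with NLTK purely as a placeholder corpus (any large, domain-appropriate plain text would stand in for it):

```python
from nltk.corpus import gutenberg  # nltk.download("gutenberg") if needed
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer

trainer = PunktTrainer()
trainer.INCLUDE_ALL_COLLOCS = True
trainer.train(gutenberg.raw("austen-emma.txt"))  # any large raw text works

tokenizer = PunktSentenceTokenizer(trainer.get_params())
print(sorted(trainer.get_params().abbrev_types))  # abbreviations it learned
print(tokenizer.tokenize("Hello, Mr. Anderson."))
```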
My current approach: I'm trying to write a web scraper to use this list. And I hate it. The list is far from complete; the best I can really hope for is to extend the scraper to combine several such lists. I'll then use the merged list to form hypothetical abbreviation expansions and check whether they make sense... who am I kidding? I'll probably go back to bed.
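If the scraper does pan out, the glue code is at least short. A sketch, with hypothetical file names standing in for the scraped lists (one abbreviation per line):

```python
from nltk.tokenize.punkt import PunktParameters, PunktSentenceTokenizer

# Hypothetical outputs of the scraper, one abbreviation per line.
list_files = ["abbrevs_wiktionary.txt", "abbrevs_other.txt"]

abbrevs = set()
for path in list_files:
    with open(path, encoding="utf-8") as f:
        # Punkt stores abbreviations lowercase with no trailing period.
        abbrevs.update(line.strip().rstrip(".").lower()
                       for line in f if line.strip())

params = PunktParameters()
params.abbrev_types = abbrevs
tokenizer = PunktSentenceTokenizer(params)
```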