
Problem?

Given the input sentence:
Hello, Mr. Anderson.

The default sentence tokenizer (punkt, with the nltk sample pickle) turns this into:
Sentence 1: Hello, Mr
Sentence 2: Anderson

It should really remain a single sentence, as it was.

Why is an existing solution so hard to find?

Is there a general solution to this? This seems to be a common problem, as it is even mentioned in the nltk Python tutorial book:

Sentence segmentation is difficult because period is used to mark abbreviations, and some periods simultaneously mark an abbreviation and terminate a sentence, as often happens with acronyms like U.S.A.

However, all of the solutions I've seen are based either on manually entering abbreviations, such as this one, or on training a new pickle, because my searches do not turn up any that others have trained (the first rule of sentence boundary disambiguation...).

Manually building a list of English abbreviations is a daunting task, and I haven't found any clear documentation for such a list in nltk.
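For reference, the manual-list approach can also be applied as a post-processing step on whatever the tokenizer returns. The sketch below is only an illustration: `ABBREVIATIONS` and `merge_abbreviation_splits` are hypothetical names, the list is a tiny hand-written sample, and the fragments are assumed to keep their trailing periods.

```python
# Hypothetical, hand-maintained sample; a real list would be much longer.
ABBREVIATIONS = {"mr", "mrs", "ms", "dr", "prof"}


def merge_abbreviation_splits(fragments, abbreviations=ABBREVIATIONS):
    """Rejoin fragments that were split right after a listed abbreviation."""
    merged = []
    for fragment in fragments:
        if merged:
            words = merged[-1].split()
            last = words[-1] if words else ""
            # "Mr." -> "mr": the previous fragment ended on an abbreviation,
            # so the split was probably spurious and we undo it.
            if last.endswith(".") and last[:-1].lower() in abbreviations:
                merged[-1] = merged[-1] + " " + fragment
                continue
        merged.append(fragment)
    return merged


print(merge_abbreviation_splits(["Hello, Mr.", "Anderson."]))
```

This wrongly merges sentences that genuinely end in a listed abbreviation, which is exactly the ambiguity the tutorial book describes. NLTK's punkt parameters also expose an `abbrev_types` set, so the same list could be supplied to the tokenizer instead of patching its output.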

My current approach: I'm trying to write a web scraper to use this list. And I hate it. The list is far from complete; the best I can really hope for is to expand the scraper to combine several such lists. I'll then use the list to form hypothetical abbreviation expansions and see if they make sense... who am I kidding? I'll probably go back to bed.

Alter
  • This is a common question in the `nltk` tag: "I have a general-purpose statistical tool, but it makes mistakes. Including some cases that seem obvious to me. What should I do?" Answer: First *evaluate* (measure) the tool you have. Is it really a problem? Don't be put off by one ridiculous failure. If performance is unacceptably low: Choose a better tool (not always possible), build or train a better tool (hard), or pre- or post-process the results of the one you have. Again, choose by **evaluating** the performance of the result. – alexis Jun 30 '16 at 10:34
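One possible way to do the measuring that this comment recommends is to compare predicted sentence boundaries against a hand-segmented sample. This is a rough sketch, not an established metric implementation: `boundary_offsets` and `boundary_scores` are hypothetical helpers, and both segmentations are assumed to cover exactly the same text, punctuation included.

```python
def boundary_offsets(sentences):
    """Offsets (counted in non-whitespace characters) where sentences end."""
    offsets, position = set(), 0
    for sentence in sentences:
        # Ignore whitespace so differences in spacing do not shift offsets.
        position += len("".join(sentence.split()))
        offsets.add(position)
    return offsets


def boundary_scores(predicted, gold):
    """Return (precision, recall) of predicted sentence boundaries."""
    pred, ref = boundary_offsets(predicted), boundary_offsets(gold)
    true_positives = len(pred & ref)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    return precision, recall


# The tokenizer found the real boundary plus one spurious split after "Mr."
print(boundary_scores(["Hello, Mr.", "Anderson."], ["Hello, Mr. Anderson."]))
```

Run over a few hundred sentences from your own corpus, numbers like these show whether the `Mr.` failure is an isolated oddity or a pattern worth fixing.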

1 Answer


You can take a reactionary, exclusionary tack: anything that is at least two letters and not a legal word, followed by a period, must be an abbreviation. Would your text corpus allow this? The drawbacks are typos, other misspellings, gratuitous use of SMS, slang, or other words not "officially" recognized, and those cases where an abbreviation is also a legal word (such as "Ed. note").
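As a minimal sketch of that heuristic, assuming a set of "legal" words is available: `KNOWN_WORDS` below is a tiny hypothetical stand-in for a real dictionary, and `guess_abbreviations` is an illustrative name rather than anything provided by nltk.

```python
import re

# Hypothetical stand-in for a real dictionary of legal words.
KNOWN_WORDS = {"hello", "anderson", "note", "ed"}

# A run of at least two letters immediately followed by a period.
TOKEN_BEFORE_PERIOD = re.compile(r"\b([A-Za-z]{2,})\.")


def guess_abbreviations(text, known_words=KNOWN_WORDS):
    """Flag period-terminated tokens that are not in the word list."""
    return {
        match.group(1).lower()
        for match in TOKEN_BEFORE_PERIOD.finditer(text)
        if match.group(1).lower() not in known_words
    }


print(guess_abbreviations("Hello, Mr. Anderson."))
```

The result depends entirely on the word list. With a dictionary that lacks proper names, `guess_abbreviations("Hello, Mr. Anderson.", known_words={"hello"})` also flags `anderson`, and because `ed` is a legal word, "Ed. note" is never flagged.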

If you want a comprehensive solution ... well, would a machine learning model be useful within your application? Feed it examples, let it learn which items in "period attire" [are | are not] abbreviations, and incorporate that into your chosen sentence slicer.

Prune
  • You're guesstimating the reinvention of a well-known wheel. The `punkt` tokenizer (which comes with the nltk) uses an unsupervised learning algorithm to detect sentence boundaries, and supervised approaches are a dime a dozen. – alexis Jun 30 '16 at 10:31
  • "anything that is at least two letters and not a legal word, followed by a period, must be an abbreviation" Besides the drawbacks you mentioned, `Anderson` in `Hello, Mr. Anderson.` above is not an abbreviation. – Greg Jul 19 '16 at 16:05