Given a paragraph, I want to split it into sentences. At the moment I'm simply doing this:

var sentences = paragraph.split('.');

It works for the most part, but it starts failing when it's given a sentence like this:

Alaska is the largest state in the U.S.

Because "U.S." contains periods, "S" gets parsed out as a sentence of its own.

What's the best way to determine the sentences in a paragraph? I thought about parsing them out based on the last period before a capital letter, but if the paragraph isn't well typed (a lowercase letter after the period), that will fail too.
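
For reference, a minimal sketch of the "period before a capital letter" idea. The function name `splitBeforeCapital` is made up for illustration, and the rule only looks at periods, so it has exactly the weakness described above:

```javascript
// Hypothetical helper: break wherever a period is followed by
// whitespace and then an uppercase ASCII letter.
function splitBeforeCapital(paragraph) {
  var sentences = [];
  var boundary = /\.\s+(?=[A-Z])/g;
  var start = 0;
  var match;

  while ((match = boundary.exec(paragraph)) !== null) {
    // Keep the period with the sentence it closes.
    sentences.push(paragraph.slice(start, match.index + 1).trim());
    // Resume after the whitespace that followed the period.
    start = boundary.lastIndex;
  }

  // Whatever is left after the last boundary is the final sentence.
  var rest = paragraph.slice(start).trim();
  if (rest !== '') {
    sentences.push(rest);
  }
  return sentences;
}

splitBeforeCapital('Alaska is the largest state in the U.S. It is cold.');
// → ['Alaska is the largest state in the U.S.', 'It is cold.']

splitBeforeCapital('it is big. it is cold.');
// → ['it is big. it is cold.']  (no capital letter, so no split)
```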

hippietrail
  • Am I wrong, or are you less interested in JavaScript than in the theory of sentence detection? Then it's probably more a question for http://programmers.stackexchange.com/ – Denys Séguret May 26 '13 at 18:14
  • Ah, welcome to regex-problems. That said, why not: `split('.\s+')`? (Though I second dystroy's suggestion, regex parsing-of-language/grammar is awkward). – David Thomas May 26 '13 at 18:14
  • Don't forget that a sentence can end in something other than a dot! – Denys Séguret May 26 '13 at 18:16
  • If you want this algorithm to be accurate, you are asking for something that is very complicated. – mzedeler May 26 '13 at 18:16
  • @DavidThomas: What about *J. R. "Bob" Dobbs wants to sell you something.*? The `\s+` doesn't quite cut it. – mu is too short May 26 '13 at 18:19
  • @muistooshort, indeed. Sentence-parsing (given the various alternatives of sentence-demarcation, and punctuation-use mid-sentence) is hellish to work with reliably. And there will *always* be edge-cases unaccounted for. – David Thomas May 26 '13 at 18:23

1 Answer

I would first tokenize the paragraph into words by splitting on whitespace, then reassemble the sentences by looking for words that end in a period, question mark, or exclamation mark. If a word ends in a period, check whether it contains more than one period; if so, treat it as an abbreviation rather than the end of a sentence.
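
A rough sketch of that idea in JavaScript. The name `splitIntoSentences` is just for illustration, and it assumes words are separated by plain whitespace, so treat it as a starting point rather than a complete solution:

```javascript
// Hypothetical helper: tokenize on whitespace, then reassemble sentences.
function splitIntoSentences(paragraph) {
  var words = paragraph.trim().split(/\s+/);
  var sentences = [];
  var current = [];

  words.forEach(function (word) {
    if (word === '') { return; } // empty input yields a single empty token
    current.push(word);

    var last = word.charAt(word.length - 1);
    var periodCount = word.split('.').length - 1;

    // '?' and '!' always end a sentence. A trailing period ends one only
    // if it is the sole period in the word, so "U.S." counts as an
    // abbreviation and does not close the sentence.
    var endsSentence = last === '?' || last === '!' ||
      (last === '.' && periodCount === 1);

    if (endsSentence) {
      sentences.push(current.join(' '));
      current = [];
    }
  });

  // Flush leftover words, e.g. when the paragraph ends in an abbreviation.
  if (current.length > 0) {
    sentences.push(current.join(' '));
  }
  return sentences;
}

splitIntoSentences('Alaska is the largest state in the U.S.');
// → ['Alaska is the largest state in the U.S.']

splitIntoSentences('Is it big? Yes! It is.');
// → ['Is it big?', 'Yes!', 'It is.']
```

Note that when an abbreviation such as "U.S." really does end a sentence and more text follows, this rule merges the two sentences, and single-period abbreviations are still treated as sentence ends.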

Andrew - OpenGeoCode
  • It's still far from perfect though, any sentence with Dwight D. Eisenhower would be invalid. – nyson May 26 '13 at 18:26