I would like to split a string into sentences. This is not straightforward, because many "." characters do not mark the end of a sentence, so I am using a BreakIterator as follows:

import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public static List<String> textToSentences(String text) {
    BreakIterator iterator = BreakIterator.getSentenceInstance(Locale.US);
    iterator.setText(text);
    List<String> sentences = new ArrayList<>();
    int start = iterator.first();
    int ctr = 0;
    // next() returns the end offset of the next sentence, or DONE when the text is exhausted.
    for (int end = iterator.next(); end != BreakIterator.DONE; start = end, end = iterator.next()) {
        String oneSentence = text.substring(start, end);
        System.out.println(ctr + ": " + oneSentence); // debug output, shown below
        sentences.add(oneSentence);
        ctr += 1;
    }
    return sentences;
}

If I test this now on:

String text = "This is a test. This is test 2 ... This is test 3?  This is test 4!!! This is test 5!?  This is a T.L.A. test. Now with a Dr. in it. And so associate-professor Dr. Smith said that it was 567 B.C.. Hi there! There is one thing: go home!";

The result is:

0: This is a test. 
1: This is test 2 ... 
2: This is test 3?  
3: This is test 4!!! 
4: This is test 5!?  
5: This is a T.L.A. test. 
6: Now with a Dr. in it. 
7: And so associate-professor Dr. 
8: Smith said that it was 567 B.C.. 
9: Hi there! 
10: There is one thing: go home!

In sentence 6 it correctly ignores the "Dr.", but in sentence 7 it breaks after the "Dr." (7 and 8 should be one sentence). Why is this the case, and how do I fix it?

lordy
  • I don't think this is doable with `BreakIterator`. The previous sentence could have ended with "Dr", and a sentence could perfectly well start with "Smith". You need some NLP to decide where sentences end. – Sweeper Feb 21 '20 at 08:47
  • I think in this example your program gets confused by the capital `S` in `Smith`, and that is why it breaks there. In my opinion you could add a special case for that kind of sentence. – noname Feb 21 '20 at 08:48
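
The special-casing idea from the comments could be sketched roughly as follows. This is only a minimal sketch, not a full solution: it assumes that the break after "Dr." happens because the next word is capitalized (in the output above, "Dr. in" is kept together while "Dr. Smith" is split), and it simply glues a fragment onto the next one whenever the fragment ends in a known abbreviation. The class name `AbbreviationAwareSplitter` and the `ABBREVIATIONS` list are hypothetical and would need to be extended for real text.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class AbbreviationAwareSplitter {

    // Hypothetical, deliberately incomplete list of titles that rarely end a sentence.
    private static final Set<String> ABBREVIATIONS =
            new HashSet<>(Arrays.asList("Dr.", "Mr.", "Mrs.", "Ms.", "Prof."));

    static List<String> textToSentences(String text) {
        BreakIterator iterator = BreakIterator.getSentenceInstance(Locale.US);
        iterator.setText(text);

        List<String> sentences = new ArrayList<>();
        StringBuilder pending = new StringBuilder();

        int start = iterator.first();
        for (int end = iterator.next(); end != BreakIterator.DONE;
                start = end, end = iterator.next()) {
            pending.append(text, start, end);
            // Only emit the sentence if it does not end in a listed abbreviation;
            // otherwise keep it and append the next fragment to it.
            if (!endsWithAbbreviation(pending)) {
                sentences.add(pending.toString());
                pending.setLength(0);
            }
        }
        // Text that ends in an abbreviation is still returned as a final sentence.
        if (pending.length() > 0) {
            sentences.add(pending.toString());
        }
        return sentences;
    }

    // True if the last space-separated word of the fragment is a listed abbreviation.
    private static boolean endsWithAbbreviation(CharSequence fragment) {
        String trimmed = fragment.toString().trim();
        String lastWord = trimmed.substring(trimmed.lastIndexOf(' ') + 1);
        return ABBREVIATIONS.contains(lastWord);
    }
}
```

Calling `AbbreviationAwareSplitter.textToSentences(text)` on the test string should then keep "And so associate-professor Dr. Smith said that it was 567 B.C.." together. As Sweeper points out, this heuristic gets it wrong whenever a sentence really does end with "Dr.", so anything more reliable would need a proper NLP sentence detector.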
