
I'm using the BreakIterator class in Java to break a paragraph into sentences. This is my code:

public Map<String, Double> breakSentence(String document) {
    sentences = new HashMap<String, Double>();
    BreakIterator bi = BreakIterator.getSentenceInstance(Locale.US);
    bi.setText(document);

    Double tfIdf = 0.0;
    int start = bi.first();
    for(int end = bi.next(); end != BreakIterator.DONE; start = end, end = bi.next()) {
        String sentence = document.substring(start, end);

        sentences.put(sentence, tfIdf);
    }

    return sentences;
}

The problem is when the paragraph contains titles or numbers, for example:

"Prof. Roberts trying to solve a problem by writing a 1.200 lines of code."

What my code produces is:

sentences :
Prof
Roberts trying to solve a problem by writing a 1
200 lines of code

I get three sentences instead of a single one, because of the periods in the title and the number.

Is there a way to handle titles and numbers correctly in Java?

james.garriss
caesardo
  • I'm puzzled... This line from the documentation suggests there _should_ be a way to do it: "_Sentence boundary analysis allows selection with correct interpretation of periods within numbers and abbreviations, and trailing punctuation marks such as quotation marks and parentheses..._" That said, I've never used a `BreakIterator`. – jahroy Jun 18 '13 at 02:14
  • 1
    The 1.200 doesn't get split for me, though Prof. does get split. – user93353 Jun 18 '13 at 02:17
  • You may want to create your own method for that, and set up exceptions for "Prof.", "Mr.", "Mrs.", or any other variants you may run across in your input. – leigero Jun 18 '13 at 02:20
  • @user93353 - `Prof` will not get split if the next word does **not** begin with a capital letter... – jahroy Jun 18 '13 at 03:17
  • @leigero Can you give me an illustration of setting up a method to handle it? Will regex solve this problem? – caesardo Jun 18 '13 at 04:10
  • If you really want a robust solution, you should use a library. Here's [an answer](http://stackoverflow.com/a/4373687/778118) that suggests one. – jahroy Jun 19 '13 at 03:51
  • This problem is AI hard, so good luck. – Raedwald Jun 19 '13 at 07:19
  • While the `BreakIterator` has a lot going for it, it's not as smart as it should be. For example, it sees this as 2 sentences: "My favorite president is George W. Bush." It breaks on the period after the W. – james.garriss Jan 11 '16 at 16:56

2 Answers


This is a tricky situation, and the solution I came up with is a bit hacky, but it works. I'm new to Java myself, so if a seasoned veteran wants to edit this or comment on it to make it more professional, by all means, please do.

I basically added some checks to what you already have, to see whether a chunk contains words like Dr., Prof., Mr., Mrs., etc. If it does, the code skips that break (keeping the original start position) and looks for the next end, preferably one that doesn't come right after another Dr. or Mr.

I'm including my complete program so you can see it all:

import java.text.BreakIterator;
import java.util.*;

public class TestCode {

    private static final String[] ABBREVIATIONS = {
        "Dr." , "Prof." , "Mr." , "Mrs." , "Ms." , "Jr." , "Ph.D."
    };

    public static void main(String[] args) throws Exception {

        String text = "Prof. Roberts and Dr. Andrews trying to solve a " +
                      "problem by writing a 1.200 lines of code. This will " +
                      "work if Mr. Java writes solid code.";

        for (String s : breakSentence(text)) {
              System.out.println(s);
        }
    }

    public static List<String> breakSentence(String document) {

        List<String> sentenceList = new ArrayList<String>();
        BreakIterator bi = BreakIterator.getSentenceInstance(Locale.US);
        bi.setText(document);
        int start = bi.first();      // start of the current chunk
        int end = bi.next();         // end of the current chunk
        int tempStart = start;       // start of the sentence being assembled
        while (end != BreakIterator.DONE) {
            String sentence = document.substring(start, end);
            // A chunk containing an abbreviation is held back so that it is
            // merged with the chunk(s) that follow it.
            if (! hasAbbreviation(sentence)) {
                // Emit everything from the held-back start to this break.
                sentence = document.substring(tempStart, end);
                tempStart = end;
                sentenceList.add(sentence);
            }
            start = end;
            end = bi.next();
        }
        return sentenceList;
    }

    private static boolean hasAbbreviation(String sentence) {
        if (sentence == null || sentence.isEmpty()) {
            return false;
        }
        for (String w : ABBREVIATIONS) {
            if (sentence.contains(w)) {
                return true;
            }
        }
        return false;
    }
}

What this does is keep two starting points. The original start (the one you used) still does the same thing, but tempStart doesn't move until the string looks ready to be made into a sentence. It takes the first chunk:

"Prof."

and checks whether that break was caused by a weird word (i.e. does the chunk contain Prof., Dr., or something similar that might have caused the break). If it does, tempStart doesn't move; it stays where it is and waits for the next chunk. In my slightly more elaborate sentence, the next chunk also has a weird word messing up the breaks:

"Roberts and Dr."

It takes that chunk and, because it has a Dr. in it, continues on to the third chunk:

"Andrews trying to solve a problem by writing a 1.200 lines of code."

Once it reaches the third chunk, which contains no weird titles that could have caused a false break, it takes the substring from tempStart (which is still at the beginning) to the current end, joining all three parts together.

Now it sets tempStart to the current end and continues.

Like I said, this may not be a glamorous way to get what you want, but nobody else volunteered and it works (shrug).
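One possible refinement, offered as a rough sketch rather than a definitive fix: check whether the accumulated text ends with an abbreviation instead of whether it contains one, and flush whatever is left over at the end. That way a chunk that has "Dr." in the middle but ends at a real sentence boundary is not glued onto the next sentence. The class name `AbbreviationAwareSplitter`, the helper `endsWithAbbreviation`, and the abbreviation list are made up for illustration; only `BreakIterator` itself comes from the JDK.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class AbbreviationAwareSplitter {

    // Hypothetical list for illustration; extend it for your own input.
    private static final String[] ABBREVIATIONS = {
        "Dr.", "Prof.", "Mr.", "Mrs.", "Ms.", "Jr."
    };

    static List<String> breakSentence(String document) {
        List<String> sentences = new ArrayList<>();
        BreakIterator bi = BreakIterator.getSentenceInstance(Locale.US);
        bi.setText(document);

        int sentenceStart = bi.first();
        for (int end = bi.next(); end != BreakIterator.DONE; end = bi.next()) {
            String candidate = document.substring(sentenceStart, end);
            // Treat the break as false only when the text so far ENDS with
            // an abbreviation; otherwise accept it as a sentence boundary.
            if (!endsWithAbbreviation(candidate)) {
                sentences.add(candidate.trim());
                sentenceStart = end;
            }
        }

        // Keep any trailing text that was held back by the last check.
        String rest = document.substring(sentenceStart).trim();
        if (!rest.isEmpty()) {
            sentences.add(rest);
        }
        return sentences;
    }

    private static boolean endsWithAbbreviation(String chunk) {
        String trimmed = chunk.trim();
        for (String abbreviation : ABBREVIATIONS) {
            int idx = trimmed.length() - abbreviation.length();
            // Require a word boundary before the abbreviation.
            if (trimmed.endsWith(abbreviation)
                    && (idx == 0 || Character.isWhitespace(trimmed.charAt(idx - 1)))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String text = "Prof. Roberts met Dr. Andrews today. "
                    + "Mr. Java writes solid code.";
        for (String s : breakSentence(text)) {
            System.out.println(s);
        }
    }
}
```

This still merges two sentences when the first one genuinely ends with a listed abbreviation (for example "... Smith Jr."), so it remains a heuristic and not a general solution.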

jahroy
leigero
  • Thanks for your code. Can you elaborate on how it handles numbers with a period, like 1.200? – caesardo Jun 18 '13 at 09:38
  • 1
    The numbering pert is automatically handled by the BreakIterator. Oracle documentation says: "Sentence boundary analysis allows selection with correct interpretation of periods within numbers and abbreviations, and trailing punctuation marks such as quotation marks and parentheses." so I'm not sure why yours wasnt doing that properly in the first place. – leigero Jun 18 '13 at 17:01
  • You offered, so I edited your code to make it a little more _Java-esque_... Feel free to reject my edit if you want. – jahroy Jun 19 '13 at 02:12
  • Great! I see it's very much edited. I like seeing how other people fix code because I haven't learned from the 'industry', so it's nice to see how the real world organizes things. – leigero Jun 19 '13 at 02:40
  • Cool. At the moment I don't think it works in all scenarios. I believe it doesn't work if an abbreviation is followed by a word that does not begin with a capital letter... So, that's not good. I believe the original version had the same issue (before I edited it). – jahroy Jun 19 '13 at 02:43
  • Really, this is a very complex problem. I don't think there's a true solution that fits in the scope of a StackOverflow question/answer. I would immediately search for a library if confronted with this task. Here's [another question/answer](http://stackoverflow.com/q/4373612/778118) that suggests a specific library. – jahroy Jun 19 '13 at 03:49

It appears that Prof. Roberts only gets split if Roberts begins with a capital letter.

If Roberts begins with a lowercase r, it does not get split.

So... I guess that's how BreakIterator deals with periods.

I'm sure further reading of the documentation will explain how this behavior can be modified.
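To check the capital-letter observation on your own JDK, a small probe along these lines may help. It is only a sketch: the class name `BreakIteratorProbe` and the helper `chunks` are made up for illustration, and the output is deliberately not listed here because it can depend on the Java version and locale data.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class BreakIteratorProbe {

    // Returns the raw chunks produced by the sentence BreakIterator.
    static List<String> chunks(String text) {
        List<String> result = new ArrayList<>();
        BreakIterator bi = BreakIterator.getSentenceInstance(Locale.US);
        bi.setText(text);
        int start = bi.first();
        for (int end = bi.next(); end != BreakIterator.DONE; start = end, end = bi.next()) {
            result.add(text.substring(start, end));
        }
        return result;
    }

    public static void main(String[] args) {
        String[] samples = {
            "Prof. Roberts writes code.",   // abbreviation followed by a capital letter
            "Prof. roberts writes code."    // abbreviation followed by a lowercase letter
        };
        for (String sample : samples) {
            System.out.println(sample + " -> " + chunks(sample));
        }
    }
}
```

Comparing the two printed lists shows whether the break after "Prof." depends on the case of the following word.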

Community
jahroy