Using a Java BreakIterator, I am able to extract words from a string. However, in the following string, which uses parentheses to indicate that a word could be plural, each parenthesis is treated as its own word.

import java.text.BreakIterator;
import java.util.Locale;

String test = "Please enter the number of dependent(s).";

BreakIterator iterator = BreakIterator.getWordInstance(Locale.US);
iterator.setText(test);

int start = iterator.first();
for (int end = iterator.next(); end != BreakIterator.DONE; start = end, end = iterator.next()) {
    System.out.println(test.substring(start, end));
}

Outputs:

Please
 
enter
 
the
 
number
 
of
 
dependent
(
s
)
.

When I would expect:

Please
 
enter
 
the
 
number
 
of
 
dependent(s)
.

Is it possible to use a custom implementation of a break iterator so that a word with an "optional plural" is in fact treated as one word?

Dynamic
  • From the [Javadoc](https://docs.oracle.com/javase/8/docs/api/java/text/BreakIterator.html#word): "Word boundary analysis is used by search and replace functions, as well as within text editing applications that allow the user to select words with a double click. Word selection provides correct interpretation of punctuation marks within and following words. **Characters that are not part of a word, such as symbols or punctuation marks, have word-breaks on both sides**." If you want to change this behaviour, you will have to write your own wrapper to implement the check for optional plurals. – Kirit Feb 22 '22 at 02:34
  • I concur with Kirit. I have been looking into this for about 45 minutes, and that seems to be the only resolution. Alternatively, just use `String.split("\\s")` to split the string on whitespace. – hfontanez Feb 22 '22 at 03:21
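
The wrapper approach suggested in the comments could look roughly like the sketch below. It is not a custom `BreakIterator` subclass. It runs the standard word iterator and then re-joins any run of tokens shaped like `word`, `(`, `s`, `)`. The class name `OptionalPluralTokenizer`, the methods `tokenize` and `isLetters`, and the choice to accept only `s` and `es` as plural suffixes are all assumptions made for illustration, so adjust them to the text you actually process.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: wraps the standard word BreakIterator and merges
// "optional plural" patterns such as dependent(s) back into one token.
class OptionalPluralTokenizer {

    static List<String> tokenize(String text) {
        // Step 1: collect the raw tokens exactly as in the question's loop.
        BreakIterator iterator = BreakIterator.getWordInstance(Locale.US);
        iterator.setText(text);
        List<String> raw = new ArrayList<>();
        int start = iterator.first();
        for (int end = iterator.next(); end != BreakIterator.DONE;
                start = end, end = iterator.next()) {
            raw.add(text.substring(start, end));
        }

        // Step 2: merge the four-token run  word ( s )  into  word(s).
        List<String> merged = new ArrayList<>();
        for (int i = 0; i < raw.size(); i++) {
            boolean optionalPlural = i + 3 < raw.size()
                    && isLetters(raw.get(i))
                    && raw.get(i + 1).equals("(")
                    // Assumption: only "s" and "es" count as plural suffixes.
                    && (raw.get(i + 2).equals("s") || raw.get(i + 2).equals("es"))
                    && raw.get(i + 3).equals(")");
            if (optionalPlural) {
                merged.add(raw.get(i) + "(" + raw.get(i + 2) + ")");
                i += 3; // skip the three tokens that were just merged
            } else {
                merged.add(raw.get(i));
            }
        }
        return merged;
    }

    // True if the token is non-empty and consists only of letters.
    private static boolean isLetters(String token) {
        return !token.isEmpty() && token.chars().allMatch(Character::isLetter);
    }

    public static void main(String[] args) {
        for (String token : tokenize("Please enter the number of dependent(s).")) {
            System.out.println("[" + token + "]");
        }
    }
}
```

Restricting the suffix keeps something like `call(foo)` from being merged by accident. Note that the `String.split("\\s")` alternative keeps `dependent(s)` together but also leaves the trailing period attached to it, so it only fits if punctuation does not need to be a separate token.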
