I am using OpenNLP (Java) to split strings into tokens. However, I find that round brackets are not tokenized properly.
The code I am using:
```java
// Load the English tokenizer model and tokenize a string
InputStream is = new FileInputStream("en-token.bin");
TokenizerModel model = new TokenizerModel(is);
Tokenizer tokenizer = new TokenizerME(model);
String[] tokens = tokenizer.tokenize("the string");
```
For example, for the string "people like me (are) turning off the news",
the output is:
people
like
me
(are
)
turning
off
the
news
The left round bracket before "are" is not split off as its own token, although the right one is. Similarly, "401(k)" is converted to "401(k" and ")".
I also tried the "SimpleTokenizer" class. It separates the brackets, but it also splits "front-page" into "front" and "page", which is not what I want.
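To make the behaviour I am after concrete, here is a rough sketch of a post-processing step I could run over the token array by hand. It is not part of OpenNLP; the class name `BracketSplitter` and the method `splitBrackets` are hypothetical names of my own, and it assumes that every round bracket should become its own token while hyphenated words stay intact.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not an OpenNLP class: re-splits tokens around round brackets.
public class BracketSplitter {

    // Assumption: every '(' and ')' becomes a separate token;
    // all other characters, including '-', stay together.
    public static List<String> splitBrackets(String[] tokens) {
        List<String> result = new ArrayList<>();
        for (String token : tokens) {
            StringBuilder current = new StringBuilder();
            for (char c : token.toCharArray()) {
                if (c == '(' || c == ')') {
                    // Flush the text collected so far, then emit the bracket itself
                    if (current.length() > 0) {
                        result.add(current.toString());
                        current.setLength(0);
                    }
                    result.add(String.valueOf(c));
                } else {
                    current.append(c);
                }
            }
            // Flush whatever remains after the last bracket
            if (current.length() > 0) {
                result.add(current.toString());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String[] tokens = {"me", "(are", ")", "front-page", "401(k", ")"};
        // prints [me, (, are, ), front-page, 401, (, k, )]
        System.out.println(splitBrackets(tokens));
    }
}
```

This also breaks "401(k)" into four tokens, which may not be desirable, and I would prefer a way to get consistent bracket handling from OpenNLP itself rather than patching the output afterwards.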
I am wondering if there is any solution?
Thanks.