
I am using OpenNLP (Java) to split strings into tokens. However, I find that round brackets are not tokenized properly.

The code I am using:

InputStream is = new FileInputStream("en-token.bin");
TokenizerModel model = new TokenizerModel(is);
Tokenizer tokenizer = new TokenizerME(model);
String[] tokens = tokenizer.tokenize("the string");

For example, given the string "people like me (are) turning off the news", the output is: people like me (are ) turning off the news

The left round bracket before "are" was not split off. Similarly, "401(k)" is converted to "401(k" and ")".

I also tried the "SimpleTokenizer" class. It separates brackets, but it also splits "front-page" into "front" and "page", which is not what I want.

Is there any solution?

Thanks.

Yao

1 Answer


Have a look at this article.

It addresses this problem under "Nonstandard sentence ends (parentheses)", which means some kind of preprocessing is required here.

The solution is given here.

What the author basically did is separate brackets and parentheses from the adjacent text by putting a space on either side, like this:

sent = untokenizedParenPattern1.matcher(sent).replaceAll("$1 $2");
sent = untokenizedParenPattern2.matcher(sent).replaceAll("$1 $2");

This is not the only way to put spaces on either side of the parentheses, but this kind of preprocessing helps you get the desired output.
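The two lines above use untokenizedParenPattern1 and untokenizedParenPattern2 without showing how they are defined. The following is a minimal sketch of what such a preprocessing step could look like. The class name ParenPreprocessor, the method padBrackets, and both regular expressions are assumptions made for illustration, not the patterns from the linked solution.

```java
import java.util.regex.Pattern;

// Hypothetical helper: pads brackets with spaces before tokenization.
final class ParenPreprocessor {

    // Assumed pattern 1: a non-whitespace character directly followed by a bracket.
    private static final Pattern BEFORE_BRACKET =
            Pattern.compile("(\\S)([()\\[\\]{}])");

    // Assumed pattern 2: a bracket directly followed by a non-whitespace character.
    private static final Pattern AFTER_BRACKET =
            Pattern.compile("([()\\[\\]{}])(\\S)");

    private ParenPreprocessor() {
    }

    // Inserts a space between each bracket and any text that touches it.
    static String padBrackets(String sent) {
        sent = BEFORE_BRACKET.matcher(sent).replaceAll("$1 $2");
        sent = AFTER_BRACKET.matcher(sent).replaceAll("$1 $2");
        return sent;
    }
}
```

The padded string can then be passed to tokenizer.tokenize(...) in place of the raw input. Note that this sketch also splits "401(k)" into "401 ( k )", so if such terms should stay whole, they would need to be excluded before padding.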

Do share whether this solves your problem. Hope this helps!

iamgr007