I came across the class TokenizerME on the OpenNLP documentation page (http://opennlp.apache.org/documentation/manual/opennlp.html). I don't understand how it calculates the token probabilities. I tested it with different inputs, but I still can't make sense of the results. Can someone help me understand the algorithm behind it? I wrote this sample code:
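    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import opennlp.tools.tokenize.TokenizerME;
    import opennlp.tools.tokenize.TokenizerModel;

    public void tokenizerDemo() {
        // try-with-resources closes the model stream even if loading fails
        try (InputStream modelIn = new FileInputStream("en-token.bin")) {
            TokenizerModel model = new TokenizerModel(modelIn);
            // Declared as TokenizerME so getTokenProbabilities() needs no cast
            TokenizerME tokenizer = new TokenizerME(model);

            String[] tokens = tokenizer.tokenize("This is is book");
            for (String t : tokens) {
                System.out.println("Token : " + t);
            }

            // Probabilities belong to the most recent tokenize() call
            double[] tokenProbs = tokenizer.getTokenProbabilities();
            for (double tP : tokenProbs) {
                System.out.println("Token Prob : " + tP);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }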
I got this output:
Token : This
Token : is
Token : is
Token : book
Token Prob : 1.0
Token Prob : 1.0
Token Prob : 1.0
Token Prob : 1.0
I expected the token "is" to be counted twice, so its probability should have been slightly higher than that of the other tokens. Instead, every probability is 1.0, which confuses me.
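To make my expectation concrete, here is a minimal sketch of the number I had in mind: a relative frequency, i.e. how often each token occurs divided by the total number of tokens. This is plain Java and does not use OpenNLP at all. The class name `TokenFrequencyDemo` and the helper `relativeFrequencies` are names I made up for illustration, and the whitespace split is only a stand-in for the real tokenizer.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class TokenFrequencyDemo {

    // Hypothetical helper (not part of OpenNLP): maps each distinct token
    // to (number of occurrences) / (total number of tokens).
    static Map<String, Double> relativeFrequencies(String[] tokens) {
        // LinkedHashMap keeps tokens in order of first appearance
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String t : tokens) {
            counts.merge(t, 1, Integer::sum);
        }

        Map<String, Double> freqs = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            freqs.put(e.getKey(), e.getValue() / (double) tokens.length);
        }
        return freqs;
    }

    public static void main(String[] args) {
        // Assumption: a whitespace split stands in for the tokenizer output
        String[] tokens = "This is is book".split("\\s+");
        relativeFrequencies(tokens).forEach(
            (token, freq) -> System.out.println(token + " : " + freq));
    }
}
```

For the sentence above this gives 0.5 for "is" and 0.25 for each of "This" and "book", which is clearly not what getTokenProbabilities() returns.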