
I am using the NameFinder API example doc of OpenNLP. After initializing the Name Finder the documentation uses the following code for the input text:

for (String document[][] : documents) {

  for (String[] sentence : document) {
    Span nameSpans[] = nameFinder.find(sentence);
    // do something with the names
  }

  nameFinder.clearAdaptiveData();
}

However, when I bring this into Eclipse, the 'documents' (not 'document') variable gives me an error saying that the variable documents cannot be resolved. What is the documentation referring to with the 'documents' array variable? Do I need to initialize an array called 'documents' that holds txt files for this error to go away?

Thank you for your help.

Chris

1 Answer


The OpenNLP documentation states that the input text should be segmented into documents, sentences and tokens. The piece of code you provided illustrates how to deal with several documents.

If you have only one document, you don't need the outer for loop, just the inner one over the array of sentences, where each sentence is itself an array of tokens.
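To make the shape of 'documents' concrete, here is a minimal sketch with hand-built data. The class name, method names and sample sentences are made up for illustration, and the loop counts tokens instead of calling a real name finder, so it runs without OpenNLP:

```java
import java.util.Arrays;

// Illustrative only: hand-built data standing in for real segmented text.
class DocumentsShape {

    // documents[d][s][t] is token t of sentence s of document d.
    static String[][][] sampleDocuments() {
        return new String[][][] {
            {   // document 0
                {"Pierre", "Vinken", "is", "61", "years", "old", "."},
                {"He", "lives", "in", "Paris", "."}
            },
            {   // document 1
                {"Chris", "asked", "a", "question", "."}
            }
        };
    }

    // Same nesting as the loop in the question, but counting tokens
    // where the original calls nameFinder.find(sentence).
    static int countTokens(String[][][] documents) {
        int total = 0;
        for (String[][] document : documents) {
            for (String[] sentence : document) {
                total += sentence.length;
            }
            // With a real name finder, clearAdaptiveData() would go here.
        }
        return total;
    }

    public static void main(String[] args) {
        String[][][] documents = sampleDocuments();
        System.out.println("documents: " + documents.length);
        System.out.println("tokens: " + countTokens(documents));

        // A single document is just one element of the outer array,
        // so only the inner loop is needed.
        String[][] onlyDocument = documents[0];
        for (String[] sentence : onlyDocument) {
            System.out.println(Arrays.toString(sentence));
        }
    }
}
```

So 'documents' is not an array of txt files. It is the already segmented text, three levels deep.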

To create an array of sentences from a document, you can use the OpenNLP SentenceDetector, and for each sentence you can use the OpenNLP Tokenizer to get the array of tokens.

Your code will look like this:

// somehow get the contents from the txt file 
//      and populate a string called documentStr

String sentences[] = sentenceDetector.sentDetect(documentStr);
for (String sentence : sentences) {
    String tokens[] = tokenizer.tokenize(sentence);
    Span nameSpans[] = nameFinder.find(tokens);
    // do something with the names
    System.out.println("Found entity: " + Arrays.toString(Span.spansToStrings(nameSpans, tokens)));
}

You can learn how to use the SentenceDetector and the Tokenizer from the OpenNLP documentation.
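For the "somehow get the contents from the txt file" step, one possible approach with the standard library is sketched below. The class name, the readDocument helper and the assumption that the file is UTF-8 are mine, not something OpenNLP requires:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Illustrative helper: loads a whole text file into one String.
class ReadDocument {

    // Assumes the file is small enough to fit in memory and is UTF-8 encoded.
    static String readDocument(Path path) throws IOException {
        return new String(Files.readAllBytes(path), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java ReadDocument <path-to-txt-file>");
            return;
        }
        String documentStr = readDocument(Paths.get(args[0]));
        System.out.println(documentStr.length() + " characters read");
        // documentStr can now be passed to sentenceDetector.sentDetect(...)
    }
}
```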

wcolen
  • Thank you for your reply! I plugged that in but still get error: "Type mismatch: cannot convert from element type String to String[]" and the sentences variable is erroring on me on line 5: for(String[] sentence: sentences){ – Chris Apr 17 '12 at 02:30
  • Yes, there was an error. Just removed the [] from for (String sentence[] : sentences). Thank you. – wcolen Apr 17 '12 at 12:51
  • wcolen, thanks for all of your help. The only issue when I delete the array syntax [] is that the next line now complains, because the find method takes an array as its argument, so sentence doesn't work: Span nameSpans[] = nameFinder.find(sentence); – Chris Apr 17 '12 at 12:58
  • oops... sorry again. I see it now. The tokenization command is missing. I will fix it for you. – wcolen Apr 17 '12 at 13:25
  • I also improved the example output. The nameSpans are pointing to the start and end index of the tokens array, so we should use the method Span.spansToStrings to print the output. – wcolen Apr 17 '12 at 13:48
  • wcolen, are you getting a memory location output from your println? I am getting [Ljava.lang.String;@17b0b765 for my output from Span.spansToStrings. Thank you again for your time and knowledge! This is starting to make much more sense. – Chris Apr 17 '12 at 14:06
  • That is the problem with Java, it can never guess what you want to do :P The Span.spansToStrings returns an array, to print its contents you can use Arrays.toString – wcolen Apr 17 '12 at 14:10
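The last few comments can be reproduced without OpenNLP. In the sketch below, TokenSpan is a hypothetical stand-in for OpenNLP's Span (token indices, with the end index treated as exclusive), and spansToStrings is my own approximation of what Span.spansToStrings does, not the library's code:

```java
import java.util.Arrays;

class SpanPrinting {

    // Hypothetical stand-in for OpenNLP's Span: indices into the tokens
    // array, start inclusive and end exclusive.
    static final class TokenSpan {
        final int start;
        final int end;

        TokenSpan(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    // Approximation of Span.spansToStrings: joins the tokens covered
    // by each span with single spaces.
    static String[] spansToStrings(TokenSpan[] spans, String[] tokens) {
        String[] result = new String[spans.length];
        for (int i = 0; i < spans.length; i++) {
            String[] covered = Arrays.copyOfRange(tokens, spans[i].start, spans[i].end);
            result[i] = String.join(" ", covered);
        }
        return result;
    }

    public static void main(String[] args) {
        String[] tokens = {"Pierre", "Vinken", "met", "Chris", "."};
        TokenSpan[] nameSpans = {new TokenSpan(0, 2), new TokenSpan(3, 4)};
        String[] names = spansToStrings(nameSpans, tokens);

        // Concatenating an array prints its type and identity hash,
        // something like [Ljava.lang.String;@17b0b765
        System.out.println("Found entity: " + names);

        // Arrays.toString prints the elements:
        // Found entity: [Pierre Vinken, Chris]
        System.out.println("Found entity: " + Arrays.toString(names));
    }
}
```

The [Ljava.lang.String;@... output is not a memory dump of the names. It is the default Object.toString() of the array itself, which is why wrapping the result in Arrays.toString fixes it.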