4

Summary: Unable to find the model file used by the lemmatizer (english-lemmatizer.bin)

Details: OpenNLP Tools Models appears to be a comprehensive repository for the various models used by the different components of the Apache OpenNLP library. However, I am unable to find the model file en-lemmatizer.bin, which is used with the lemmatizer. The Apache OpenNLP Developer Manual provides the following code snippet for the Lemmatization step:

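import java.io.FileInputStream;
import java.io.InputStream;

// Declaring the stream in the try-with-resources header closes it automatically.
try (InputStream dictLemmatizer = new FileInputStream("english-lemmatizer.bin")) {
    // load the lemmatizer from dictLemmatizer here
}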

However, unlike other model files, I am just not able to find the location of this model file. Any pointers would be appreciated.

Sandeep

2 Answers

9

The book "Natural Language Processing with Java Cookbook" by Richard M. Reese provides a good answer. For some reason, en-lemmatizer.bin is not available for direct download from the web, but it can be created using the following steps:

  1. Download and untar apache-opennlp-1.9.0-bin.tar (https://opennlp.apache.org/download.html)

  2. Go to the URL for the Lemmatizer Training File and save the text content as en-lemmatizer.dict

  3. Go to the bin directory (from step 1, after untarring) and execute the following command:

opennlp LemmatizerTrainerME -model en-lemmatizer.bin -lang en -data /path/to/en-lemmatizer.dict -encoding UTF-8
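
Before running the trainer, it can be worth confirming that the file saved in step 2 is plain text rather than, say, an HTML page. The sketch below assumes the training file has three tab-separated columns per line (word, POS tag, lemma); that layout is an assumption here, not something stated above, and LemmaDictCheck and countMalformedLines are hypothetical names chosen for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

// Sanity check for a lemmatizer training file before training.
// Assumption: every non-blank line is "word<TAB>POS tag<TAB>lemma".
class LemmaDictCheck {

    /** Counts non-blank lines that do not have exactly three tab-separated fields. */
    static int countMalformedLines(List<String> lines) {
        int malformed = 0;
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue; // blank lines are ignored by this check
            }
            // limit -1 keeps trailing empty fields, so "a\tb\tc\t" counts as four fields
            if (line.split("\t", -1).length != 3) {
                malformed++;
            }
        }
        return malformed;
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the path to the saved file, for example /path/to/en-lemmatizer.dict
        List<String> lines = Files.readAllLines(Paths.get(args[0]), StandardCharsets.UTF_8);
        System.out.println("Lines read: " + lines.size());
        System.out.println("Malformed lines: " + countMalformedLines(lines));
    }
}
```

A large malformed count suggests the download was not saved as raw text, or that the file uses a different layout than the one assumed here.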


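Note: The training step can exhaust the JVM heap and fail with the following error. If it does, rerun it with a larger maximum heap (the JVM -Xmx option):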

Computing event counts... Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
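
To see how much heap a JVM actually gets for a given setting, a small check like the one below can help. It is only a sketch: HeapCheck is a hypothetical helper, it runs in its own JVM rather than inside the trainer, and how heap options reach the opennlp launcher script depends on that script, which is not covered here.

```java
// Reports the maximum heap the current JVM is allowed to use.
class HeapCheck {

    /** Maximum heap size in MiB, as reported by the running JVM. */
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap: " + maxHeapMiB() + " MiB");
    }
}
```

On Java 11 or later this can be run directly from source, for example with java -Xmx4g HeapCheck.java. The reported value may differ slightly from the requested size.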

Sandeep
0

You want en-lemmatizer.bin, not english-lemmatizer.txt.

geezer57