
I want to use OpenNLP to tokenize Thai text. I downloaded OpenNLP and the Thai tokenizer model and ran the following command:

./bin/opennlp POSTagger -lang th -model thai.tok.bin < sentence.txt > output.txt

I put the downloaded thai.tok.bin in the directory I ran the command from. sentence.txt contains the text กินอะไรยังนาย. However, the only output I got was:

Usage: opennlp POSTagger model < sentences
Execution time: 0.000 seconds

I'm pretty new to OpenNLP. Please let me know if anyone knows how to get output from it.

titipata

1 Answer


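The models from your link are in an outdated format, so you first need a few manual steps to convert them. Note also that, as the usage message shows, the model path is passed as a plain argument, and a tokenizer model has to be run with TokenizerME rather than POSTagger.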

  1. Download the file thai.tok.bin.gz and extract it to an empty folder. Rename the extracted file thai.tok.bin to token.model
  2. In the same folder, create a file named manifest.properties with the following contents:

    Manifest-Version=1.0
    Language=th
    OpenNLP-Version=1.5.0
    Component-Name=TokenizerME
    useAlphaNumericOptimization=false
    
  3. Now zip the two files. On Linux you can use this command: zip thai.tok.bin token.model manifest.properties

  4. Try your model:

    sh bin/opennlp TokenizerME ~/Downloads/thai-token.bin/thai.tok.bin < thai_sentence.txt

    Loading Tokenizer model ... done (0,097s)
    กินอะไร ยังนาย

    Average: 333,3 sent/s
    Total: 1 sent
    Runtime: 0.003s
    Execution time: 0,108 seconds
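
Steps 1 to 3 can also be scripted, which may help where the zip command is not available. The following is a minimal Python sketch using only the standard library. It is not part of OpenNLP: the function name `package_model` and its parameters are hypothetical, and the manifest fields are simply the ones listed in step 2.

```python
import gzip
import zipfile
from pathlib import Path


def package_model(gz_path, entry_name, component, out_path, extra=None):
    """Wrap an old gzipped model in a zip archive with a manifest.

    Hypothetical helper that mirrors the manual steps (extract, rename,
    write manifest.properties, zip). It does not validate the model data.
    """
    # Step 1: decompress the .gz file in memory.
    with gzip.open(gz_path, "rb") as f:
        model_bytes = f.read()

    # Step 2: build manifest.properties from the fields given in the answer.
    fields = {
        "Manifest-Version": "1.0",
        "Language": "th",
        "OpenNLP-Version": "1.5.0",
        "Component-Name": component,
    }
    fields.update(extra or {})
    manifest = "".join(f"{key}={value}\n" for key, value in fields.items())

    # Step 3: store the renamed model and the manifest in one zip archive.
    with zipfile.ZipFile(out_path, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr(entry_name, model_bytes)
        zf.writestr("manifest.properties", manifest)
    return Path(out_path)
```

For the tokenizer this would be called as `package_model("thai.tok.bin.gz", "token.model", "TokenizerME", "thai.tok.bin", extra={"useAlphaNumericOptimization": "false"})`.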
    

Now that you have the updated tokenizer, you can do the same with the POS Tagger model.

  1. Download the file thai.tag.bin.gz and extract it to an empty folder. Rename the extracted file thai.tag.bin to pos.model

  2. In the same folder, create a file named manifest.properties with the following contents:

    Manifest-Version=1.0
    Language=th
    OpenNLP-Version=1.5.0
    Component-Name=POSTaggerME
    
  3. Now zip the two files. On Linux you can use this command: zip thai.pos.bin pos.model manifest.properties
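
Before running the tagger, it can help to confirm that the archive really contains a manifest with the expected fields. This is a small Python sketch using only the standard library; `read_manifest` is a hypothetical helper for illustration, not an OpenNLP API.

```python
import zipfile


def read_manifest(archive_path):
    """Return manifest.properties from a packaged model as a dict.

    Hypothetical helper; raises KeyError if the archive has no manifest.
    """
    with zipfile.ZipFile(archive_path) as zf:
        text = zf.read("manifest.properties").decode("utf-8")

    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props
```

If the archive was built as described in the steps, `read_manifest("thai.pos.bin")["Component-Name"]` should give `POSTaggerME`.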

Finally, we can try the two models combined:

sh bin/opennlp TokenizerME ~/Downloads/thai-token.bin/thai.tok.bin < thai_sentence.txt > thai_tokens.txt
sh bin/opennlp POSTagger ~/Downloads/pt-pos-maxent/thai.pos.bin < thai_tokens.txt

The result is:

กินอะไร_VACT ยังนาย_NCMN

Please let me know if this is the expected result.

wcolen

This is great and very helpful! Thanks @wcolen! I tried `thai.tok.bin` and it worked. I suppose the `pos.model` will work too. – titipata Apr 28 '17 at 19:33