Stanford POS Tagger not tagging Chinese text

Question

I'm using Stanford POS Tagger (for the first time) and while it tags English correctly, it does not seem to recognize (Simplified) Chinese even when changing the model parameter. Have I overlooked something?

I've downloaded and unpacked the latest full version from here: http://nlp.stanford.edu/software/tagger.shtml

Then I entered sample text into sample-input.txt:

这是一个测试的句子。这是另一个句子。

Then I simply ran:

./stanford-postagger.sh models/chinese-distsim.tagger sample-input.txt

I expected each word to be tagged with a part of speech, but instead the tagger treats the entire string of text as one word:

Loading default properties from tagger models/chinese-distsim.tagger

Reading POS tagger model from models/chinese-distsim.tagger ... done [3.5 sec].

這是一個測試的句子。這是另一個句子。#NR

Tagged 1 words at 30.30 words per second.

I appreciate any help.

Also, I have already checked that the file and settings are in UTF-8. I've also tried with different sample texts. — Ryan Rapp, Apr 18 '13 at 04:02

score 6 · Accepted Answer · answered Apr 18 '13 at 21:14

I finally realized that tokenization/segmentation is not included in this POS tagger. It appears the words must be space-delimited before they are fed to the tagger. For those interested in maximum entropy word segmentation of Chinese, there is a separate package available here:

http://nlp.stanford.edu/software/segmenter.shtml
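
To make "space-delimited" concrete, here is a minimal Python sketch of the input format the tagger seems to expect and of how the `word#TAG` output shown in the question could be read back. The function names are made up for illustration, and the word boundaries in the example are split by hand; they are not output from the Stanford segmenter, which would normally produce this step.

```python
def to_tagger_input(sentences):
    """Join each pre-segmented sentence (a list of words) with single
    spaces, one sentence per line, ready to save as a UTF-8 text file."""
    return "\n".join(" ".join(words) for words in sentences) + "\n"


def parse_tagged_line(line, sep="#"):
    """Split one line of tagger output ('word#TAG word#TAG ...') into
    (word, tag) pairs. The '#' separator is taken from the output in the
    question; a token without a separator gets a tag of None."""
    pairs = []
    for token in line.split():
        if sep in token:
            word, _, tag = token.rpartition(sep)
            pairs.append((word, tag))
        else:
            pairs.append((token, None))
    return pairs


if __name__ == "__main__":
    # Hand-segmented for illustration only (an assumption, not segmenter output).
    segmented = [
        ["这", "是", "一个", "测试", "的", "句子", "。"],
        ["这", "是", "另", "一个", "句子", "。"],
    ]
    with open("sample-input.txt", "w", encoding="utf-8") as handle:
        handle.write(to_tagger_input(segmented))

    # The unsegmented run from the question parses as a single "word".
    print(parse_tagged_line("這是一個測試的句子。這是另一個句子。#NR"))
```

With the file written this way, the same `./stanford-postagger.sh models/chinese-distsim.tagger sample-input.txt` command should see one token per space-separated word rather than one token for the whole line.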

Thanks everyone.

Ryan Rapp

Yes, you need to pass the text through the segmenter before passing it to the POS tagger. – alvas Apr 19 '13 at 01:05
