I'm using the Stanford CoreNLP system with the following command:

java -cp stanford-corenlp-3.5.2.jar:stanford-chinese-corenlp-2015-04-20-models.jar -Xmx3g edu.stanford.nlp.pipeline.StanfordCoreNLP -props StanfordCoreNLP-chinese.properties -annotators segment,ssplit -file input.txt

This works great on small Chinese texts. However, I need to train an MT system, which only requires me to segment my input. So I just need to use -annotators segment, but with this parameter the system outputs an empty file. I could run the tool with the ssplit annotator as well, but I don't want to do that: my input is a parallel corpus that already contains one sentence per line, and ssplit will probably not split sentences perfectly, which would create problems in the parallel data.

Is there a way to tell the system to do segmentation only, or to tell it that the input already contains exactly one sentence per line?

alvas
dhokas

1 Answer

Use the Stanford Segmenter instead:

$ wget http://nlp.stanford.edu/software/stanford-segmenter-2015-04-20.zip
$ unzip stanford-segmenter-2015-04-20.zip
$ echo "应有尽有的丰富选择定将为您的旅程增添无数的赏心乐事" > input.txt
$ bash stanford-segmenter-2015-04-20/segment.sh ctb input.txt UTF-8 0 > output.txt
$ cat output.txt
应有尽有 的 丰富 选择 定 将 为 您 的 旅程 增添 无数 的 赏心 乐事
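Since your corpus is parallel, it may be worth confirming that the segmenter kept one sentence per line. A minimal sketch, assuming input.txt and output.txt are the files from the commands above; `same_line_count` is a hypothetical helper name, not part of the Stanford tools:

```shell
# Hypothetical helper: succeed only if both files have the same number of lines.
same_line_count() {
  # $(( ... )) strips the padding some wc implementations add around the count
  src_lines=$(( $(wc -l < "$1") ))
  seg_lines=$(( $(wc -l < "$2") ))
  if [ "$src_lines" -eq "$seg_lines" ]; then
    echo "OK: $src_lines lines in both files"
  else
    echo "MISMATCH: $1 has $src_lines lines, $2 has $seg_lines" >&2
    return 1
  fi
}
```

Calling `same_line_count input.txt output.txt` after segmentation returns a non-zero status if the line counts differ, which would mean the two sides of the corpus are no longer aligned.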

Besides the Stanford Segmenter, there are many other segmenters that might be more suitable; see "Is there any good open-source or freely available Chinese segmentation algorithm available?"


To continue using the Stanford NLP tools for POS tagging:

$ wget http://nlp.stanford.edu/software/stanford-postagger-full-2015-04-20.zip
$ unzip stanford-postagger-full-2015-04-20.zip
$ cd stanford-postagger-full-2015-04-20/
$ echo "应有尽有 的 丰富 选择 定 将 为 您 的 旅程 增添 无数 的 赏心 乐事" > input.txt
$ bash stanford-postagger.sh models/chinese-distsim.tagger input.txt > output.txt
$ cat output.txt 
应有尽有#VV 的#DEC 丰富#JJ 选择#NN 定#VV 将#AD 为#P 您#PN 的#DEG 旅程#NN 增添#VV 无数#CD 的#DEG 赏心#NN 乐事#NN
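If a later step needs the words and tags separately, the word#TAG tokens can be split apart. A rough sketch, assuming the `#` separator shown in the output above and space-separated tokens; `split_tagged` is a hypothetical helper name, and note that it flattens sentences into one token per line:

```shell
# Hypothetical helper: print "word<TAB>TAG" for every word#TAG token in a file.
split_tagged() {
  tr ' ' '\n' < "$1" | awk 'NF {
    i = match($0, /#[^#]*$/)          # position of the last "#" in the token
    if (i) print substr($0, 1, i - 1) "\t" substr($0, i + 1)
  }'
}
```

For example, `split_tagged output.txt` would list each token of the tagged file as a tab-separated word and tag.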
alvas
  • Thanks a lot alvas! Works perfectly! Do you know whether the segmentation model CTB from Chinese Treebank here is also the one used in the Stanford POS tagger? I will also need to use the POS tagger later, and if I could have the segmentation done in the same way, that would be great! – dhokas Jun 15 '15 at 08:57
  • Yes, CTB is from the Chinese Treebank, and it should be consistent with the Stanford POS tagger. – alvas Jun 15 '15 at 09:12