Chinese Segmentation : ICTCLAS Training Corpora

Asked Feb 23 '17 at 08:31


I am using the ICTCLAS segmentation tool for Chinese. According to "Automatic Recognition of Chinese Unknown Words Based on Roles Tagging" (Zhang, Liu, 2002), it was trained on the Peking University Corpus (PKU): "The training corpus came from one-month news from the People’s Daily with 2,305,896 Chinese characters, which are manually checked after word segmentation and POS tagging (It can be downloaded at icl.pku.edu.cn, the homepage of the Institute of Computational Linguistics, Peking University)."

However, I have not found any other mention of the training data since this 2002 paper, and I would like to confirm that the segmenter is still trained on PKU.

asked Feb 23 '17 at 08:31

Starckman

StackOverflow is probably not the best site for this type of subject. If it is still relevant, you might try Linguistics Stack Exchange, which already has many [questions about corpora](https://linguistics.stackexchange.com/questions/tagged/corpora). – Tsundoku Nov 14 '17 at 11:04


0 Answers