I am using the ICTCLAS segmentation tool for Chinese. According to "Automatic Recognition of Chinese Unknown Words Based on Roles Tagging" (Zhang, Liu, 2002), it was trained on the Peking University Corpus (PKU): "The training corpus came from one-month news from the People’s Daily with 2,305,896 Chinese characters, which are manually checked after word segmentation and POS tagging (It can be downloaded at icl.pku.edu.cn, the homepage of the Institute of Computational Linguistics, Peking University)."
However, I have not found any other mention of the training data since that 2002 paper. Can anyone confirm whether the segmenter is still trained on PKU?