
When I tokenize text that contains both Chinese and English, the segmenter splits English words into individual letters, which is not what I want. Consider the following code:

from nltk.tokenize.stanford_segmenter import StanfordSegmenter
segmenter = StanfordSegmenter()
segmenter.default_config('zh')
print(segmenter.segment('哈佛大学的Melissa Dell'))

The output will be 哈佛大学 的 M e l i s s a D e l l. How do I modify this behavior?

yhylord

2 Answers


You could try jieba; it keeps English words intact in mixed Chinese/English text:

import jieba
print(jieba.lcut('哈佛大学的Melissa Dell'))
# ['哈佛大学', '的', 'Melissa', ' ', 'Dell']
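
Note that jieba keeps the space between 'Melissa' and 'Dell' as its own token. If you don't want whitespace tokens, a simple filter works (this is just a sketch, not part of jieba itself):

import jieba

# Drop pure-whitespace tokens such as the ' ' between 'Melissa' and 'Dell'
tokens = [t for t in jieba.lcut('哈佛大学的Melissa Dell') if not t.isspace()]
print(tokens)  # ['哈佛大学', '的', 'Melissa', 'Dell']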
28potato

I can't speak for nltk, but Stanford CoreNLP does not exhibit this behavior when run on this sentence.

If you put your example sentence in example.txt and issue this command, you get proper tokenization:

java -Xmx8g edu.stanford.nlp.pipeline.StanfordCoreNLP -props StanfordCoreNLP-chinese.properties -file example.txt -outputFormat text

You might want to look into using stanza if you want to access Stanford CoreNLP via Python.

More info here: https://github.com/stanfordnlp/stanza
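
As a rough sketch, you can also tokenize the sentence directly from Python with stanza's own neural pipeline (note this uses stanza's tokenizer rather than the CoreNLP Java server, and the download step assumes network access):

import stanza

# One-time download of the Chinese models
stanza.download('zh')

# Build a Chinese pipeline with just the tokenizer
nlp = stanza.Pipeline('zh', processors='tokenize')

doc = nlp('哈佛大学的Melissa Dell')
for sentence in doc.sentences:
    print([token.text for token in sentence.tokens])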

StanfordNLPHelp