
When I tokenize text that contains both Chinese and English, the segmenter splits English words into individual letters, which is not what I want. Consider the following code:

from nltk.tokenize.stanford_segmenter import StanfordSegmenter
segmenter = StanfordSegmenter()
segmenter.default_config('zh')
print(segmenter.segment('哈佛大学的Melissa Dell'))

The output will be 哈佛大学 的 M e l i s s a D e l l. How do I modify this behavior?

yhylord

2 Answers


You could try jieba; it keeps English words intact in mixed Chinese/English text:

import jieba
print(jieba.lcut('哈佛大学的Melissa Dell'))
# ['哈佛大学', '的', 'Melissa', ' ', 'Dell']
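
Note that jieba keeps the space between 'Melissa' and 'Dell' as its own token. If you don't want whitespace tokens, a simple filter works (this is just a sketch, not part of jieba itself):

import jieba

# Drop pure-whitespace tokens such as the ' ' between 'Melissa' and 'Dell'
tokens = [t for t in jieba.lcut('哈佛大学的Melissa Dell') if not t.isspace()]
print(tokens)  # ['哈佛大学', '的', 'Melissa', 'Dell']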
28potato

I can't speak for nltk, but Stanford CoreNLP does not exhibit this behavior when run on this sentence.

If you put your example sentence in example.txt and issue this command, you get proper tokenization:

java -Xmx8g edu.stanford.nlp.pipeline.StanfordCoreNLP -props StanfordCoreNLP-chinese.properties -file example.txt -outputFormat text

You might want to look into using stanza if you want to access Stanford CoreNLP via Python.

More info here: https://github.com/stanfordnlp/stanza
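
As a rough sketch, you can also tokenize the sentence directly from Python with stanza's own neural pipeline (note this uses stanza's tokenizer rather than the CoreNLP Java server, and the download step assumes network access):

import stanza

# One-time download of the Chinese models
stanza.download('zh')

# Build a Chinese pipeline with just the tokenizer
nlp = stanza.Pipeline('zh', processors='tokenize')

doc = nlp('哈佛大学的Melissa Dell')
for sentence in doc.sentences:
    print([token.text for token in sentence.tokens])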

StanfordNLPHelp