There are two (ok, three... see "Update 3" below for the third) separate things going on:
1) Your code is returning two trees (two ROOTs), but you only expect to get one. This is happening because raw_parse_sents expects a list of sentences, not a single sentence; if you give it a string, it parses each character in the string as if it were its own sentence and returns a list of one-character trees. So either pass raw_parse_sents a list (there's a sketch of that after the combined example below), or use raw_parse instead.
2) You haven't specified a model_path, and the default is English. There are five options for Chinese, but it looks like this one matches the online parser:
parser = stanford.StanfordParser(model_path='edu/stanford/nlp/models/lexparser/xinhuaFactored.ser.gz', path_to_jar='stanford-parser.jar', path_to_models_jar='stanford-parser-3.5.1-models.jar')
Combining these two changes, I am able to match the online parser (I also had to cast the returned listiterator to a list in order to match your output format):
from nltk.parse import stanford
s = '你好'.decode('utf-8')
print s.encode('utf-8')
parser = stanford.StanfordParser(
    model_path='edu/stanford/nlp/models/lexparser/xinhuaFactored.ser.gz',
    path_to_jar='stanford-parser.jar',
    path_to_models_jar='stanford-parser-3.5.1-models.jar')
print list(parser.raw_parse(s))
> 你好
> [Tree('ROOT', [Tree('IP', [Tree('VP', [Tree('VV', ['\u4f60\u597d'])])])])]
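If you'd rather keep raw_parse_sents from point 1), just hand it a list of sentences instead of a bare string. A rough sketch, reusing the parser object from above (the exact nesting of the returned iterators has shifted a bit between NLTK versions, so I wrap things in list() while inspecting them):
for sentence_result in parser.raw_parse_sents([s]):
    print list(sentence_result)  # one batch of parse trees per input sentence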
Update 1:
I realized you might be looking for an output format more like the one on the website as well, in which case this works:
for tree in parser.raw_parse(s):
    print tree  # or print tree.pformat().encode('utf-8') to force an encoding
Update 2:
Apparently if your version of NLTK is earlier than 3.0.2, Tree.pformat() was Tree.pprint(). From https://github.com/nltk/nltk/wiki/Porting-your-code-to-NLTK-3.0:
Printing changes (from 3.0.2, see https://github.com/nltk/nltk/issues/804):
- classify.decisiontree.DecisionTreeClassifier.pp → pretty_format
- metrics.confusionmatrix.ConfusionMatrix.pp → pretty_format
- sem.lfg.FStructure.pprint → pretty_format
- sem.drt.DrtExpression.pretty → pretty_format
- parse.chart.Chart.pp → pretty_format
- Tree.pprint() → pformat
- FreqDist.pprint → pformat
- Tree.pretty_print → pprint
- Tree.pprint_latex_qtree → pformat_latex_qtree
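So if the pformat() call from Update 1 raises an AttributeError on an older NLTK, a fallback along these lines should cover both spellings (a sketch; tree is one of the trees returned by raw_parse, and per the table above the old pprint() returned the string that pformat() returns now):
try:
    output = tree.pformat()   # NLTK 3.0.2 and later
except AttributeError:
    output = tree.pprint()    # earlier NLTK: pprint() returned the string
print output.encode('utf-8')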
Update 3:
I am now trying to match the output for the sentence in your comment, '你好,我心情不错今天,你呢?'.
I referred to the Stanford Parser FAQ extensively while writing this response and suggest you check it out (especially "Can you give me some help in getting started parsing Chinese?"). Here's what I've learned:
In general, you need to "segment" Chinese text into words (consisting of one or more characters) separated by spaces before parsing it. The online parser does this, and you can see the output of both the segmentation step and the parsing step on the web page. For our test sentence, the segmentation it shows is '你好 , 我 心情 不错 今天 , 你 呢 ?'.
If I run this segmentation string through the xinhuaFactored model locally, my output matches the online parser exactly.
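In code, that step is just the same raw_parse call as before on the space-separated string (reusing the parser object from the first example):
segmented = '你好 , 我 心情 不错 今天 , 你 呢 ?'.decode('utf-8')
for tree in parser.raw_parse(segmented):
    print tree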
So we need to run our text through a word segmenter before running it through the parser. The FAQ recommends the Stanford Word Segmenter, which is probably what the online parser is using anyway: http://nlp.stanford.edu/software/segmenter.shtml.
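If your NLTK version includes it, there is also a wrapper for the segmenter in nltk.tokenize.stanford_segmenter. The sketch below follows the example in the NLTK documentation; the jar name and data paths are placeholders for wherever you unpacked the segmenter download, and ctb.gz is the Chinese Treebank model that ships with it:
from nltk.tokenize.stanford_segmenter import StanfordSegmenter
segmenter = StanfordSegmenter(
    path_to_jar='stanford-segmenter.jar',  # placeholder: use the jar from the download
    path_to_sihan_corpora_dict='./data',
    path_to_model='./data/ctb.gz',
    path_to_dict='./data/dict-chris6.ser.gz')
sentence = '你好,我心情不错今天,你呢?'.decode('utf-8')
segmented = segmenter.segment(sentence)  # returns the text with spaces between words
print list(parser.raw_parse(segmented))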
(As the FAQ mentions, the parser also contains a model xinhuaFactoredSegmenting which does an approximate segmentation as part of the parsing call. However, the FAQ calls this approach "reasonable, but not excellent", and its output doesn't match the online parser anyway, which is our standard.)
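If you want to try that model anyway, it should just be a matter of swapping the model path in the constructor; I'm assuming it sits in the usual edu/stanford/nlp/models/lexparser/ directory of the models jar:
parser_seg = stanford.StanfordParser(
    model_path='edu/stanford/nlp/models/lexparser/xinhuaFactoredSegmenting.ser.gz',
    path_to_jar='stanford-parser.jar',
    path_to_models_jar='stanford-parser-3.5.1-models.jar')
print list(parser_seg.raw_parse('你好,我心情不错今天,你呢?'.decode('utf-8')))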