There are two (ok, three... see "Update 3" below for the third) separate things going on:
1) Your code is returning two trees (two ROOTs), but you only expect to get one. This is happening because raw_parse_sents expects a list of sentences, not a single sentence; if you give it a string, it parses each character in the string as if it were its own sentence and returns a list of one-character trees. So either pass raw_parse_sents a list (there's a sketch of that after the combined example below), or use raw_parse instead.
2) You haven't specified a model_path, and the default is English. There are five options for Chinese, but it looks like this one matches the online parser:
parser = stanford.StanfordParser(model_path='edu/stanford/nlp/models/lexparser/xinhuaFactored.ser.gz', path_to_jar='stanford-parser.jar', path_to_models_jar='stanford-parser-3.5.1-models.jar')
Combining these two changes, I am able to match the online parser (I also had to cast the returned listiterator to a list in order to match your output format):
from nltk.parse import stanford
s = '你好'.decode('utf-8')
print s.encode('utf-8')
parser = stanford.StanfordParser(
    model_path='edu/stanford/nlp/models/lexparser/xinhuaFactored.ser.gz',
    path_to_jar='stanford-parser.jar',
    path_to_models_jar='stanford-parser-3.5.1-models.jar')
print list(parser.raw_parse(s))
> 你好
> [Tree('ROOT', [Tree('IP', [Tree('VP', [Tree('VV', ['\u4f60\u597d'])])])])]
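If you'd rather keep raw_parse_sents from point 1), just hand it a list of sentences instead of a bare string. A rough sketch, reusing the parser object from above (the exact nesting of the returned iterators has shifted a bit between NLTK versions, so I wrap things in list() while inspecting them):
for sentence_result in parser.raw_parse_sents([s]):
    print list(sentence_result)  # one batch of parse trees per input sentence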
Update 1:
I realized you might be looking for an output format more like the one on the website as well, in which case this works:
for tree in parser.raw_parse(s):
    print tree  # or print tree.pformat().encode('utf-8') to force an encoding
Update 2:
Apparently if your version of NLTK is earlier than 3.0.2, Tree.pformat() was Tree.pprint(). From https://github.com/nltk/nltk/wiki/Porting-your-code-to-NLTK-3.0:
Printing changes (from 3.0.2, see https://github.com/nltk/nltk/issues/804):
- classify.decisiontree.DecisionTreeClassifier.pp → pretty_format
- metrics.confusionmatrix.ConfusionMatrix.pp → pretty_format
- sem.lfg.FStructure.pprint → pretty_format
- sem.drt.DrtExpression.pretty → pretty_format
- parse.chart.Chart.pp → pretty_format
- Tree.pprint() → pformat
- FreqDist.pprint → pformat
- Tree.pretty_print → pprint
- Tree.pprint_latex_qtree → pformat_latex_qtree
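So if the pformat() call from Update 1 raises an AttributeError on an older NLTK, a fallback along these lines should cover both spellings (a sketch; tree is one of the trees returned by raw_parse, and per the table above the old pprint() returned the string that pformat() returns now):
try:
    output = tree.pformat()   # NLTK 3.0.2 and later
except AttributeError:
    output = tree.pprint()    # earlier NLTK: pprint() returned the string
print output.encode('utf-8')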
Update 3:
I am now trying to match the output for the sentence in your comment, '你好,我心情不错今天,你呢?'.
I referred to the Stanford Parser FAQ extensively while writing this response and suggest you check it out (especially "Can you give me some help in getting started parsing Chinese?"). Here's what I've learned:
In general, you need to "segment" Chinese text into words (consisting of one or more characters) separated by spaces before parsing it. The online parser does this, and you can see the output of both the segmentation step and the parsing step on the web page. For our test sentence, the segmentation it shows is '你好 , 我 心情 不错 今天 , 你 呢 ?'.
If I run this segmentation string through the xinhuaFactored model locally, my output matches the online parser exactly.
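In code, that step is just the same raw_parse call as before on the space-separated string (reusing the parser object from the first example):
segmented = '你好 , 我 心情 不错 今天 , 你 呢 ?'.decode('utf-8')
for tree in parser.raw_parse(segmented):
    print tree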
So we need to run our text through a word segmenter before running it through the parser. The FAQ recommends the Stanford Word Segmenter, which is probably what the online parser is using anyway: http://nlp.stanford.edu/software/segmenter.shtml.
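If your NLTK version includes it, there is also a wrapper for the segmenter in nltk.tokenize.stanford_segmenter. The sketch below follows the example in the NLTK documentation; the jar name and data paths are placeholders for wherever you unpacked the segmenter download, and ctb.gz is the Chinese Treebank model that ships with it:
from nltk.tokenize.stanford_segmenter import StanfordSegmenter
segmenter = StanfordSegmenter(
    path_to_jar='stanford-segmenter.jar',  # placeholder: use the jar from the download
    path_to_sihan_corpora_dict='./data',
    path_to_model='./data/ctb.gz',
    path_to_dict='./data/dict-chris6.ser.gz')
sentence = '你好,我心情不错今天,你呢?'.decode('utf-8')
segmented = segmenter.segment(sentence)  # returns the text with spaces between words
print list(parser.raw_parse(segmented))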
(As the FAQ mentions, the parser also contains a model xinhuaFactoredSegmenting which does an approximate segmentation as part of the parsing call. However, the FAQ calls this approach "reasonable, but not excellent", and its output doesn't match the online parser anyway, which is our standard.)
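If you want to try that model anyway, it should just be a matter of swapping the model path in the constructor; I'm assuming it sits in the usual edu/stanford/nlp/models/lexparser/ directory of the models jar:
parser_seg = stanford.StanfordParser(
    model_path='edu/stanford/nlp/models/lexparser/xinhuaFactoredSegmenting.ser.gz',
    path_to_jar='stanford-parser.jar',
    path_to_models_jar='stanford-parser-3.5.1-models.jar')
print list(parser_seg.raw_parse('你好,我心情不错今天,你呢?'.decode('utf-8')))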