Stanford dependency parser with NLTK: UnicodeDecodeError

Question

I am trying to run the following lines of code:

import os
os.environ['JAVAHOME'] = 'path/to/java.exe'
os.environ['STANFORD_PARSER'] = 'path/to/stanford-parser.jar'
os.environ['STANFORD_MODELS'] = 'path/to/stanford-parser-3.8.0-models.jar'

from nltk.parse.stanford import StanfordDependencyParser
dep_parser = StanfordDependencyParser(model_path="path/to/englishPCFG.ser.gz")
sentence = "sample sentence ..."

# Dependency Parsing:
print("Dependency Parsing:")
print([parse.tree() for parse in dep_parser.raw_parse(sentence)])

At this line:

print([parse.tree() for parse in dep_parser.raw_parse(sentence)])

I get the following error:

Traceback (most recent call last):
  File "C:/Users/Norbert/PycharmProjects/untitled/StanfordDependencyParser.py", line 21, in <module>
    print([parse.tree() for parse in dep_parser.raw_parse(sentence)])
  File "C:\Users\Norbert\AppData\Local\Programs\Python\Python36\lib\site-packages\nltk\parse\stanford.py", line 134, in raw_parse
    return next(self.raw_parse_sents([sentence], verbose))
  File "C:\Users\Norbert\AppData\Local\Programs\Python\Python36\lib\site-packages\nltk\parse\stanford.py", line 152, in raw_parse_sents
    return self._parse_trees_output(self._execute(cmd, '\n'.join(sentences), verbose))
  File "C:\Users\Norbert\AppData\Local\Programs\Python\Python36\lib\site-packages\nltk\parse\stanford.py", line 218, in _execute
    stdout=PIPE, stderr=PIPE)
  File "C:\Users\Norbert\AppData\Local\Programs\Python\Python36\lib\site-packages\nltk\internals.py", line 135, in java
    print(_decode_stdoutdata(stderr))
  File "C:\Users\Norbert\AppData\Local\Programs\Python\Python36\lib\site-packages\nltk\internals.py", line 737, in _decode_stdoutdata
    return stdoutdata.decode(encoding)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xac in position 3097: invalid start byte

Any idea what could be wrong? I am not even dealing with any non-UTF-8 text.
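
One thing the traceback itself shows: the failing call is `_decode_stdoutdata(stderr)` in `nltk/internals.py`, so the bytes that cannot be decoded are what the Java process wrote to stderr, not the input sentence. Below is a minimal sketch of that failure mode in plain Python, with no NLTK or Java involved. The names `raw_stderr` and `decode_lenient` are hypothetical, and the byte string is a made-up stand-in, since the real stderr content is not known.

```python
# Hypothetical stand-in for the bytes a subprocess wrote to stderr.
# The real content is unknown; only the byte 0xAC comes from the traceback.
raw_stderr = b"Error: \xac sample"


def decode_lenient(data, encoding="utf-8"):
    """Decode bytes, replacing anything invalid in `encoding` with U+FFFD."""
    return data.decode(encoding, errors="replace")


# A strict UTF-8 decode fails because 0xAC is a continuation byte
# and cannot start a character.
try:
    raw_stderr.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError as exc:
    strict_failed = True
    print("strict decode failed:", exc.reason)

# A lenient decode keeps the readable part of the message.
print(decode_lenient(raw_stderr))
```

Decoding the captured bytes leniently like this would at least make the underlying Java message readable, which is probably where the actual problem is reported.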

Is "sample sentence ..." the sentence under which you are seeing the error? — gimg1, Jul 27 '17 at 19:55
@gimg1 No, I just put that as a placeholder. I tried about 5 different sentences containing only plain a-zA-Z letters, and each one gives me the same error — Uther Pendragon, Jul 27 '17 at 21:09
Can you try encoding the string to UTF-8, just to be sure there is no character in there causing the error? `sentence.encode('utf-8').strip()` — gimg1, Jul 28 '17 at 23:30
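
The check suggested in the last comment can be sketched as a small helper. In Python 3, `str.encode('utf-8')` succeeds for almost any string, so listing the characters outside the ASCII range may be more informative than encoding. The helper name `non_ascii_chars` is hypothetical and not part of NLTK.

```python
def non_ascii_chars(text):
    """Return (index, character) pairs for every non-ASCII character in text."""
    return [(i, ch) for i, ch in enumerate(text) if ord(ch) > 127]


sentence = "sample sentence ..."
suspects = non_ascii_chars(sentence)
if suspects:
    print("non-ASCII characters found:", suspects)
else:
    print("sentence is plain ASCII")
```

If this reports plain ASCII for every sentence tried, the input text is unlikely to be the source of the undecodable byte.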

Joe9008 · Answer 1 · 2018-04-05T13:22:50.997

1

I can print a few things by doing this. It may not be exactly what you wanted, but it is a start.

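print("Dependency Parsing:")
# dep_parser is the StanfordDependencyParser instance from the question
result = dep_parser.raw_parse(sentence)
# take the first parse once; calling next(result) twice would skip it
dep = next(result)
# print(dep)
print(list(dep.triples()))

Uncomment the print(dep) line if you want to see the entire output.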

edited Apr 05 '18 at 13:22

answered Mar 22 '18 at 08:51

Joe9008

