
Hi, I am trying to learn NLTK. I am new to Python as well. I am trying the following:

>>> import nltk
>>> nltk.pos_tag(nltk.word_tokenize("John lived in China"))

I get the following error message:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
    nltk.pos_tag(nltk.word_tokenize("John lived in California"))
  File "C:\Python34\lib\site-packages\nltk\tag\__init__.py", line 100, in pos_tag
    tagger = load(_POS_TAGGER)
  File "C:\Python34\lib\site-packages\nltk\data.py", line 779, in load
    resource_val = pickle.load(opened_resource)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xcb in position 0: ordinal not in range(128)

I have downloaded all available models (including the maxent_treebank_pos_tagger).

The default system encoding is UTF-8:

>>> import sys
>>> sys.getdefaultencoding()
'utf-8'

I opened up the data.py file, and this is the content around the failing line:

774    # Load the resource.
775    opened_resource = _open(resource_url)
776    if format == 'raw':
777        resource_val = opened_resource.read()
778    elif format == 'pickle':
779        resource_val = pickle.load(opened_resource)
780    elif format == 'json':
781        import json

What am I doing wrong here?

– Niranjan Sonachalam

5 Answers


OK, I found the solution to it. It looks like a problem in the NLTK source itself. Check here

I opened up data.py and modified line 779 as below:

resource_val = pickle.load(opened_resource) #old
resource_val = pickle.load(opened_resource, encoding='iso-8859-1') #new
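
Why 'iso-8859-1' works here: that codec maps every byte value from 0x00 to 0xFF to a code point, so decoding the old Python 2 byte strings inside the pickle can never fail. A minimal sketch of the same mechanism, using a hypothetical one-byte protocol-0 pickle in place of the real tagger data:

import pickle

# A protocol-0 pickle of a Python 2 str holding the byte 0xcb,
# the same byte the traceback complains about (an illustrative
# stand-in for the real english.pickle contents).
py2_pickle = b"S'\\xcb'\n."

try:
    pickle.loads(py2_pickle)  # default encoding is ASCII
except UnicodeDecodeError as e:
    print(e)  # 'ascii' codec can't decode byte 0xcb in position 0 ...

# iso-8859-1 decodes any byte, so this always succeeds:
print(pickle.loads(py2_pickle, encoding='iso-8859-1'))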

– Niranjan Sonachalam

The fundamental problem is that NLTK 2.x is not supported for Python 3; NLTK 3 is an ongoing effort to release a fully Python 3-compatible version.

The simple workaround is to download the latest NLTK 3.x and use that instead.

If you want to participate in finishing the port to Python 3, you probably need a deeper understanding of the differences between Python 2 and Python 3; in particular, for this case, how the fundamental string type in Python 3 is a Unicode string ('...'), not a byte string (b'...' in Python 3 notation) like in Python 2. See also http://nedbatchelder.com/text/unipain.html
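
To make that difference concrete, here is a minimal Python 3 sketch (purely illustrative):

text = "naïve"                 # Python 3 str: Unicode code points
data = text.encode("utf-8")    # bytes: b'na\xc3\xafve'

print(type(text), type(data))  # <class 'str'> <class 'bytes'>
print(data.decode("utf-8"))    # round-trips back to 'naïve'

# Decoding non-ASCII bytes with the ASCII codec raises the same
# error class seen in the question:
try:
    data.decode("ascii")
except UnicodeDecodeError as e:
    print(e)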

FWIW, see also https://github.com/nltk/nltk/issues/169#issuecomment-12778108 for a fix identical to yours. The bug you linked to has already been fixed in NLTK 3.0 (presumably by a fix to the actual data files instead; I think in 3.0a3).

– tripleee

I'm coming to this late, but in case it helps someone else who comes across this: what worked for me was to decode the text before putting it into word_tokenize, i.e.:

import nltk

# Python 2: str is a byte string, so decode it to Unicode first
raw_text = "John lived in China"
to_tokenize = raw_text.decode('utf-8')
tokenized = nltk.word_tokenize(to_tokenize)
output = nltk.pos_tag(tokenized)

Maybe that'll work for someone else!
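
Note that str.decode only exists on Python 2, where str is a byte string; on Python 3 the text is already Unicode, so this recipe does not apply directly. A rough Python 3 equivalent, assuming the text actually arrives as bytes (raw_bytes is illustrative):

import nltk

raw_bytes = "John lived in China".encode('utf-8')  # pretend this came from a file
to_tokenize = raw_bytes.decode('utf-8')            # bytes -> str
print(nltk.pos_tag(nltk.word_tokenize(to_tokenize)))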

– zSand

Using Python 3.4 and NLTK 3, you can fix this by doing:

import pickle

f = open('myClassifier_or_X_trained_model', mode='rb')
whereIuseTheModel = pickle.load(f, encoding='UTF-8')

Note that the mode to open is rb and the encoding is 'UTF-8'. This solution doesn't require editing data.py.
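
A tidier variant of the same idea, using a context manager so the file is closed automatically (the filename is just a placeholder):

import pickle

# hypothetical path: substitute your own pickled model or classifier
with open('myClassifier_or_X_trained_model', 'rb') as f:
    model = pickle.load(f, encoding='UTF-8')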

– Yekatandilburg

I tried all the answers but nothing worked, so I followed the two links below and then took the following steps:

https://github.com/nltk/nltk/issues/169

https://github.com/nltk/nltk_data/tree/gh-pages/packages/taggers

  • downloaded the maxent_treebank_pos_tagger.zip file.
  • unzipped it, copied the english.pickle file, and replaced the english.pickle file already present in my nltk_data taggers folder, C:\nltk_data\taggers\maxent_treebank_pos_tagger, with the new one.
  • I also replaced the one in the folder C:\nltk_data\taggers\maxent_treebank_pos_tagger\PY3 with the new one.

PS: I do not know what else might be affected, but for now I'm OK.
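
After swapping the files in, a quick sanity check (not part of the original steps, just an illustration) is to re-run the failing call:

import nltk

# should now complete without the UnicodeDecodeError
print(nltk.pos_tag(nltk.word_tokenize("John lived in China")))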

– ryanyuyu
  • If you could provide some more information about the links you provided (like some search terms), it would make your answer more robust, in case the links get edited/deleted. – ryanyuyu Jan 23 '15 at 23:12