
Hi, I am trying to learn NLTK. I am new to Python as well. I am trying the following:

>>> import nltk
>>> nltk.pos_tag(nltk.word_tokenize("John lived in China"))

I get the following error message:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
    nltk.pos_tag(nltk.word_tokenize("John lived in California"))
  File "C:\Python34\lib\site-packages\nltk\tag\__init__.py", line 100, in pos_tag
    tagger = load(_POS_TAGGER)
  File "C:\Python34\lib\site-packages\nltk\data.py", line 779, in load
    resource_val = pickle.load(opened_resource)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xcb in position 0: ordinal not in range(128)

I have downloaded all available models (including the maxent_treebank_pos_tagger).

The default system encoding is UTF-8:

>>> import sys
>>> sys.getdefaultencoding()
'utf-8'

I opened up the data.py file, and this is the content around the failing line:

774    # Load the resource.
775    opened_resource = _open(resource_url)
776    if format == 'raw':
777        resource_val = opened_resource.read()
778    elif format == 'pickle':
779        resource_val = pickle.load(opened_resource)
780    elif format == 'json':
781        import json

What am I doing wrong here?

– Niranjan Sonachalam

5 Answers


OK, I found the solution to it. It looks like a problem in the NLTK source itself. Check here

I opened up data.py and modified line 779 as below:

resource_val = pickle.load(opened_resource) #old
resource_val = pickle.load(opened_resource, encoding='iso-8859-1') #new
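
Why 'iso-8859-1' works here: that codec maps every byte value from 0x00 to 0xFF to a code point, so decoding the old Python 2 byte strings inside the pickle can never fail. A minimal sketch of the same mechanism, using a hypothetical one-byte protocol-0 pickle in place of the real tagger data:

import pickle

# A protocol-0 pickle of a Python 2 str holding the byte 0xcb,
# the same byte the traceback complains about (an illustrative
# stand-in for the real english.pickle contents).
py2_pickle = b"S'\\xcb'\n."

try:
    pickle.loads(py2_pickle)  # default encoding is ASCII
except UnicodeDecodeError as e:
    print(e)  # 'ascii' codec can't decode byte 0xcb in position 0 ...

# iso-8859-1 decodes any byte, so this always succeeds:
print(pickle.loads(py2_pickle, encoding='iso-8859-1'))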

– Niranjan Sonachalam

The fundamental problem is that NLTK 2.x is not supported for Python 3; NLTK 3 is an ongoing effort to release a fully Python 3-compatible version.

The simple workaround is to download the latest NLTK 3.x and use that instead.

If you want to participate in finishing the port to Python 3, you probably need a deeper understanding of the differences between Python 2 and Python 3; in particular, for this case, how the fundamental string type in Python 3 is a Unicode string ('...'), not a byte string (b'...' in Python 3 notation) like in Python 2. See also http://nedbatchelder.com/text/unipain.html
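
To make that difference concrete, here is a minimal Python 3 sketch (purely illustrative):

text = "naïve"                 # Python 3 str: Unicode code points
data = text.encode("utf-8")    # bytes: b'na\xc3\xafve'

print(type(text), type(data))  # <class 'str'> <class 'bytes'>
print(data.decode("utf-8"))    # round-trips back to 'naïve'

# Decoding non-ASCII bytes with the ASCII codec raises the same
# error class seen in the question:
try:
    data.decode("ascii")
except UnicodeDecodeError as e:
    print(e)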

FWIW, see also https://github.com/nltk/nltk/issues/169#issuecomment-12778108 for a fix identical to yours. The bug you linked to has already been fixed in NLTK 3.0 (presumably by a fix to the actual data files instead; I think in 3.0a3).

– tripleee

I'm coming to this late, but in case it helps someone else who comes across this: what worked for me was to decode the text before putting it into word_tokenize, i.e.:

import nltk

# Python 2: str is a byte string, so decode it to Unicode first
raw_text = "John lived in China"
to_tokenize = raw_text.decode('utf-8')
tokenized = nltk.word_tokenize(to_tokenize)
output = nltk.pos_tag(tokenized)

Maybe that'll work for someone else!
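
Note that str.decode only exists on Python 2, where str is a byte string; on Python 3 the text is already Unicode, so this recipe does not apply directly. A rough Python 3 equivalent, assuming the text actually arrives as bytes (raw_bytes is illustrative):

import nltk

raw_bytes = "John lived in China".encode('utf-8')  # pretend this came from a file
to_tokenize = raw_bytes.decode('utf-8')            # bytes -> str
print(nltk.pos_tag(nltk.word_tokenize(to_tokenize)))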

– zSand

Using Python 3.4 and NLTK 3, you can fix this by doing:

import pickle

f = open('myClassifier_or_X_trained_model', mode='rb')
whereIuseTheModel = pickle.load(f, encoding='UTF-8')

Note that the mode to open is rb and the encoding is 'UTF-8'. This solution doesn't require editing data.py.
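
A tidier variant of the same idea, using a context manager so the file is closed automatically (the filename is just a placeholder):

import pickle

# hypothetical path: substitute your own pickled model or classifier
with open('myClassifier_or_X_trained_model', 'rb') as f:
    model = pickle.load(f, encoding='UTF-8')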

– Yekatandilburg

I tried all the answers but nothing worked, so I followed the two links below and then took the following steps:

https://github.com/nltk/nltk/issues/169

https://github.com/nltk/nltk_data/tree/gh-pages/packages/taggers

  • downloaded the maxent_treebank_pos_tagger.zip file.
  • unzipped it, copied the english.pickle file, and replaced the english.pickle file already present in my nltk_data taggers folder, C:\nltk_data\taggers\maxent_treebank_pos_tagger, with the new one.
  • I also replaced the one in the folder C:\nltk_data\taggers\maxent_treebank_pos_tagger\PY3 with the new one.

PS: I do not know what else might be affected, but for now I'm OK.
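
After swapping the files in, a quick sanity check (not part of the original steps, just an illustration) is to re-run the failing call:

import nltk

# should now complete without the UnicodeDecodeError
print(nltk.pos_tag(nltk.word_tokenize("John lived in China")))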

– ryanyuyu
  • If you could provide some more information about the links you provided (like some search terms), it would make your answer more robust, in case the links get edited/deleted. – ryanyuyu Jan 23 '15 at 23:12