
I am using NLTK 3.0 with Python 3.4 and cannot do POS tagging because of a UnicodeDecodeError (shown below). I have read the posts about similar problems but could not find a way to solve this one. Most of them say that upgrading to NLTK 3.0 fixes the problem, but I already have NLTK 3.0. According to these posts, a change in NLTK's data.py also solves it, but the NLTK developers discourage doing that. Here is my code:

from nltk.tag import pos_tag
from nltk.tokenize import word_tokenize
pos_tag(word_tokenize("John's big idea isn't all that bad."))

and here is the error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xcb in position 0: ordinal not in range(128)

Is there any way to do it without manipulating data.py? Any idea would be appreciated.

Mohammadreza
  • Did you download NLTK data using the provided interface (`nltk.download()` or something like this), and not by hand (in which case you might have data for Py2)? I have exactly the same setup as yours and can't reproduce your error. – michaelmeyer Oct 27 '14 at 08:33

2 Answers


The current version of nltk_data provides two versions of the pickle files: one for Python 2 and one for Python 3. For example, there is one english.pickle at nltk_data/taggers/maxent_treebank_pos_tagger and one at nltk_data/taggers/maxent_treebank_pos_tagger/PY3. The newest nltk picks the right one automatically through the py3_data decorator.

In short, if you download the newest nltk_data, but don't have the newest nltk, it may load the wrong pickle file, raising the UnicodeDecodeError exception.
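The failure mode can be reproduced without NLTK. The sketch below is an illustration, not a copy of the tagger data: it assumes the Python 2 pickle contains a byte string with a non-ASCII byte, and uses a hand-written protocol-0 pickle of the Python 2 string '\xcb' as a stand-in. Under Python 3, pickle.loads decodes such strings as ASCII by default, which gives the same kind of error as in the question.

```python
import pickle

# Hand-written protocol-0 pickle, equivalent to what Python 2 would emit for
# the byte string '\xcb'. This is a stand-in (assumption) for the Python 2
# tagger pickle, not real NLTK data.
PY2_PICKLE = b"S'\\xcb'\np0\n."


def try_load(data, **kwargs):
    """Unpickle data, returning the UnicodeDecodeError instead of raising it."""
    try:
        return pickle.loads(data, **kwargs)
    except UnicodeDecodeError as exc:
        return exc


# Default behaviour on Python 3: Python 2 str objects are decoded as ASCII.
default_result = try_load(PY2_PICKLE)

# Explicit encodings let Python 3 read the same bytes.
latin1_result = try_load(PY2_PICKLE, encoding="latin-1")
bytes_result = try_load(PY2_PICKLE, encoding="bytes")

print(repr(default_result))
print(repr(latin1_result))
print(repr(bytes_result))
```

This only shows why a Python 2 pickle can fail to load under Python 3. For NLTK itself, the fix described in this answer is to make sure the PY3 copy of the pickle is the one being loaded.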

Note: even if you already have the newest nltk, you may encounter a path error in which "PY3" appears twice in the path of the pickle file. This may mean that some developers were not aware of py3_data and handled the path redundantly. You can remove or revert the redundancy yourself. See this pull request for an example.
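To make the path logic concrete, here is a minimal sketch of the idea. The function py3_resource_path is hypothetical and is not NLTK's py3_data implementation; it only shows how a PY3 directory can be inserted into a resource path, and how checking for an existing PY3 component avoids the doubled "PY3" described in the note.

```python
import posixpath
import sys


def py3_resource_path(resource, py_major=sys.version_info[0]):
    """Hypothetical helper: return the PY3 variant of an nltk_data path.

    Not part of NLTK. The check for an existing PY3 directory makes the
    function safe to apply more than once.
    """
    if py_major < 3:
        # Python 2 uses the pickle at the original location.
        return resource
    directory, filename = posixpath.split(resource)
    if posixpath.basename(directory) == "PY3":
        # Already rewritten; adding PY3 again would give .../PY3/PY3/...
        return resource
    return posixpath.join(directory, "PY3", filename)


path = "taggers/maxent_treebank_pos_tagger/english.pickle"
once = py3_resource_path(path, py_major=3)
twice = py3_resource_path(once, py_major=3)
print(once)
print(twice)
```

Applying the helper twice gives the same result as applying it once, which is the property the redundant code paths mentioned above lack.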

Ziyuan

I don't have any problems with Python 3:

alvas@ubi:~$ python3
Python 3.4.0 (default, Apr 11 2014, 13:05:11) 
[GCC 4.8.2] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from nltk import word_tokenize, pos_tag
>>> pos_tag(word_tokenize("John's big idea isn't all that bad."))
[('John', 'NNP'), ("'s", 'POS'), ('big', 'JJ'), ('idea', 'NN'), ('is', 'VBZ'), ("n't", 'RB'), ('all', 'DT'), ('that', 'DT'), ('bad', 'JJ'), ('.', '.')]

Check that utf-8 is your default encoding, as reported by sys.getdefaultencoding():

>>> import sys
>>> sys.getdefaultencoding()
'utf-8'

If not, there are a few things you can do to explicitly specify Python's encoding; see "Why should we NOT use sys.setdefaultencoding("utf-8") in a py script?"
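A small sketch for collecting the relevant settings in one place. The function name encoding_report is made up for illustration. On Python 3, sys.getdefaultencoding() returns 'utf-8' and sys.setdefaultencoding does not exist, so this check mainly confirms which interpreter is running.

```python
import locale
import sys


def encoding_report():
    """Collect the interpreter's encoding-related settings (illustrative helper)."""
    return {
        "python_version": "%d.%d" % sys.version_info[:2],
        # Encoding used for implicit str/bytes conversions.
        "default_encoding": sys.getdefaultencoding(),
        # Encoding used for file names on this system.
        "filesystem_encoding": sys.getfilesystemencoding(),
        # Locale-dependent encoding, used by open() when none is given.
        "preferred_encoding": locale.getpreferredencoding(False),
        # Present only on Python 2 (after reload(sys)); absent on Python 3.
        "has_setdefaultencoding": hasattr(sys, "setdefaultencoding"),
    }


report = encoding_report()
for key, value in sorted(report.items()):
    print("%s: %s" % (key, value))
```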

alvas
  • Thank you alvas. But reload(sys) and sys.setdefaultencoding will throw me "undefined variable!". It seems that my python does not know functions "reload" and "setdefaultencoding". Do you know the reason or is there any other way to change the default encoding? Thanks in advance – Mohammadreza Oct 29 '14 at 00:17
  • What is your current output when you do `import sys; sys.getdefaultencoding()`? – alvas Oct 29 '14 at 08:40
  • The output error is: sys.setdefaultencoding("utf-8") AttributeError: 'module' object has no attribute 'setdefaultencoding' – Mohammadreza Oct 30 '14 at 00:09
  • Sorry for misunderstanding. The output is "utf-8". – Mohammadreza Oct 30 '14 at 07:17