Python NLTK Word Tokenize UnicodeDecode Error

Question

I get the error below when running the following code, which reads a text file and tokenizes its words with NLTK. Any ideas? The text file can be found here

from nltk.tokenize import word_tokenize
short_pos = open("./positive.txt","r").read()
#short_pos = short_pos.decode('utf-8').lower()
short_pos_words = word_tokenize(short_pos)

Error:

Traceback (most recent call last):
  File "sentimentAnalysis.py", line 19, in <module>
    short_pos_words = word_tokenize(short_pos)
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/__init__.py", line 106, in word_tokenize
    return [token for sent in sent_tokenize(text, language)
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/__init__.py", line 91, in sent_tokenize
    return tokenizer.tokenize(text)
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1226, in tokenize
    return list(self.sentences_from_text(text, realign_boundaries))
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1274, in sentences_from_text
    return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1265, in span_tokenize
    return [(sl.start, sl.stop) for sl in slices]
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1304, in _realign_boundaries
    for sl1, sl2 in _pair_iter(slices):
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 311, in _pair_iter
    for el in it:
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1280, in _slices_from_text
    if self.text_contains_sentbreak(context):
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1325, in text_contains_sentbreak
    for t in self._annotate_tokens(self._tokenize_words(text)):
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1460, in _annotate_second_pass
    for t1, t2 in _pair_iter(tokens):
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 310, in _pair_iter
    prev = next(it)
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 577, in _annotate_first_pass
    for aug_tok in tokens:
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 542, in _tokenize_words
    for line in plaintext.split('\n'):
UnicodeDecodeError: 'ascii' codec can't decode byte 0xed in position 6: ordinal not in range(128)

Thanks for your support.

Possible duplicate of [Tokenizing unicode using nltk](http://stackoverflow.com/questions/9228202/tokenizing-unicode-using-nltk) — Moses Koledoye, Jul 27 '16 at 16:04
Or try: `short_pos = open("./positive.txt","rb").read().decode('utf-8')` — Moses Koledoye, Jul 27 '16 at 16:04
You can also read your file in different encodings using `codecs.open()`, cf [docs](https://docs.python.org/2/library/codecs.html#codecs.open) — patrick, Jul 27 '16 at 16:51
Tried both the above suggestion but still getting errors. UnicodeDecodeError: 'utf8' codec can't decode byte 0xf3 in position 4645: invalid continuation byte — Sourav Chatterjee, Jul 27 '16 at 18:08

score 1 · Accepted Answer · answered Jul 27 '16 at 21:55

Looks like this text is encoded in Latin-1. So this works for me:

import codecs
from nltk.tokenize import word_tokenize

# Decode the file as Latin-1 while reading it
with codecs.open("positive.txt", "r", "latin-1") as inputfile:
    text = inputfile.read()

short_pos_words = word_tokenize(text)
print(len(short_pos_words))

You can check for different encodings by, for example, looking at the file in a good editor such as TextWrangler. You can

1) open the file in different encodings to see which one looks right, and

2) look at the character that caused the issue. In your case, that is the byte at position 4645 (0xf3), which belongs to an accented word in a Spanish review. It is outside the ASCII range, so the ASCII codec rejects it, and at that position it does not form a valid UTF-8 sequence either. Read as Latin-1, 0xf3 is ó.
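If no such editor is at hand, the same check can be done in Python. The following is only a minimal sketch: the helper name `diagnose` and the sample bytes are made up for illustration, and the sample stands in for the real contents of positive.txt, which you would read with `open("positive.txt", "rb").read()`.

```python
def diagnose(raw, encodings=("ascii", "utf-8", "latin-1")):
    """Try each candidate encoding on raw bytes.

    Returns a list of (encoding, error_offset) pairs; error_offset is
    None when the bytes decode without error.
    """
    results = []
    for name in encodings:
        try:
            raw.decode(name)
            results.append((name, None))
        except UnicodeDecodeError as err:
            # err.start is the byte offset that the traceback reports
            results.append((name, err.start))
    return results


# Hypothetical sample: a phrase with an accented letter, stored as Latin-1
sample = u"a great pel\u00edcula".encode("latin-1")

for name, offset in diagnose(sample):
    if offset is None:
        print("%s decodes cleanly" % name)
    else:
        # Show the raw bytes around the offending position
        context = sample[max(0, offset - 10):offset + 10]
        print("%s fails at byte %d near %r" % (name, offset, context))
```

Keep in mind that Latin-1 assigns a character to every possible byte, so it never raises a decode error. A clean Latin-1 decode is therefore not proof that the file really is Latin-1; you still need to look at the decoded text and see whether the accented words read correctly.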

@SouravChatterjee Good to hear! In that case, consider accepting an answer as correct. — patrick, Jul 28 '16 at 13:49

score 0 · Answer 2 · answered Jul 28 '16 at 00:31

Your file is encoded using "latin-1".

from nltk.tokenize import word_tokenize
import codecs

# Decode the file as Latin-1 while reading it
with codecs.open("positive.txt", "r", "latin-1") as inputfile:
    text = inputfile.read()

short_pos_words = word_tokenize(text)
print(short_pos_words)
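As a possible alternative to `codecs.open`, `io.open` also takes an `encoding` argument and behaves the same way on Python 2 and Python 3. The sketch below is only an illustration: the helper `read_text` is a made-up name, and a temporary file with invented contents stands in for positive.txt so that the snippet runs on its own.

```python
import io
import os
import tempfile


def read_text(path, encoding="latin-1"):
    """Read a whole file as unicode text using an explicit encoding."""
    # io.open accepts an encoding argument on Python 2 and 3 alike
    with io.open(path, "r", encoding=encoding) as infile:
        return infile.read()


# Throwaway file standing in for positive.txt (contents are invented)
fd, demo_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as out:
    out.write(u"una pel\u00edcula fant\u00e1stica\n".encode("latin-1"))

text = read_text(demo_path)
os.remove(demo_path)
print(repr(text))
```

The returned `text` is already a unicode string, so it can be passed to `word_tokenize` directly, without a separate `.decode()` step.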

RAVI


Yes, that worked. Thanks for your input, much appreciated! – Sourav Chatterjee Jul 28 '16 at 07:16
