UnicodeDecodeError in textblob tutorial

Question

I'm trying to run through the TextBlob tutorial in Windows (using Git Bash shell) with Python 3.3.

I've installed textblob and nltk as well as any dependencies.

The Python code is:

from text.blob import TextBlob

wiki = TextBlob("Python is a high-level, general-purpose programming language.")
tags = wiki.tags

I'm getting the following error:

Traceback (most recent call last):
File "textblob.py", line 4, in <module> 
  tags = wiki.tags
File "c:\Python33\lib\site-packages\text\decorators.py", line 18, in __get__ 
  value = obj.__dict__[self.func.__name__] = self.func(obj)
File "c:\Python33\lib\site-packages\text\blob.py", line 357, in pos_tags 
  for word, t in self.pos_tagger.tag(self.raw)
File "c:\Python33\lib\site-packages\text\taggers.py", line 40, in tag
  return pattern_tag(sentence, tokenize)
File "c:\Python33\lib\site-packages\text\en.py", line 115, in tag
  for sentence in parse(s, tokenize, True, False, False, False, encoding).split():
File "c:\Python33\lib\site-packages\text\en.py", line 99, in parse
  return parser.parse(unicode(s), *args, **kwargs)
File "c:\Python33\lib\site-packages\text\text.py", line 1213, in parse
  s[i] = self.find_tags(s[i], **kwargs)
File "c:\Python33\lib\site-packages\text\en.py", line 49, in find_tags
  return _Parser.find_tags(self, tokens, **kwargs)
File "c:\Python33\lib\site-packages\text\text.py", line 1161, in find_tags
  map = kwargs.get(     "map", None))
File "c:\Python33\lib\site-packages\text\text.py", line 967, in find_tags
  tagged.append([token, lexicon.get(token, i==0 and lexicon.get(token.lower()) or   None)])
File "c:\Python33\lib\site-packages\text\text.py", line 98, in get
  return self._lazy("get", *args)
File "c:\Python33\lib\site-packages\text\text.py", line 79, in _lazy
  self.load()
File "c:\Python33\lib\site-packages\text\text.py", line 367, in load
  dict.update(self, (x.split(" ")[:2] for x in _read(self._path) if x.strip()))
File "c:\Python33\lib\site-packages\text\text.py", line 367, in <genexpr>
  dict.update(self, (x.split(" ")[:2] for x in _read(self._path) if x.strip()))
File "c:\Python33\lib\site-packages\text\text.py", line 346, in _read
  for line in f:
File "c:\Python33\lib\encodings\cp1252.py", line 23, in decode
  return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 16: character maps to <undefined>

Any idea what is wrong here? Adding a 'u' before the string didn't help.

I ran through the tutorial quickly and it worked OK on my OS X machine using Python 3.3. Do you perhaps have an old version of TextBlob? It looks like a similar problem was just fixed and released: https://github.com/sloria/TextBlob/issues/15 — Rachel Sanders, Sep 24 '13 at 17:25
No luck. I'm using 0.6.3 which I believe is the latest. I did a pip --force-reinstall and did notice a libyaml error when installing pyyaml. The install did continue though so I'm not sure this is a serious issue. — sgoldber, Sep 24 '13 at 17:49
In continuing to mess around with this, I ran through a brief tutorial from the front page of the [nltk site](http://nltk.org/) and hit a very similar error. Cloning from the master repo on github solved the problem though. Maybe I need to try something similar with textblob. — sgoldber, Sep 24 '13 at 18:24
Do you get an error when you try `sentence = "Python is a high-level, general-purpose programming language.".encode('utf8')`? — alvas, Sep 25 '13 at 10:51
I think the problem is in `text/_text.py` line 339 (on TextBlob 0.7.0). `en-lexicon.txt`, `en-context.txt`, and `en-entities.txt` are being opened without specifying the encoding of the files, so the platform default is being used (apparently `cp1252` in your case). I will have to look into what encoding those text files use and open the files correctly. Github issue [here](https://github.com/sloria/TextBlob/issues/30). Thanks for reporting this. In the meantime, you can try using the `NLTKTagger`. Instructions [here](https://textblob.readthedocs.org/en/latest/advanced_usage.html#pos-taggers) — Steve L, Sep 26 '13 at 05:16
Ok, made some changes to the encoding handling. Can you try installing the development version and see if you get the same error? `pip install -U git+https://github.com/sloria/TextBlob.git@dev` — Steve L, Sep 26 '13 at 19:13
With the new install from Steve L I get the same error but it has changed from stating "position 16" to "position 19". — sgoldber, Sep 27 '13 at 19:58
OK. I've made another change so that the txt files are opened in utf8 mode. Can you try again? Also, let's continue this discussion here: https://github.com/sloria/TextBlob/issues/30 — Steve L, Sep 28 '13 at 16:41
Just to be complete here. The last change that Steve L made fixes the issue. — sgoldber, Sep 30 '13 at 16:59

score 3 · Accepted Answer · answered Sep 30 '13 at 20:30

Release 0.7.1 fixes this issue, so upgrade with:

$ pip install -U textblob

The problem was that the en-lexicon.txt file, which is used for part-of-speech tagging, was being opened with Windows' default platform encoding, cp1252. The file contains bytes (such as the 0x9d in the traceback) that have no mapping in cp1252, so Python could not decode it. The fix was to open the file explicitly as UTF-8.
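To see the failure mode in isolation, here is a minimal sketch that does not use TextBlob at all. It assumes a hypothetical one-line "lexicon" containing a right double quotation mark (U+201D), whose UTF-8 encoding ends in the byte 0x9d. The actual contents of en-lexicon.txt are not shown in this thread, so the sample line and the `read_lines` helper are illustrative only.

```python
import locale
import os
import tempfile

# When open() is called in text mode without an encoding, Python 3 typically
# falls back to this platform default (cp1252 on many Windows setups).
print("platform default:", locale.getpreferredencoding(False))

# Hypothetical lexicon line: U+201D is encoded as the bytes E2 80 9D in UTF-8.
sample = "\u201d NN\n"

# Write the sample as raw UTF-8 bytes to a temporary file.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as f:
    f.write(sample.encode("utf-8"))
    path = f.name


def read_lines(path, encoding):
    """Read every line of the file using an explicit encoding."""
    with open(path, encoding=encoding) as f:
        return [line for line in f]


# Decoding as cp1252 fails because byte 0x9d has no mapping in that code page.
try:
    read_lines(path, "cp1252")
    cp1252_failed = False
except UnicodeDecodeError as exc:
    cp1252_failed = True
    print("cp1252:", exc)

# Decoding as UTF-8 succeeds on every platform.
utf8_lines = read_lines(path, "utf-8")
print("utf-8:", utf8_lines)

os.remove(path)
```

Passing `encoding="utf-8"` explicitly makes the result independent of the machine's locale, which would explain why the same tutorial ran without error on OS X but failed on Windows.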
