I'm trying to get all the text nodes in an HTML document using lxml, but I get a UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 8995: ordinal not in range(128). However, when I check the encoding of the page (encoding = chardet.detect(response)['encoding']), it reports utf-8. It seems weird that a single page would involve both utf-8 and ascii. Actually, this:

fromstring(response).text_content().encode('ascii', 'replace')

solves the problem.

Here is my code:

from lxml.html import fromstring
import urllib2
import chardet
request = urllib2.Request(my_url)
request.add_header('User-Agent',
                   'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0)')   
request.add_header("Accept-Language", "en-us")
response = urllib2.urlopen(request).read()
# Guess the encoding of the raw bytes of the page
encoding = chardet.detect(response)['encoding']

print encoding
print fromstring(response).text_content()

Output:

utf-8
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 8995: ordinal not in range(128)

What can I do to solve this issue? Keep in mind that I want to do this with a few other pages, so I don't want to handle the encoding page by page.

UPDATE:

Maybe there is something else going on here. When I run this script in the terminal, I get the correct output, but when I run it inside SublimeText, I get a UnicodeEncodeError.

UPDATE2:

It's also happening when I write this output to a file. .encode('ascii', 'replace') works, but I'd like a more general solution.

Regards

r_31415
  • Does `print u"\u00A9"` inside your script also produce the error? – jfs Jun 16 '12 at 01:11
  • Yes. UnicodeEncodeError: 'ascii' codec can't encode character u'\xa9' in position 0: ordinal not in range(128) :-) – r_31415 Jun 16 '12 at 01:12
  • you could set PYTHONIOENCODING to whatever character encoding SublimeText accepts. – jfs Jun 16 '12 at 01:24
  • Where do I do that? Is it related to export PYTHONIOENCODING='utf-8'? – r_31415 Jun 16 '12 at 01:36
  • Yes. It is an environment variable, see http://wiki.python.org/moin/PrintFails. Note: the output encoding has nothing to do with the original encoding of the HTML page. – jfs Jun 16 '12 at 03:15
  • See here: http://bit.ly/unipain – Daenyth Jun 16 '12 at 19:53
  • I updated SublimeText just today and I'm not getting this issue. Have you tried this with the latest update (2.0 final)? What platform are you running on? – schlamar Jun 27 '12 at 11:59
  • I just updated but I haven't tested this. I will do that. Thanks for the notice! – r_31415 Jun 27 '12 at 21:26
  • @ms4py I have tested it on Sublime Text 2.0 (final) but unfortunately I still receive "UnicodeEncodeError: 'ascii' codec can't encode character u'\xa9' in position 0: ordinal not in range(128)". – r_31415 Jun 27 '12 at 21:30
  • Tested both in Ubuntu 12.04 and Windows 7. – r_31415 Jun 28 '12 at 05:08
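
A minimal sketch of what the PYTHONIOENCODING suggestion in the comments does, assuming a child interpreter is started with the variable set. The helper name `child_stdout_encoding` is made up for illustration; it only reports what the child sees as `sys.stdout.encoding`, and it does not configure SublimeText itself.

```python
import os
import subprocess
import sys


def child_stdout_encoding(io_encoding):
    """Start a child interpreter with PYTHONIOENCODING set and
    return the value it reports for sys.stdout.encoding."""
    env = dict(os.environ)
    env["PYTHONIOENCODING"] = io_encoding
    output = subprocess.check_output(
        [sys.executable, "-c",
         "import sys; sys.stdout.write(str(sys.stdout.encoding))"],
        env=env,
    )
    return output.decode("ascii").strip()


# stdout is a pipe here, much like an editor's build panel
reported = child_stdout_encoding("utf-8")
print(reported)
```

The child's stdout is a pipe rather than a terminal, which is probably close to how an editor captures script output, so the variable is what decides the encoding in that situation.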

3 Answers

Can you try wrapping your string with repr()? This article might help.

print repr(fromstring(response).text_content())
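
In Python 2, repr() sidesteps the error because it escapes non-ASCII characters, so only ASCII reaches the console. A similar ASCII-safe view, without the surrounding u'...' quoting, can be sketched with the 'backslashreplace' error handler. The `sample` string below is a hypothetical stand-in for the text_content() result.

```python
sample = u"caf\xe9"  # hypothetical stand-in for text_content() output

# Escape anything outside ASCII instead of raising UnicodeEncodeError.
escaped = sample.encode("ascii", "backslashreplace")

# The escaped bytes decode back to pure-ASCII text that any console can show.
printable = escaped.decode("ascii")
print(printable)
```

This is useful for debugging, since no information is lost, but the output is not what a reader would want to see in a final report.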
ChipJust

As for writing the output to a file, as mentioned in your update, I would recommend opening the file with the codecs module so that the text is encoded as UTF-8 on the way out:

import codecs
output_file = codecs.open('filename.txt','w','utf8')

I don't know SublimeText, but it seems to be trying to read your output as ASCII, hence the encoding error.
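
A minimal sketch of the file-writing approach, assuming io.open is used in place of codecs.open (both accept an encoding; io.open is swapped in here because it is also the built-in open in Python 3). The helper names `save_text` and `load_text`, the sample string, and the temporary path are illustrative only.

```python
import io
import os
import tempfile


def save_text(path, text):
    """Write Unicode text to path, encoded as UTF-8."""
    with io.open(path, "w", encoding="utf-8") as output_file:
        output_file.write(text)


def load_text(path):
    """Read the file back, decoding it from UTF-8."""
    with io.open(path, "r", encoding="utf-8") as input_file:
        return input_file.read()


sample = u"caf\xe9"  # hypothetical stand-in for text_content() output
path = os.path.join(tempfile.mkdtemp(), "filename.txt")
save_text(path, sample)
round_trip = load_text(path)
```

Because the encoding is stated when the file is opened, the write no longer depends on whatever encoding the console or editor reports.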

Justin.Wood

Based on your first update, I would say that the terminal told Python to output utf-8, while SublimeText indicated that it expects ASCII (or reported no encoding at all). So I think the solution lies in finding the right settings in SublimeText.

However, if you cannot change what SublimeText expects, it is better to wrap the encode call you already used in a separate function:

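import sys

def smartprint(text):
    # sys.stdout.encoding is None when output is piped or captured;
    # fall back to ASCII, which is what Python 2 would assume anyway.
    encoding = sys.stdout.encoding or 'ascii'
    print text.encode(encoding, 'replace')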

You can use this function instead of print. Keep in mind that your program's output will then differ between SublimeText and the terminal. Because of 'replace', characters that the output encoding cannot represent are replaced with a question mark when this code runs in SublimeText, e.g. é will be shown as ?.
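
To see what the 'replace' handler produces for a given output encoding without depending on the editor, here is a small sketch. The helper `encode_for_stream` and the `FakeStream` class are hypothetical names used only for this illustration, and falling back to ASCII when the stream reports no encoding is an assumption.

```python
def encode_for_stream(text, stream, errors="replace"):
    """Encode Unicode text with the stream's encoding,
    assuming ASCII when the stream does not report one."""
    encoding = getattr(stream, "encoding", None) or "ascii"
    return text.encode(encoding, errors)


class FakeStream(object):
    """Hypothetical stand-in for sys.stdout with a chosen encoding."""

    def __init__(self, encoding):
        self.encoding = encoding


# No encoding reported: the accented character becomes '?'
ascii_bytes = encode_for_stream(u"caf\xe9", FakeStream(None))

# UTF-8 reported: the character is kept as a two-byte sequence
utf8_bytes = encode_for_stream(u"caf\xe9", FakeStream("utf-8"))
```

Passing the stream in as a parameter keeps the helper testable, and the same call works with the real sys.stdout.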

Marco de Wit