Python + lxml + etree Encoding issue

Question

I'm trying to parse some pages by using this code:

import requests
from lxml import etree

# Reuse one session so the custom User-Agent is sent with every request.
s = requests.session()
s.headers.update({
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:45.0) '
                  'Gecko/20100101 Firefox/45.0'
})

# Start with an empty output file, then append one record per page.
open("res.txt", "w").close()
for i in range(510077, 2780673):
    print(i)
    url = "url" + str(i) + "&print=true"
    try:
        content = s.get(url).text
        tree = etree.HTML(content)
        a = str(tree.xpath("//*[@class='prob_nums']")[0].text)
        # Serialize the matched elements back to markup.
        b = etree.tostring(tree.xpath("//*[@class='pbody']")[0])
        c = etree.tostring(tree.xpath("//*[@class='nobreak solution']")[0])
        with open("res.txt", "a") as results:
            results.write("%s    %s    %s" % (a, b, c))
    except Exception:
        print("error")

But I have a problem with the output (fragment):

 <p class="left_margin">&#1053;&#1072; &#1076;&#1086;&#1089;&#1082;&#1077; &#1085;&#1072;&#173;&#1087;&#1080;&#173;&#1089;&#1072;

How can I convert these symbols to normal text? Thank you.
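
For the fragment above, here is a minimal sketch of one way to turn the numeric character references back into readable text, using only the standard library's `html.unescape`. The sample string is copied from the fragment, and the variable names are illustrative. This post-processes text that has already been escaped. As an alternative at the source, lxml's `etree.tostring` accepts `encoding='unicode'`, which returns a `str` instead of ASCII bytes and so leaves non-ASCII characters unescaped.

```python
import html

# Sample copied from the output fragment in the question.
fragment = (
    "&#1053;&#1072; &#1076;&#1086;&#1089;&#1082;&#1077; "
    "&#1085;&#1072;&#173;&#1087;&#1080;&#173;&#1089;&#1072;"
)

# html.unescape converts numeric character references (&#NNNN;) to characters.
text = html.unescape(fragment)

# &#173; is a soft hyphen (U+00AD); remove it if only the plain words are wanted.
clean = text.replace("\u00ad", "")
print(clean)
```

The soft hyphens are invisible in most viewers, so stripping them is optional and depends on how the text will be used.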

You may use `except Exception as e:` with `print(e)` instead; the error message can help — PRMoureu, Jul 05 '17 at 18:02
Try to print the exception message clearly https://stackoverflow.com/questions/1715198/exception-message-python-2-6 — Pavan Nath, Jul 05 '17 at 18:06
But there is no error. This is part of the normal output, without any errors. It is just an encoding problem that I don't know how to solve — Nikita Gushchin, Jul 05 '17 at 18:06
You are printing the XML using tostring(), so that is what you get, i.e. XML. Maybe what you need is to see the decoded (unescaped) version? Have you tried reading the docs for xml.sax.saxutils escape() and unescape()? — DisappointedByUnaccountableMod, Jul 05 '17 at 18:08
Does it work if you use `lxml.html` instead of `lxml.etree`? See this question: https://stackoverflow.com/q/19163082/407651. — mzjn, Jul 05 '17 at 20:14
I tried. But I ran into trouble with bytes/str conversion, and it did not solve the problem. `question = html.tostring(content.xpath("//*[@class='pbody']")[0], method='html', encoding='utf-8')` output: `\xd0\xe5\xf8\xe8\xf2\xe5 \xf1\xe8\xad\x...` Maybe the problem is with using requests (but I need the headers anyway; as far as I found out, lxml.html does not support them, but I'm not sure). `data = s.get(url).text` `content = html.fromstring(data.decode())` — Nikita Gushchin, Jul 06 '17 at 16:25
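
The escaped bytes quoted in the last comment happen to decode cleanly as Windows-1251 (`cp1251`). That suggests, without proving it, that the page is served in that encoding and was never decoded with the right codec. A small sketch using only the bytes shown in the comment follows; the codec choice is an assumption that the thread does not confirm.

```python
# Leading bytes quoted in the comment above (the remainder is elided there).
raw = b"\xd0\xe5\xf8\xe8\xf2\xe5 \xf1\xe8\xad"

# Assumption: the page is encoded as Windows-1251; the thread does not confirm this.
decoded = raw.decode("cp1251")

# The last byte (0xAD) is a soft hyphen; drop it for display.
print(decoded.replace("\u00ad", ""))
```

If that assumption holds, one thing to try with requests is setting `response.encoding = 'cp1251'` before reading `response.text`. Another is to pass `response.content` (bytes) to the parser so it can use the page's own charset declaration. Note also that `s.get(url).text` is already a `str` in Python 3, so calling `.decode()` on it raises an `AttributeError`.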

0 Answers