I'm trying to parse some pages by using this code:

import requests
from lxml import etree

s = requests.session()
s.headers.update({
    # The User-Agent is one string, split across two adjacent literals
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:45.0) '
                  'Gecko/20100101 Firefox/45.0'
})

# Open the output file once instead of reopening it on every iteration
with open("res.txt", "w") as results:
    for i in range(510077, 2780673):
        print(i)
        url = "url" + str(i) + "&print=true"
        try:
            content = s.get(url).text
            tree = etree.HTML(content)
            a = str(tree.xpath("//*[@class='prob_nums']")[0].text)
            # etree.tostring() returns ASCII bytes by default, so non-ASCII
            # characters come out as numeric character references (&#1053; ...)
            b = etree.tostring(tree.xpath("//*[@class='pbody']")[0])
            c = etree.tostring(tree.xpath("//*[@class='nobreak solution']")[0])
            results.writelines("%s    %s    %s" % (a, b, c))
        except Exception:
            print("error")

But I have a problem with the output (fragment):

 <p class="left_margin">&#1053;&#1072; &#1076;&#1086;&#1089;&#1082;&#1077; &#1085;&#1072;&#173;&#1087;&#1080;&#173;&#1089;&#1072;

How can I convert these symbols to normal text? Thank you.

  • you may use `except Exception as e:` `print(e)` instead, the error message can help – PRMoureu Jul 05 '17 at 18:02
  • Try to print the exception message clearly https://stackoverflow.com/questions/1715198/exception-message-python-2-6 – Pavan Nath Jul 05 '17 at 18:06
  • But there is no error. It is part of the normal output, without any errors. It is just an encoding problem that I don't know how to solve. – Nikita Gushchin Jul 05 '17 at 18:06
  • You are printing the XML using tostring(), so that is what you get, i.e. XML. Maybe what you need is to see the decoded (unescaped) version? Have you tried reading the docs for xml.sax.saxutils escape() and unescape()? – DisappointedByUnaccountableMod Jul 05 '17 at 18:08
  • Mmm... thanks. Using html.parser.unescape helped. – Nikita Gushchin Jul 05 '17 at 18:21
  • Does it work if you use `lxml.html` instead of `lxml.etree`? See this question: https://stackoverflow.com/q/19163082/407651. – mzjn Jul 05 '17 at 20:14
  • I tried that, but I ran into trouble with bytes/str conversion, and it did not solve the problem. `question = html.tostring(content.xpath("//*[@class='pbody']")[0], method='html', encoding='utf-8')` output: `\xd0\xe5\xf8\xe8\xf2\xe5 \xf1\xe8\xad\x...` The problem may be with using requests (but I need the headers anyway; as far as I found out, lxml.html does not support them, but I'm not sure). `data = s.get(url).text` `content = html.fromstring(data.decode())` – Nikita Gushchin Jul 06 '17 at 16:25
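
For reference, the unescaping approach mentioned in the comments can be sketched with the standard library's `html.unescape`. This is only a minimal illustration, not a confirmed fix for the site in question: the `raw` sample below is a hypothetical stand-in for what `etree.tostring()` returns, and `to_readable` is a helper name made up for this example.

```python
import html

# Hypothetical sample: the kind of bytes etree.tostring() returns by default,
# with non-ASCII characters written as numeric character references.
raw = b'<p class="left_margin">&#1053;&#1072; &#1076;&#1086;&#1089;&#1082;&#1077;</p>'


def to_readable(serialized):
    """Decode tostring() output (bytes or str) and resolve character references."""
    if isinstance(serialized, bytes):
        # The default tostring() output is plain ASCII, which is valid UTF-8.
        serialized = serialized.decode("utf-8")
    return html.unescape(serialized)


text = to_readable(raw)
print(text)
```

Note that `html.unescape` also resolves references such as `&lt;` and `&amp;`, so the result may no longer be well-formed markup. If the markup itself should be kept, passing `encoding='unicode'` to `etree.tostring()` returns a `str` with the characters left unescaped, which avoids the separate unescaping step. In either case the output file probably needs to be opened with an explicit encoding, for example `open("res.txt", "w", encoding="utf-8")`.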
