I'm trying to parse some pages by using this code:
    import urllib.request
    import requests
    from lxml import etree

    s = requests.session()
    s.headers.update({
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:45.0) Gecko/20100101 Firefox/45.0'
    })
    results = open("res.txt", "w")
    for i in range(510077, 2780673):
        results = open("res.txt", "a")
        print(i)
        url = "url" + str(i) + "&print=true"
        try:
            content = s.get(url).text
            tree = etree.HTML(content)
            a = str(tree.xpath("//*[@class='prob_nums']")[0].text)
            b = etree.tostring(tree.xpath("//*[@class='pbody']")[0])
            c = etree.tostring(tree.xpath("//*[@class='nobreak solution']")[0])
            results.writelines("%s %s %s" % (a, b, c))
            results.close()
        except Exception:
            print("error")
But I have a problem with the output (fragment):
<p class="left_margin">На доске на­пи­са
How can I convert these symbols to normal text? Thank you.
`\xd0\xe5\xf8\xe8\xf2\xe5 \xf1\xe8\xad\x...` Maybe the problem is with using requests (but I need headers anyway; as I found out, lxml.html does not support them, though I'm not sure). `data = s.get(url).text` `content = html.fromstring(data.decode())`
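
For what it's worth, here is a minimal sketch of how such output could be turned back into readable text, using only the standard library. It rests on assumptions: that the "symbols" in `res.txt` are numeric character references (the `escaped` sample string is made up for illustration), and that the page is served as Windows-1251, which is what the quoted byte sequence decodes cleanly as.

```python
import html

# Assumption: the fragment written to res.txt contains numeric character
# references; this sample string is invented for illustration.
escaped = "&#1053;&#1072; &#1076;&#1086;&#1089;&#1082;&#1077;"
readable = html.unescape(escaped)

# The first word of the byte sequence quoted above, decoded as Windows-1251
# (the page encoding is an assumption inferred from those bytes).
raw = b"\xd0\xe5\xf8\xe8\xf2\xe5"
decoded = raw.decode("cp1251")

# Soft hyphens (U+00AD) are invisible hyphenation hints; drop them if unwanted.
cleaned = "на\u00adпи\u00adса".replace("\u00ad", "")

print(readable, decoded, cleaned)
```

In the scraper itself, it may be simpler to avoid the escaping in the first place: `etree.tostring(element, encoding="unicode")` returns a `str` instead of bytes, and opening the file with `open("res.txt", "a", encoding="utf-8")` lets that text be written as-is. If `requests` guesses the wrong charset, setting `response.encoding` before reading `response.text` is one way to override it.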