4

I've written a script in python in combination with xpath to scrape links from a site with xml content. As I've never worked with xml, so I can't figure out where I'm making mistakes. Thanks in advance to provide me with a workaround. Here is what I'm trying with:

import requests
from lxml import html

response = requests.get("https://drinkup.london/sitemap.xml").text
tree = html.fromstring(response)
for item in tree.xpath('//div[@class="expanded"]//span[@class="text"]'):
    print(item)

xml content within which the links are:

<div xmlns="http://www.w3.org/1999/xhtml" class="collapsible" id="collapsible4"><div class="expanded"><div class="line"><span class="button collapse-button"></span><span class="html-tag">&lt;url&gt;</span></div><div class="collapsible-content"><div class="line"><span class="html-tag">&lt;loc&gt;</span><span class="text">https://drinkup.london/</span><span class="html-tag">&lt;/loc&gt;</span></div></div><div class="line"><span class="html-tag">&lt;/url&gt;</span></div></div><div class="collapsed hidden"><div class="line"><span class="button expand-button"></span><span class="html-tag">&lt;url&gt;</span><span class="text">...</span><span class="html-tag">&lt;/url&gt;</span></div></div></div>

The error thrown upon execution is given below:

    value = etree.fromstring(html, parser, **kw)
  File "src\lxml\lxml.etree.pyx", line 3228, in lxml.etree.fromstring (src\lxml\lxml.etree.c:79593)
  File "src\lxml\parser.pxi", line 1843, in lxml.etree._parseMemoryDocument (src\lxml\lxml.etree.c:119053)
ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.
SIM
  • 21,997
  • 5
  • 37
  • 109
  • You are assigning your `response` variable to the `text` attribute from `requests.get`, which will be a unicode string, hence the error. Use the `content` attribute instead of `text` – peterfields Aug 07 '17 at 20:14

1 Answers1

3

Switch to .content which returns bytes instead of .text which returns unicode:

import requests
from lxml import html


response = requests.get("https://drinkup.london/sitemap.xml").content
tree = html.fromstring(response)
for item in tree.xpath('//url/loc/text()'):
    print(item)

Note the fixed XPath expression as well.

alecxe
  • 462,703
  • 120
  • 1,088
  • 1,195
  • You are just awesome sir alecxe. Whenever I'm in trouble, you are there. It worked like magic. Thanks a zillion. – SIM Aug 07 '17 at 20:21