I have the HTML below, taken from the view-source of a web page:

<a target="_blank" rel="nofollow" href="http://www.facebook.com/014media?utm_source=Thalamus.co&amp;utm_medium=AdVendorPage&amp;utm_content=https://www.thalamus.co/buyers/014-media"><div class="icon--rounded icon"><svg xmlns="https://www.w3.org/2000/svg"><use xlink:href="/sprite.svg#facebook"></use></svg></div>
</a><a target="_blank" rel="nofollow" href="http://www.linkedin.com/company/014-media?utm_source=Thalamus.co&amp;utm_medium=AdVendorPage&amp;utm_content=https://www.thalamus.co/buyers/014-media"><div class="icon--rounded icon"><svg xmlns="https://www.w3.org/2000/svg"><use xlink:href="/sprite.svg#linkedin"></use></svg></div>
</a><a target="_blank" rel="nofollow" href="http://www.youtube.com/014media?utm_source=Thalamus.co&amp;utm_medium=AdVendorPage&amp;utm_content=https://www.thalamus.co/buyers/014-media"><div class="icon--rounded icon"><svg xmlns="https://www.w3.org/2000/svg"><use xlink:href="/sprite.svg#youtube"></use></svg></div>
</a><a target="_blank" rel="nofollow" href="http://www.twitter.com/014media?utm_source=Thalamus.co&amp;utm_medium=AdVendorPage&amp;utm_content=https://www.thalamus.co/buyers/014-media"><div class="icon--rounded icon"><svg xmlns="https://www.w3.org/2000/svg"><use xlink:href="/sprite.svg#twitter"></use></svg></div>
</a><a target="_blank" rel="nofollow" href="http://www.014media.com?utm_source=Thalamus.co&amp;utm_medium=AdVendorPage&amp;utm_content=https://www.thalamus.co/buyers/014-media"><div class="icon--rounded icon"><svg xmlns="https://www.w3.org/2000/svg"><use xlink:href="/sprite.svg#website"></use></svg></div>
</a>

Using the XPath expression below, I am trying to extract the LinkedIn URL, but I can't get it to work.

from lxml import html, etree

asd = """<a target="_blank" rel="nofollow" href="http://www.facebook.com/014media?utm_source=Thalamus.co&amp;utm_medium=AdVendorPage&amp;utm_content=https://www.thalamus.co/buyers/014-media"><div class="icon--rounded icon"><svg xmlns="https://www.w3.org/2000/svg"><use xlink:href="/sprite.svg#facebook"></use></svg></div>
</a><a target="_blank" rel="nofollow" href="http://www.linkedin.com/company/014-media?utm_source=Thalamus.co&amp;utm_medium=AdVendorPage&amp;utm_content=https://www.thalamus.co/buyers/014-media"><div class="icon--rounded icon"><svg xmlns="https://www.w3.org/2000/svg"><use xlink:href="/sprite.svg#linkedin"></use></svg></div>
</a><a target="_blank" rel="nofollow" href="http://www.youtube.com/014media?utm_source=Thalamus.co&amp;utm_medium=AdVendorPage&amp;utm_content=https://www.thalamus.co/buyers/014-media"><div class="icon--rounded icon"><svg xmlns="https://www.w3.org/2000/svg"><use xlink:href="/sprite.svg#youtube"></use></svg></div>
</a><a target="_blank" rel="nofollow" href="http://www.twitter.com/014media?utm_source=Thalamus.co&amp;utm_medium=AdVendorPage&amp;utm_content=https://www.thalamus.co/buyers/014-media"><div class="icon--rounded icon"><svg xmlns="https://www.w3.org/2000/svg"><use xlink:href="/sprite.svg#twitter"></use></svg></div>
</a><a target="_blank" rel="nofollow" href="http://www.014media.com?utm_source=Thalamus.co&amp;utm_medium=AdVendorPage&amp;utm_content=https://www.thalamus.co/buyers/014-media"><div class="icon--rounded icon"><svg xmlns="https://www.w3.org/2000/svg"><use xlink:href="/sprite.svg#website"></use></svg></div>
</a>"""

html.fromstring(asd.replace("xlink:href","xlinkhref")).xpath('(//a//div//svg//use[contains(@xlinkhref,"linkedin")])//@href')

The output is:

[]

Because of lxml.etree.XPathEvalError: Undefined namespace prefix errors, I had to replace the ":" in the attribute name, but I still can't see what I am doing wrong. Any suggestions are highly appreciated.

Using re I am able to extract what I need, but I still couldn't find a solution with lxml:

[each.split('"')[0] for each in re.findall('<a target="_blank" rel="nofollow" href="(.+?)</a>',asd,re.DOTALL) if '/sprite.svg#linkedin' in each][0].split('?')[0]
Shekhar Samanta
  • Read https://lxml.de/xpathxslt.html#namespaces-and-prefixes – Tomalak Aug 24 '18 at 15:18
  • @Tomalak, I tried this: html.fromstring(asd).xpath('(//a/div/svg/use[@xlink:href="/sprite.svg#linkedin"])/@href',namespaces={'xlink':'https://www.w3.org/2000/svg'}) but it doesn't work – Shekhar Samanta Aug 24 '18 at 15:26

1 Answer

I've never really used lxml's html module, only etree. html seems to treat namespaces differently than etree, most likely because it parses the input as HTML, and HTML parsing does not process XML namespaces.

In your sample data, the namespace prefix xlink isn't bound to a namespace URI. Even if I add the declaration to bind it (xmlns:xlink="http://www.w3.org/1999/xlink"), it doesn't seem to work the same as etree (adding the "namespaces" dict arg to xpath()).

Another example is the use element. It's in the default namespace https://www.w3.org/2000/svg but if I add namespaces={"svg": "https://www.w3.org/2000/svg"} and use the prefix in the xpath (svg:use) it doesn't select anything. It only works if I use use without a prefix.
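A minimal sketch of that difference, assuming a shortened two-link fragment (the example.com URLs and the variable names are made up for illustration, not taken from the real page). It compares an unprefixed use step with a prefixed svg:use step on a tree built by html.fromstring:

```python
from lxml import html

# Hypothetical fragment, shortened from the sample data.
snippet = (
    '<a href="http://example.com/fb"><svg xmlns="https://www.w3.org/2000/svg">'
    '<use xlink:href="/sprite.svg#facebook"></use></svg></a>'
    '<a href="http://example.com/li"><svg xmlns="https://www.w3.org/2000/svg">'
    '<use xlink:href="/sprite.svg#linkedin"></use></svg></a>'
)

tree = html.fromstring(snippet)
svg_ns = {"svg": "https://www.w3.org/2000/svg"}

# The elements carry no namespace in the html tree, so the plain name matches...
unprefixed = tree.xpath("//use")
# ...while the namespace-qualified name matches nothing.
prefixed = tree.xpath("//svg:use", namespaces=svg_ns)

print(len(unprefixed), len(prefixed))
print([el.tag for el in unprefixed])
```

The tag names printed at the end are plain strings without a {namespace} part, which is consistent with the xmlns declaration being kept as an ordinary attribute rather than applied to the element.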

If your actual data is well-formed, including binding the xlink prefix, you can use etree and map the prefixes.

If not, you'll have to stick to html and use some local-name() trickery. (Something else that's weird is that html includes the prefix in the local name so you have to match xlink:href instead of just href.)
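To see why the local-name() test has to include the prefix, here is a small sketch that inspects how the attribute is stored on a tree built by html.fromstring. The one-link fragment and the example.com URL are assumptions for illustration only:

```python
from lxml import html

# Hypothetical one-link fragment with an unbound xlink prefix.
snippet = (
    '<a href="http://example.com/li"><svg>'
    '<use xlink:href="/sprite.svg#linkedin"></use></svg></a>'
)

tree = html.fromstring(snippet)
use = tree.xpath("//use")[0]

# The attribute name is stored with the colon, as one opaque name.
attr_names = list(use.attrib.keys())
print(attr_names)

# Matching on the full "xlink:href" name finds the attribute...
by_local_name = tree.xpath('//use/@*[local-name()="xlink:href"]')
# ...matching on "href" alone does not.
by_plain_href = tree.xpath('//use/@*[local-name()="href"]')
print(by_local_name, by_plain_href)
```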

Here's an example of both...

from lxml import html, etree

# --------------------- TEST USING html --------------------------------------------------------------------------------

# The xlink namespace prefix is not bound to a namespace uri so this is not namespace well-formed.
asd = """<a target="_blank" rel="nofollow" href="http://www.facebook.com/014media?utm_source=Thalamus.co&amp;utm_medium=AdVendorPage&amp;utm_content=https://www.thalamus.co/buyers/014-media"><div class="icon--rounded icon"><svg xmlns="https://www.w3.org/2000/svg"><use xlink:href="/sprite.svg#facebook"></use></svg></div>
</a><a target="_blank" rel="nofollow" href="http://www.linkedin.com/company/014-media?utm_source=Thalamus.co&amp;utm_medium=AdVendorPage&amp;utm_content=https://www.thalamus.co/buyers/014-media"><div class="icon--rounded icon"><svg xmlns="https://www.w3.org/2000/svg"><use xlink:href="/sprite.svg#linkedin"></use></svg></div>
</a><a target="_blank" rel="nofollow" href="http://www.youtube.com/014media?utm_source=Thalamus.co&amp;utm_medium=AdVendorPage&amp;utm_content=https://www.thalamus.co/buyers/014-media"><div class="icon--rounded icon"><svg xmlns="https://www.w3.org/2000/svg"><use xlink:href="/sprite.svg#youtube"></use></svg></div>
</a><a target="_blank" rel="nofollow" href="http://www.twitter.com/014media?utm_source=Thalamus.co&amp;utm_medium=AdVendorPage&amp;utm_content=https://www.thalamus.co/buyers/014-media"><div class="icon--rounded icon"><svg xmlns="https://www.w3.org/2000/svg"><use xlink:href="/sprite.svg#twitter"></use></svg></div>
</a><a target="_blank" rel="nofollow" href="http://www.014media.com?utm_source=Thalamus.co&amp;utm_medium=AdVendorPage&amp;utm_content=https://www.thalamus.co/buyers/014-media"><div class="icon--rounded icon"><svg xmlns="https://www.w3.org/2000/svg"><use xlink:href="/sprite.svg#website"></use></svg></div>
</a>"""

href = html.fromstring(asd).xpath('//a[.//use/@*[local-name()="xlink:href"][contains(.,"linkedin")]]/@href')[0]
print(f"Results using html:  {href}")

# --------------------- TEST USING etree -------------------------------------------------------------------------------

# Modified to include binding of xlink namespace prefix to a namespace uri to make it well formed.
asd2 = """<html xmlns:xlink="http://www.w3.org/1999/xlink">
<a target="_blank" rel="nofollow" href="http://www.facebook.com/014media?utm_source=Thalamus.co&amp;utm_medium=AdVendorPage&amp;utm_content=https://www.thalamus.co/buyers/014-media"><div class="icon--rounded icon"><svg xmlns="https://www.w3.org/2000/svg"><use xlink:href="/sprite.svg#facebook"></use></svg></div>
</a><a target="_blank" rel="nofollow" href="http://www.linkedin.com/company/014-media?utm_source=Thalamus.co&amp;utm_medium=AdVendorPage&amp;utm_content=https://www.thalamus.co/buyers/014-media"><div class="icon--rounded icon"><svg xmlns="https://www.w3.org/2000/svg"><use xlink:href="/sprite.svg#linkedin"></use></svg></div>
</a><a target="_blank" rel="nofollow" href="http://www.youtube.com/014media?utm_source=Thalamus.co&amp;utm_medium=AdVendorPage&amp;utm_content=https://www.thalamus.co/buyers/014-media"><div class="icon--rounded icon"><svg xmlns="https://www.w3.org/2000/svg"><use xlink:href="/sprite.svg#youtube"></use></svg></div>
</a><a target="_blank" rel="nofollow" href="http://www.twitter.com/014media?utm_source=Thalamus.co&amp;utm_medium=AdVendorPage&amp;utm_content=https://www.thalamus.co/buyers/014-media"><div class="icon--rounded icon"><svg xmlns="https://www.w3.org/2000/svg"><use xlink:href="/sprite.svg#twitter"></use></svg></div>
</a><a target="_blank" rel="nofollow" href="http://www.014media.com?utm_source=Thalamus.co&amp;utm_medium=AdVendorPage&amp;utm_content=https://www.thalamus.co/buyers/014-media"><div class="icon--rounded icon"><svg xmlns="https://www.w3.org/2000/svg"><use xlink:href="/sprite.svg#website"></use></svg></div>
</a>
</html>"""

namespaces = {"svg": "https://www.w3.org/2000/svg", "xlink": "http://www.w3.org/1999/xlink"}
href2 = etree.fromstring(asd2).xpath('//a[.//svg:use[contains(@xlink:href,"linkedin")]]/@href', namespaces=namespaces)[0]
print(f"Results using etree: {href2}")

This outputs the following...

Results using html:  http://www.linkedin.com/company/014-media?utm_source=Thalamus.co&utm_medium=AdVendorPage&utm_content=https://www.thalamus.co/buyers/014-media
Results using etree: http://www.linkedin.com/company/014-media?utm_source=Thalamus.co&utm_medium=AdVendorPage&utm_content=https://www.thalamus.co/buyers/014-media
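As a side note on the original attempt from the question: replacing "xlink:href" with "xlinkhref" does get past the prefix error, but the expression then looks for @href at or below the use element, while the href sits on the enclosing a. A hedged sketch of that variant, using a shortened two-link fragment (an assumption, trimmed from the sample data) and the ancestor axis to climb back up; it also strips the query string the way the regex version in the question does:

```python
from urllib.parse import urlsplit

from lxml import html

# Shortened fragment based on the sample data (query strings trimmed).
snippet = (
    '<a href="http://www.facebook.com/014media?utm_source=Thalamus.co">'
    '<div class="icon"><svg><use xlink:href="/sprite.svg#facebook"></use></svg></div></a>'
    '<a href="http://www.linkedin.com/company/014-media?utm_source=Thalamus.co">'
    '<div class="icon"><svg><use xlink:href="/sprite.svg#linkedin"></use></svg></div></a>'
)

tree = html.fromstring(snippet.replace("xlink:href", "xlinkhref"))

# Shape of the original query: @href is searched at or below <use>, where none exists.
original = tree.xpath('(//a//div//svg//use[contains(@xlinkhref,"linkedin")])//@href')

# Climb from the matching <use> to its enclosing <a> before reading @href.
fixed = tree.xpath('//use[contains(@xlinkhref,"linkedin")]/ancestor::a/@href')

# Drop the tracking parameters, keeping only scheme, host, and path.
base_url = urlsplit(str(fixed[0]))._replace(query="").geturl()
print(original, base_url)
```

The string replacement is a workaround rather than a namespace-aware solution, so it only makes sense when the markup cannot be parsed as well-formed XML.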
Daniel Haley