3

I got a strange bug with lxml:

>>> import lxml.html
>>> s = '<html><head><noscript></noscript><script></script><meta></head></html>'
>>> root = lxml.html.fromstring(s)
>>> root.xpath('/html/head/meta')
[]
>>> root.xpath('/html/body/meta')
[<Element meta at 0x2a92788>]

The meta tag should be in the head element, not the body. How can I get the correct element in this situation?

2 Answers

2

Let me guess: are you using an old version of Ubuntu (like 12.04)? This is actually a bug in the old preinstalled libxml2 library that the lxml package uses. The libxml2 release notes for version 2.8.0 mention a fix for an HTML parser error with <noscript> in the <head>, so I guess libxml2 >= 2.8.0 should work. Ubuntu 12.04 has version 2.7.8 installed. You can check which libxml2 versions lxml was compiled against and is running with:

>>> import lxml.etree
>>> lxml.etree.LIBXML_COMPILED_VERSION
(2, 7, 8)
>>> lxml.etree.LIBXML_VERSION
(2, 9, 1)

I think if either of these versions is >= 2.8.0, the <noscript> issue should be gone.
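The same check can be scripted rather than read off the REPL. This is only a rough sketch: `has_noscript_fix` and `FIXED_IN` are names made up for illustration, and the 2.8.0 threshold is simply the version guessed from the release notes above, not something verified against libxml2 itself.

```python
# Hypothetical threshold, taken from the release-notes guess above.
FIXED_IN = (2, 8, 0)


def has_noscript_fix(version):
    """Return True if a libxml2 version tuple is at least FIXED_IN."""
    # Tuples compare element by element, so (2, 7, 8) < (2, 8, 0).
    return tuple(version) >= FIXED_IN


try:
    from lxml import etree
except ImportError:
    etree = None  # lxml is not installed; only the helper is usable

if etree is not None:
    # LIBXML_VERSION is the libxml2 that lxml is running against;
    # LIBXML_COMPILED_VERSION is the one it was built against.
    for label, version in (
        ("runtime", etree.LIBXML_VERSION),
        ("compiled", etree.LIBXML_COMPILED_VERSION),
    ):
        print(label, version, has_noscript_fix(version))
```

If both lines report `False`, upgrading the system libxml2 (or installing an lxml build that bundles a newer one) is probably the next thing to try.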

Palasaty
1

This works for me:

import lxml.html

s = '<html><head><noscript></noscript><script></script><meta></head></html>'
root = lxml.html.fromstring(s)
print(root.xpath('/html/head/meta'))
print(root.xpath('/html/body/meta'))

Output:

[<Element meta at 0x10a123b8>]
[]

I'm using Python 2.7.9 and lxml version 3.4.2.

anthony sottile
gtlambert
  • But I got this error: `AttributeError: 'module' object has no attribute 'html'` –  Sep 07 '15 at 11:35
  • Check which lxml version you are using with `from lxml import etree`, `etree.LXML_VERSION`. I'm struggling to replicate your problem – gtlambert Sep 07 '15 at 11:36
  • I'm using Python 2.7.3 and LXML 3.4.4 –  Sep 07 '15 at 11:38
  • I'll try to upgrade Python later and re-check again. Thanks! –  Sep 07 '15 at 11:38
  • @Kid, re your **AttributeError**, try importing the submodule explicitly: `from lxml import html`, then change `lxml.html.fromstring` to `html.fromstring`. – Anzel Sep 07 '15 at 12:47
  • @Anzel: Thanks! But I think the problem is with the Python version. I upgraded Python to 2.7.6, but no luck. –  Sep 07 '15 at 12:57