Is it possible to stop html5parser from the lxml.html package from adding a namespace to the tag?

Example:

from lxml import html
print(html.parse('http://example.com').getroot().tag)
# You will get 'html'

from lxml.html import html5parser
print(html5parser.parse('http://example.com').getroot().tag)
# You will get '{http://www.w3.org/1999/xhtml}html'

The easiest solution I found is to strip the namespace with a regex, but is it possible to avoid adding it in the first place?
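
For reference, the regex workaround mentioned above could be sketched roughly as follows. This is only an illustration under stated assumptions: it uses the standard library's `xml.etree.ElementTree` and a hard-coded XHTML string as a stand-in for the fetched page, and `NS_PREFIX`, `strip_namespaces`, and `SAMPLE` are made-up names, not part of lxml or html5lib.

```python
import re
import xml.etree.ElementTree as ET

# Matches a leading "{namespace-uri}" prefix in an ElementTree-style tag name.
NS_PREFIX = re.compile(r'^\{[^}]*\}')


def strip_namespaces(root):
    """Remove the "{uri}" prefix from every element tag, in place."""
    for element in root.iter():
        # Comments and processing instructions have a non-string tag; skip them.
        if isinstance(element.tag, str):
            element.tag = NS_PREFIX.sub('', element.tag)
    return root


# Hypothetical stand-in for the parsed page: a tiny XHTML document
# that carries the same namespace as the html5parser output.
SAMPLE = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    '<head><title>Example</title></head>'
    '<body><p>Hello</p></body></html>'
)

root = ET.fromstring(SAMPLE)
print(root.tag)                    # namespaced tag before stripping
print(strip_namespaces(root).tag)  # plain tag after stripping
```

lxml elements expose the same `tag` attribute and `iter()` method, so the same loop should carry over to a tree returned by html5parser, though that is not tested here.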

Renat

1 Answer

html5lib's HTMLParser accepts a boolean namespaceHTMLElements flag that controls this behavior:

from lxml.html import html5parser
from html5lib import HTMLParser

root = html5parser.parse('http://example.com',
                         parser=HTMLParser(namespaceHTMLElements=False))
print(root.tag)  # prints "html"
alecxe
  • As I understand it, this should in principle work with lxml's API too, but see [this question](http://stackoverflow.com/questions/32731479/lxml-html5parser-ignores-namespacehtmlelements-false-option) about that. – gsnedders Jan 28 '16 at 20:14