How to remove namespace value from inside lxml.html.html5parser element tag

Question

Is it possible to stop html5parser from the lxml.html package from adding a namespace to element tags?

Example:

from lxml import html
print(html.parse('http://example.com').getroot().tag)
# You will get 'html'

from lxml.html import html5parser
print(html5parser.parse('http://example.com').getroot().tag)
# You will get '{http://www.w3.org/1999/xhtml}html'

The easiest solution I found is to strip the namespace with a regex, but is it possible to not include it in the first place?

score 2 · Accepted Answer · answered Jan 27 '16 at 03:44

html5lib's HTMLParser takes a boolean namespaceHTMLElements flag that controls this behavior. Pass a parser created with the flag set to False:

from lxml.html import html5parser
from html5lib import HTMLParser

# Hand lxml an html5lib parser that has namespacing turned off
root = html5parser.parse('http://example.com',
                         parser=HTMLParser(namespaceHTMLElements=False))
print(root.tag)  # prints "html"
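
To check the flag without network access, it can be tried on an in-memory string. This is a minimal sketch, assuming html5lib is installed and using its default etree tree builder. The sample markup and variable names are made up for illustration.

```python
import html5lib

# Illustrative markup; any HTML fragment would do.
SAMPLE = '<p>hello</p>'

# Default behaviour: HTML elements are placed in the XHTML namespace.
ns_root = html5lib.HTMLParser().parse(SAMPLE)

# With the flag turned off, tags come back without the namespace prefix.
plain_root = html5lib.HTMLParser(namespaceHTMLElements=False).parse(SAMPLE)

print(ns_root.tag)     # {http://www.w3.org/1999/xhtml}html
print(plain_root.tag)  # html
```

Both calls return the root html element built by html5lib's default tree builder, so the two tags can be compared directly.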

alecxe

As I understand it, this should in principle work with lxml's API too, but see [this question](http://stackoverflow.com/questions/32731479/lxml-html5parser-ignores-namespacehtmlelements-false-option) about that. – gsnedders Jan 28 '16 at 20:14
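
If the flag cannot be used in a given setup, one fallback is the approach mentioned in the question: remove the namespace after parsing. It can be done without a regex by splitting each tag on the closing brace. The sketch below uses only the standard library's xml.etree.ElementTree on a made-up XHTML string; `strip_namespaces` is a hypothetical helper name, not part of lxml or html5lib.

```python
import xml.etree.ElementTree as ET

# Illustrative namespaced document standing in for html5parser output.
XHTML = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    '<head><title>t</title></head><body><p>hi</p></body></html>'
)

def strip_namespaces(root):
    """Remove a leading '{uri}' from every element tag, in place."""
    for el in root.iter():
        # Comments and processing instructions have non-string tags; skip them.
        if isinstance(el.tag, str) and el.tag.startswith('{'):
            el.tag = el.tag.split('}', 1)[1]
    return root

root = ET.fromstring(XHTML)
before = root.tag              # '{http://www.w3.org/1999/xhtml}html'
strip_namespaces(root)
after = root.tag               # 'html'
tags = [el.tag for el in root.iter()]
print(before, after, tags)
```

The same loop should work on lxml elements, since they expose the same `iter()` and `tag` interface, but that is an assumption and is not tested here.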
