lxml parser removing closing tag when parsing html

Asked Feb 06 '20 at 16:44

Active Feb 07 '20 at 08:30

Viewed 231 times

I have the below HTML content:

<html>

<body>
    <div>
        <p><img class="img.jpg" /></p>
    </div>
</body>

</html>

I am trying to parse this HTML with lxml's HTML parser as below:

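import lxml.html as LH

# html is a str holding the markup shown above
root = LH.fromstring(html)
for el in root.iter('img'):
    # copy the class value into a new src attribute
    el.attrib['src'] = el.attrib['class']
# root is already the <html> element, so no extra wrapper is needed;
# encoding='unicode' makes tostring() return str instead of bytes
content = LH.tostring(root, encoding='unicode')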

I am getting the content after parsing as below:

<html>

<body>
    <div>
        <p><img class="img.jpg" src="img.jpg"></p>
    </div>
</body>

</html>

As you can see, the self-closing slash of the <img> tag (the /> at the end) has been removed after parsing. Is there any way I can keep tags like <img ... /> self-closed after HTML parsing?

edited Feb 07 '20 at 08:30

Nishant

asked Feb 06 '20 at 16:44

venu gopal

Is there any way I can achieve this using lxml's HTML parser, or do I have to use lxml's XML parser? – venu gopal Feb 07 '20 at 07:33
Does this answer your question? [Why is the <img> tag not closed in HTML?](https://stackoverflow.com/questions/23890716/why-is-the-img-tag-not-closed-in-html) – Nishant Feb 07 '20 at 08:09
I think it is not needed in HTML - please see the linked question. Will it work if you parse as XML? – Nishant Feb 07 '20 at 08:11
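
For what it's worth, here is a minimal sketch of the XML route suggested in the comment above. It keeps the HTML parser for reading and only switches the serializer, by passing method='xml' to lxml.html.tostring(). The sample markup is the one from the question; the variable names as_html and as_xml are just illustrative. One caveat I would expect with this approach: XML serialization self-closes every empty element, so an empty <div></div> would also come out as <div/>, which browsers parsing the result as HTML may not treat the way you want.

```python
import lxml.html as LH

# Sample markup copied from the question
html = """<html>
<body>
    <div>
        <p><img class="img.jpg" /></p>
    </div>
</body>
</html>"""

root = LH.fromstring(html)
for el in root.iter('img'):
    # copy the class value into a new src attribute
    el.attrib['src'] = el.attrib['class']

# Default HTML serialization: void elements such as <img> get no closing slash
as_html = LH.tostring(root, encoding='unicode')

# XML (XHTML-style) serialization: empty elements are written self-closed
as_xml = LH.tostring(root, encoding='unicode', method='xml')

print(as_html)
print(as_xml)
```

If the output has to stay plain HTML, the unclosed <img> is valid as it is (see the linked question), so this only matters when a downstream consumer expects well-formed XML.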
