I am trying to convert an HTML page into a tree structure. I derived this class from html.parser.HTMLParser (I removed what I actually do with each tag, as it is not relevant):

import html.parser

class PageParser(html.parser.HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("start "+tag)
    def handle_endtag(self, tag):
        print("end "+tag)
    def handle_startendtag(self, tag, attrs):
        print("startend "+tag)

I expected empty elements to trigger the handle_startendtag method. The problem is that when the parser encounters an empty element like <meta>, only the handle_starttag method is called. From my class's point of view, the tag is never closed:

parser = PageParser()
parser.feed('<div> <meta charset="utf-8"> </div>')

prints:

start div
start meta
end div

I need to know when each element has been closed to build the tree correctly. How can I tell whether a tag is an empty element?

Hey
  • 2
    From the docs: _"This parser does not check that end tags match start tags or call the end-tag handler for elements which are closed implicitly by closing an outer element."_ https://docs.python.org/3/library/html.parser.html – Hamish May 09 '17 at 10:26
  • You should either be inputting strict XML, where the `<meta>` tag is no longer valid and you have to write it as `<meta />`, or keep track of a list of tags that might come as empty tags, like the `<br>` tag.
    – Krishna Pradyumna Mokshagundam May 09 '17 at 10:27
  • 1
    http://stackoverflow.com/questions/3115448/best-way-to-convert-the-this-html-file-into-an-xml-file-using-python – Henry Heath May 09 '17 at 10:30

1 Answer

Checking the documentation, and specifically this example:

Parsing an element with a few attributes and a title:

>>> parser.feed('<img src="python-logo.png" alt="The Python logo">')
Start tag: img
    attr: ('src', 'python-logo.png')
    attr: ('alt', 'The Python logo')

We can determine that this is the expected behavior.
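
To see the difference in practice, here is a minimal sketch that records which handlers fire for HTML-style and XHTML-style markup. The `EventRecorder` class and the `record` helper are hypothetical names introduced only for this illustration. As far as I can tell, `handle_startendtag` is only invoked when the tag is explicitly written with a trailing `/>`.

```python
from html.parser import HTMLParser


class EventRecorder(HTMLParser):
    """Hypothetical helper: records which handler fired for each tag."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_startendtag(self, tag, attrs):
        self.events.append(("startend", tag))


def record(markup):
    """Feed markup to a fresh parser and return the recorded events."""
    parser = EventRecorder()
    parser.feed(markup)
    parser.close()
    return parser.events


# HTML-style void element: only the start handler fires for <meta>.
html_style = record('<div><meta charset="utf-8"></div>')
# XHTML-style self-closing tag: handle_startendtag fires instead.
xhtml_style = record('<div><meta charset="utf-8" /></div>')

print(html_style)
print(xhtml_style)
```

So the parser reports what is literally written in the markup; it does not consult a list of void elements on its own.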

The best suggestion comes from @HenryHeath's comment: use BeautifulSoup.

If you don't want to use it though, you can work around the expected behavior of HTMLParser as follows:

  • Look up the list of void elements in the HTML 5.2 specification.
  • Create a list with those element names:

    void_elements = ['area', 'base', ... , 'wbr']
    
  • In handle_starttag check if the tag is in the void_elements list:

    class PageParser(html.parser.HTMLParser):
        def handle_starttag(self, tag, attrs):
            if tag in void_elements:
                # Do whatever handle_startendtag() is supposed to do
                print("void element "+tag)
            else:
                print("start "+tag)
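
Since the goal in the question is a tree, the workaround above could be extended roughly as follows. This is only a sketch under a few assumptions: the `VOID_ELEMENTS` set is my own transcription of the void elements from the HTML specification (check it against the version you target), and the `TreeBuilder` class, the dictionary node layout, and the `build_tree` helper are hypothetical names chosen for the example, not part of `html.parser`.

```python
from html.parser import HTMLParser

# Assumption: void elements as listed in the HTML specification.
# Verify against the spec version you care about before relying on it.
VOID_ELEMENTS = frozenset({
    "area", "base", "br", "col", "embed", "hr", "img", "input",
    "link", "meta", "param", "source", "track", "wbr",
})


class TreeBuilder(HTMLParser):
    """Hypothetical sketch: builds a nested dict tree from start/end events."""

    def __init__(self):
        super().__init__()
        self.root = {"tag": "#root", "attrs": {}, "children": []}
        self.stack = [self.root]  # currently open elements, innermost last

    def _add_node(self, tag, attrs):
        node = {"tag": tag, "attrs": dict(attrs), "children": []}
        self.stack[-1]["children"].append(node)
        return node

    def handle_starttag(self, tag, attrs):
        node = self._add_node(tag, attrs)
        # Void elements never get an end tag, so do not leave them open.
        if tag not in VOID_ELEMENTS:
            self.stack.append(node)

    def handle_startendtag(self, tag, attrs):
        # Explicit <tag /> syntax: always a leaf.
        self._add_node(tag, attrs)

    def handle_endtag(self, tag):
        if tag in VOID_ELEMENTS:
            return  # stray </br> and similar: nothing is open for them
        # Close the nearest matching open element; ignore unmatched end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i]["tag"] == tag:
                del self.stack[i:]
                break


def build_tree(markup):
    """Parse markup and return the synthetic root node."""
    builder = TreeBuilder()
    builder.feed(markup)
    builder.close()
    return builder.root


tree = build_tree('<div> <meta charset="utf-8"> </div>')
div = tree["children"][0]
print(div["tag"], [child["tag"] for child in div["children"]])
```

Popping back to the nearest matching open element in `handle_endtag` is one way to cope with elements that are closed implicitly, which the parser does not report. Text nodes are ignored here because `handle_data` is not overridden.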
    

Good luck :)

John Moutafis
  • 1
    Your solution worked, thank you. Parsing a Twitter page, I found that they use the `link` tag as an empty element, but it's not in the list you linked to. I don't know if it should be (maybe it's not standard), but I am mentioning it here in case someone encounters the same problem. – Hey May 09 '17 at 11:01