Urllib combined together with elementtree

Question

I'm having a few problems parsing simple HTML with the ElementTree module from the Python standard library. This is my source code:

from urllib.request import urlopen
from xml.etree.ElementTree import ElementTree

import sys

def main():
    site = urlopen("http://1gabba.in/genre/hardstyle")
    try:
        html = site.read().decode('utf-8')
        xml = ElementTree(html)
        print(xml)
        print(xml.findall("a"))        
    except:
        print(sys.exc_info())

if __name__ == '__main__':
    main()

However, this fails. I get the following output on my console:

<xml.etree.ElementTree.ElementTree object at 0x00000000027D14E0>
(<class 'AttributeError'>, AttributeError("'str' object has no attribute 'findall'",), <traceback object at 0x0000000002910B88>)

So xml is indeed an ElementTree object, and according to the documentation the ElementTree class has a findall method. One more thing: xml.find("a") works fine, but it returns an int instead of an Element instance.

So could anybody help me out? What am I misunderstanding?

score 2 · Accepted Answer · edited May 23 '17 at 12:22

Replace ElementTree(html) with ElementTree.fromstring(html), and change your import statement to say from xml.etree import ElementTree.

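The problem here is that the ElementTree constructor doesn't expect a string as its input -- it expects an Element object. In your run the string was simply stored as the tree's root, so findall was looked up on a str (hence the AttributeError), and find("a") ended up calling str.find, which returns an int. The function xml.etree.ElementTree.fromstring() is the easiest way to parse XML from a string; it returns the root Element.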
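A minimal sketch of the suggested change, assuming a small hand-written sample string (the markup below is made up for illustration and is well-formed, unlike many real pages):

```python
from xml.etree import ElementTree

# Hypothetical, well-formed sample markup (not the page from the question).
sample = "<html><body><a href='/one'>one</a><p><a href='/two'>two</a></p></body></html>"

# fromstring() parses the text and returns the root Element.
root = ElementTree.fromstring(sample)

# findall("a") only matches direct children of the element it is called on;
# ".//a" searches the whole subtree.
direct = root.findall("a")
links = [a.get("href") for a in root.findall(".//a")]
print(direct)
print(links)

# Wrap the root in an ElementTree if a tree object is needed.
tree = ElementTree.ElementTree(root)
print(tree.getroot().tag)
```

Note that even once parsing succeeds, findall("a") on the root finds nothing here, because the links are nested deeper in the document; a path such as ".//a" is probably closer to what the question intends.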

I'm guessing that an XML parser isn't what you really want for this task, given that you're parsing HTML (which is not necessarily valid XML). You might want to take a look at:

answered Mar 12 '12 at 18:31

Edward Loper

Does not work: I get ParseError(ExpatError('mismatched tag: line 51, column 159',),), while html has type 'str', so I don't know what's going wrong here. – wvd Mar 12 '12 at 18:36
@wvd: In many cases, HTML files are not valid XML. E.g., HTML can contain an opening tag without a matching closing tag. ElementTree will fail unless the string you give it is 100% valid XML. In the case of the URL you gave, it includes a tag with no "close tag", which is valid HTML but not valid XML. – Edward Loper Mar 12 '12 at 18:39
Ah so it's complaining about that, makes sense! Thanks for the answer. – wvd Mar 12 '12 at 18:41
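The failure discussed in these comments can be reproduced offline. The sketch below uses a made-up fragment with an unclosed br tag (an assumption for illustration; the actual offending tag on the page is not shown above). It then collects the links with the standard library's html.parser, a tolerant parser that is not mentioned in the answers and is offered here only as one possible alternative. The class name LinkCollector is hypothetical.

```python
from html.parser import HTMLParser
from xml.etree import ElementTree

# Hypothetical fragment: <br> is never closed, which HTML allows and XML does not.
fragment = "<div><a href='/one'>one</a><br><a href='/two'>two</a></div>"

try:
    ElementTree.fromstring(fragment)
    parsed_as_xml = True
except ElementTree.ParseError as err:
    parsed_as_xml = False
    print("not well-formed XML:", err)


class LinkCollector(HTMLParser):
    """Collect href values of <a> tags; tolerant of unclosed tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # attrs is a list of (name, value) pairs.
            href = dict(attrs).get("href")
            if href is not None:
                self.hrefs.append(href)


collector = LinkCollector()
collector.feed(fragment)
collector.close()
print(collector.hrefs)
```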

score 0 · Answer 2 · answered Mar 12 '12 at 18:47

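The line should pass a file rather than the decoded string, because the file argument expects a file name or a file object (such as the response returned by urlopen, before read() has been called on it):

xml = ElementTree(file=site)

P.S.: The above will work only when the document is well-formed XML. If there is an error in the XML structure, or the input is HTML that is not valid XML, it will raise ParseError.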
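As a sketch of what the file argument accepts, the example below hands ElementTree an in-memory file object. Here io.BytesIO stands in for the urlopen response (an assumption so the example runs offline), and the XML content is invented:

```python
import io
from xml.etree.ElementTree import ElementTree, ParseError

# Stand-in for a network response: any object with a read() method works.
good = io.BytesIO(b"<root><a>first</a><a>second</a></root>")
tree = ElementTree(file=good)
texts = [a.text for a in tree.findall("a")]
print(texts)

# Input that is not well-formed raises ParseError.
bad = io.BytesIO(b"<root><a>unclosed</root>")
try:
    ElementTree(file=bad)
    failed = False
except ParseError:
    failed = True
print(failed)
```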

You might like to use BeautifulSoup for HTML parsing. If you want to use XPath with lxml, you might also like html5lib.

It is as easy as:

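import html5lib  # third-party package

# html is assumed here to be a response object whose .content holds the page bytes
tree = html5lib.parse(html.content, treebuilder='lxml', namespaceHTMLElements=False)
# tree is an lxml object (parsed from any HTML, even malformed) supporting findall and find with XPath expressions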
