33

I'm trying to parse an XML document I retrieve from the web, but it crashes after parsing with this error:

': failed to load external entity "<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="GreenButtonDataStyleSheet.xslt"?>

That is the second line in the XML that is downloaded. Is there a way to prevent the parser from trying to load the external entity, or another way to solve this? This is the code I have so far:

import urllib2
import lxml.etree as etree

file = urllib2.urlopen("http://www.greenbuttondata.org/data/15MinLP_15Days.xml")
data = file.read()
file.close()

tree = etree.parse(data)
larsks
daveeloo

4 Answers

34

In concert with what mzjn said, if you do want to pass a string to etree.parse(), just wrap it in a StringIO object.

Example:

from lxml import etree
from StringIO import StringIO

myString = "<html><p>blah blah blah</p></html>"

tree = etree.parse(StringIO(myString))

This method is used in the lxml documentation.
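On Python 3 the StringIO module no longer exists; the equivalent classes live in io. Below is a minimal sketch of the same idea, assuming Python 3 and using io.BytesIO because downloaded data arrives as bytes. The fallback import is an assumption added only so the snippet still runs where lxml is not installed.

```python
from io import BytesIO

try:
    from lxml import etree
except ImportError:
    # Assumption: fall back to the standard library, whose parse() also
    # accepts file-like objects.
    import xml.etree.ElementTree as etree

my_bytes = b"<html><p>blah blah blah</p></html>"

# Wrapping the bytes in BytesIO gives parse() the file-like object it expects.
tree = etree.parse(BytesIO(my_bytes))
root = tree.getroot()
```

Here root is the html element and its first child is the p element.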

Mark
kevin
16

etree.parse(source) expects source to be one of

  • a file name/path
  • a file object
  • a file-like object
  • a URL using the HTTP or FTP protocol

The problem is that you are supplying the XML content as a string.

You can also do without urllib2.urlopen(). Just use

tree = etree.parse("http://www.greenbuttondata.org/data/15MinLP_15Days.xml")

Demonstration (using lxml 2.3.4):

>>> from lxml import etree
>>> tree = etree.parse("http://www.greenbuttondata.org/data/15MinLP_15Days.xml")
>>> tree.getroot()
<Element {http://www.w3.org/2005/Atom}feed at 0xedaa08>
>>>   

In a competing answer, it is suggested that lxml fails because of the stylesheet referenced by the processing instruction in the document. But that is not the problem here. lxml does not try to load the stylesheet, and the XML document is parsed just fine if you do as described above.
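To illustrate the point, here is a small sketch that parses a document carrying the same processing instruction from an in-memory file-like object. The sample document is made up for illustration (it is not the real Green Button file), and the fallback import is an assumption so the snippet runs without lxml.

```python
from io import BytesIO

try:
    from lxml import etree
except ImportError:
    # Assumption: the standard library parser is used if lxml is missing.
    import xml.etree.ElementTree as etree

# Hypothetical sample: same xml-stylesheet instruction, trimmed-down body.
sample = (
    b'<?xml version="1.0" encoding="UTF-8"?>\n'
    b'<?xml-stylesheet type="text/xsl" href="GreenButtonDataStyleSheet.xslt"?>\n'
    b'<feed xmlns="http://www.w3.org/2005/Atom"><title>demo</title></feed>'
)

# No stylesheet file exists here, yet parsing succeeds: the parser does not
# follow the href in the processing instruction.
tree = etree.parse(BytesIO(sample))
root_tag = tree.getroot().tag
```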

If you want to actually load the stylesheet, you have to be explicit about it. Something like this is needed:

from lxml import etree

tree = etree.parse("http://www.greenbuttondata.org/data/15MinLP_15Days.xml")

# Create an _XSLTProcessingInstruction object
pi = tree.xpath("//processing-instruction()")[0] 

# Parse the stylesheet and return an ElementTree
xsl = pi.parseXSL()   
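If you would rather download the document yourself, as the code in the question does, the downloaded bytes can be handed to parse() as a file-like object. The sketch below assumes Python 3 (urllib.request instead of urllib2); parse_url is a hypothetical helper name, and the fallback import is an assumption so the snippet runs without lxml. This route also covers URLs that parse() cannot open on its own.

```python
from io import BytesIO
from urllib.request import urlopen

try:
    from lxml import etree
except ImportError:
    # Assumption: the standard library parser is used if lxml is missing.
    import xml.etree.ElementTree as etree


def parse_url(url, timeout=10):
    """Download `url`, then parse the bytes through a file-like wrapper."""
    with urlopen(url, timeout=timeout) as response:
        data = response.read()
    # parse() treats a plain string as a file name or URL, so wrap the content.
    return etree.parse(BytesIO(data))


# Example call (needs network access, so it is left commented out):
# tree = parse_url("http://www.greenbuttondata.org/data/15MinLP_15Days.xml")
```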
mzjn
  • @Duke: Thank you! It's nice to finally get some positive feedback. – mzjn Aug 07 '12 at 14:40
  • I'm getting this error while parsing a URL. Any idea how to disable loading those "external entities"? I have no interest in stylesheets, just want to parse anchors from the page. – MightyPork Feb 01 '14 at 10:10
  • @MightyPork: This question is not really about "external entities"; the error message is misleading. The problem here is that the OP uses `etree.parse()` on a string object, which does not work. If you have a related problem, I think you should ask a new question. – mzjn Feb 01 '14 at 13:14
  • @mzjn I did ask, but the only advice I got was to use a different library to fetch the URL. I thought you had more insight into the problem and could help. – MightyPork Feb 01 '14 at 14:54
  • Yes downvoter, this is very probably the problem. It was in my case. – Iharob Al Asimi Sep 29 '15 at 13:20
  • etree.parse('https://www.google.com'): OSError: Error reading file 'https://www.google.com': failed to load external entity "https://www.google.com" – vault Nov 11 '16 at 11:54
  • In my case, I got the error because I tried to open an https url with parse. – Shane Lu Jan 01 '18 at 14:23
  • @ShaneLu: Yes, for some reason HTTPS does not work. Only HTTP and FTP protocols are supported (as stated in the lxml documentation). – mzjn Jan 01 '18 at 14:40
  • Thanks for the tip. I was getting the "external entity" error with valgrind on a C program. I had to switch from file.nam to file:///file.nam to get the lxml to work with valgrind. Without valgrind, lxml works with file.nam. – VectorVortec Apr 05 '23 at 20:42
4

The lxml documentation for parse() says: "To parse from a string, use the fromstring() function instead."

parse(...)
    parse(source, parser=None, base_url=None)

    Return an ElementTree object loaded with source elements.  If no parser
    is provided as second argument, the default parser is used.

    The ``source`` can be any of the following:

    - a file name/path
    - a file object
    - a file-like object
    - a URL using the HTTP or FTP protocol

    To parse from a string, use the ``fromstring()`` function instead.

    Note that it is generally faster to parse from a file path or URL
    than from an open file object or file-like object.  Transparent
    decompression from gzip compressed sources is supported (unless
    explicitly disabled in libxml2).
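A minimal sketch of the fromstring() route follows. The sample document is made up for illustration, and the fallback import is an assumption so the snippet runs without lxml. Bytes are passed rather than a decoded string because lxml rejects a Unicode string that carries an encoding declaration.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: the standard library parser is used if lxml is missing.
    import xml.etree.ElementTree as etree

# Hypothetical sample content, e.g. what file.read() would return.
xml_bytes = b'<?xml version="1.0" encoding="UTF-8"?><feed><entry>1</entry></feed>'

# fromstring() parses the content directly and returns the root element.
root = etree.fromstring(xml_bytes)

# Wrap the element if an ElementTree (what parse() returns) is needed.
tree = etree.ElementTree(root)
```

Note that fromstring() returns an element, not an ElementTree, so code written against the result of parse() needs the wrapping step or should work on root directly.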
jrwren
2

You're getting that error because the XML you're loading references an external resource:

<?xml-stylesheet type="text/xsl" href="GreenButtonDataStyleSheet.xslt"?>

lxml doesn't know how to resolve GreenButtonDataStyleSheet.xslt. You and I probably realize that it's going to be available relative to your original URL, http://www.greenbuttondata.org/data/15MinLP_15Days.xml; the trick is to tell lxml how to go about loading it.

The lxml documentation includes a section titled "Document loading and URL resolving", which has just about all the information you need.

larsks
  • Do you know if it's possible to turn off loading all external resources? I looked in the documentation but couldn't find anything. – daveeloo May 05 '12 at 05:55
  • "You're getting that error because the XML you're loading references an external resource". No. That is not why you get the error. Please see my answer. – mzjn Jul 11 '12 at 07:27