
I'm trying to get a reader to recover from broken XML. Using the libxml2.XML_PARSE_RECOVER option with the DOM API (libxml2.readDoc) works: it recovers from entity problems.

However, using the same option with the reader API (which is essential due to the size of the documents we are parsing) does not work. It just gets stuck in a perpetual loop, with reader.Read() returning -1:

Sample code (with small example):

import libxml2

DOC = "<a>some broken & xml</a>"

reader = libxml2.readerForDoc(DOC, "urn:bogus", None, libxml2.XML_PARSE_RECOVER | libxml2.XML_PARSE_NOERROR)

ret = reader.Read()
while ret:  # Read() returns -1 on error, and -1 is truthy, so this never exits
    print 'ret: %d' % ret
    print "node name: ", reader.Name(), reader.NodeType()
    ret = reader.Read()

Any ideas how to recover correctly?

bee

4 Answers


I'm not too sure about the current state of the libxml2 bindings; even the libxml2 site suggests using lxml instead. Parsing this tree while ignoring the stray & is nice and clean in lxml:

from cStringIO import StringIO
from lxml import etree

DOC = "<a>some broken & xml</a>"

reader = etree.XMLParser(recover=True)
tree = etree.parse(StringIO(DOC), reader)
print etree.tostring(tree.getroot())

The parsers page in the lxml docs goes into more detail about setting up a parser and iterating over its contents.

Edit:

If you want to parse a document incrementally, the XMLParser class can be used as well, since it is a subclass of _FeedParser:

DOC = "<a>some broken & xml</a>"
reader = etree.XMLParser(recover=True)

# feed the parser one character at a time to simulate incremental input
for data in StringIO(DOC).read():
    reader.feed(data)

tree = reader.close()
print etree.tostring(tree)
dcolish
  • Unfortunately I looked into lxml too, but your suggestion above uses the DOM api, due to the size of documents that isn't an option. The lxml iterparse API doesn't support recovery. – bee Oct 30 '10 at 09:58
  • If you're only trying to parse incrementally, look into the _FeedParser interface for lxml, I'll edit the above sample with its usage. I have not been able to find an iterative method for parsing that yields elements as they are parsed. http://codespeak.net/lxml/api/lxml.etree._FeedParser-class.html – dcolish Oct 30 '10 at 17:00
  • Thanks for all your efforts. Technically, what we need is both incremental parsing and event-driven pulling of elements with recovery. A shame lxml doesn't fit these requirements. – bee Oct 30 '10 at 18:28

Isn't the XML broken in some consistent way? Is there some pattern you could follow to repair your XML before parsing?

For example, if the errors are caused only by unescaped ampersands, and you don't use CDATA sections or processing instructions, the input can be repaired with a regexp.
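A minimal sketch of that repair, assuming (as above) no CDATA sections or processing instructions; the pattern and function name here are illustrative, not from any library:

```python
import re

# Match an '&' that is NOT already the start of an entity reference
# such as &amp;, &#38;, or &#x26;.
BARE_AMP = re.compile(r'&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)')

def repair_ampersands(text):
    """Escape bare ampersands so the document becomes well-formed."""
    return BARE_AMP.sub('&amp;', text)

repair_ampersands("<a>some broken & xml</a>")
# -> '<a>some broken &amp; xml</a>'
```

Ampersands that already begin an entity reference are left alone, so running the repair twice is harmless.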

EDIT: Then take a look at sgmllib in the Python standard library. BeautifulSoup uses it, so it may be useful in your case. (BeautifulSoup itself offers only a tree representation, not events.)

Krab
  • In the examples I've looked at, each individual source has broken XML, and each in different ways! Another common mistake is opening and closing tags whose casing doesn't match. It'd be difficult to work around every single one, reliably at least. To top it off, having them fix the sources isn't an option - we have to support them as the previous provider did! – bee Oct 30 '10 at 18:36

Consider using xml.sax. When I'm presented with really malformed XML that can have a plethora of different problems, I try dividing the problem into small pieces.

You mentioned that you have a very large XML file; it probably has many records that you process serially, and each record (e.g. <item>...</item>) presumably has a start and an end tag. These will be your recovery points.

In xml.sax you provide the reader, the handler, and the input source. At worst, a single record will be unrecoverable with this technique. It's a little more setup, but incrementally parsing a malformed feed one record at a time, logging the bad records as you go, is probably the best you can do.

In the logs, make sure to give yourself enough information to rebuild the original record, so you can add recovery code for all the cases you'll no doubt have to handle (e.g. write out a badrecords_today's date.xml so you can reprocess it manually).
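A minimal standard-library sketch of this record-at-a-time strategy; the <item> delimiter, regexp, and function names are hypothetical stand-ins for whatever record boundaries the real feed uses:

```python
import re
import xml.sax

# Hypothetical record delimiter: split the feed on <item>...</item> spans
# so that one malformed record cannot abort the whole parse.
RECORD = re.compile(r'<item>.*?</item>', re.DOTALL)

class TextHandler(xml.sax.ContentHandler):
    """Collects character data from a single record."""
    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.chunks = []

    def characters(self, content):
        self.chunks.append(content)

def parse_records(feed):
    good, bad = [], []
    for record in RECORD.findall(feed):
        handler = TextHandler()
        try:
            xml.sax.parseString(record.encode('utf-8'), handler)
            good.append(''.join(handler.chunks))
        except xml.sax.SAXParseException:
            bad.append(record)  # log these for later manual reprocessing
    return good, bad
```

For a feed like "<item>ok</item><item>broken & record</item>", the first record parses cleanly while the second lands in the bad-record list instead of killing the run.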

Good luck.

Yzmir Ramirez

Or, you could use BeautifulSoup. It does a nice job of recovering broken XML.

  • BeautifulSoup is DOM-based, so it loads the entire document into memory, which doesn't meet the requirements. It's also rather slow for anything sizeable, memory requirements aside. – bee Jan 24 '11 at 07:10