0

I am trying to parse a very large XML file, lowercasing the text and removing punctuation. When I parse the file with cET's iterparse function, which is meant for big files, it eventually hits a badly formed tag or character and raises a syntax error:

SyntaxError: not well-formed (invalid token): line 639337, column 4

Note: The file is too large for me to open and read, so I cannot see where the problem is.

How can I skip or fix this?

from xml.etree import cElementTree as cET

for event, elem in cET.iterparse(xmlFile, events=("start", "end")):
    pass  # ...do something with elem...
user1262403

2 Answers

4

Use lxml instead of the standard-library ElementTree. It supports the same API but can also handle broken XML: with recover=True, the parser attempts to repair the document where possible:

from lxml import etree

# lxml's iterparse takes parser options such as recover as keyword arguments
context = etree.iterparse(filename, recover=True)
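
To connect this to the lowercasing and punctuation removal in the question, here is a minimal sketch, assuming a recent lxml whose iterparse accepts recover=True. The names normalize and iter_clean_text are hypothetical helpers, not part of lxml, and only ASCII punctuation is stripped:

```python
import string

# Translation table that deletes ASCII punctuation.
_STRIP_PUNCT = str.maketrans("", "", string.punctuation)


def normalize(text):
    """Lowercase text and drop ASCII punctuation; None becomes ''."""
    return (text or "").lower().translate(_STRIP_PUNCT)


def iter_clean_text(path):
    """Yield (tag, cleaned text) per element, tolerating malformed XML."""
    # Imported here so normalize() can be used without lxml installed.
    from lxml import etree

    # recover=True asks the parser to carry on past malformed input
    # instead of raising; what it keeps or drops depends on the damage.
    for _event, elem in etree.iterparse(path, events=("end",), recover=True):
        yield elem.tag, normalize(elem.text)
        # Free memory as we go: clear the element and drop finished siblings.
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]
```

Clearing each element after use keeps memory roughly flat on a very large file. Note that a recovering parser may silently drop the damaged parts, so the output is best treated as a best-effort reading of the file.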
Martijn Pieters
0

You could use a tool like xmllint to verify and clean your XML. The errors reported by this tool should help you to fix the XML file.

Edit: An example:

$ cat invalid.xml 
<?xml version="1.0"?>
<foo>
<bar>
</foo>
$ xmllint invalid.xml 
invalid.xml:4: parser error : Opening and ending tag mismatch: bar line 3 and foo
</foo>
      ^
invalid.xml:5: parser error : Premature end of data in tag foo line 2

^
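
If the file is too large to open in an editor, the lines around a reported error can still be printed by streaming the file. The following is a rough sketch using only the standard library; lines_around is a hypothetical helper, UTF-8 is an assumed encoding, and the line numbering may differ slightly from the parser's if the file uses unusual line endings:

```python
from itertools import islice


def lines_around(path, line_no, context=2, encoding="utf-8"):
    """Return (number, text) pairs for the lines near line_no (1-based)."""
    start = max(1, line_no - context)
    stop = line_no + context
    # errors="replace" keeps undecodable bytes from aborting the read.
    with open(path, encoding=encoding, errors="replace") as handle:
        # islice skips ahead lazily, so only one line is held at a time.
        window = islice(handle, start - 1, stop)
        return [(start + i, line.rstrip("\n")) for i, line in enumerate(window)]
```

For the error in the question, something like `for n, text in lines_around(xmlFile, 639337): print(n, repr(text))` would show the offending line and its neighbours; using repr makes stray control characters visible.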
  • Thank you, but even if I can see where the error is, I still cannot fix anything in the file. It takes minutes just to open it, let alone navigate it. I also assume there are plenty of such errors. I think it has to do with the encoding used when I open the file. Otherwise, I would like to find a way to skip this part and move on to the next one. – user1262403 Oct 14 '12 at 13:48