3

I'm receiving XML files from an external source over which I have no control. Some of the XML files are broken. Specifically, towards the end of the file, some closing tags are missing. It goes something like this:

<?xml version="1.0" encoding="UTF-8" ?>
<a>
  <b>
    <c/>
  </b>
  <b>
    <c/>
</a>

I think our system will be fine if we simply ignore the elements that don't have a matching closing tag.

What library can I use to parse what I can from such XML files?

Steve McLeod
  • Do you have a schema for the documents? It seems like that could make a difference on how easy it is to recover from errors... – xdhmoore Oct 16 '14 at 14:31
  • There are parsing techniques that can recover from such errors in various ways. But I do not know what might be available for XML. And I doubt you want to develop that yourself. – babou Oct 16 '14 at 14:35
  • Using StAX seems to do the trick – Steve McLeod Oct 16 '14 at 14:48
  • What would you do if someone sent you broken Javascript? What do you do if there's a fly in your soup? Complain to the supplier, please, or things will never get better. – Michael Kay Oct 16 '14 at 15:30
  • And... another question gets marked as a duplicate when it isn't actually a duplicate. Sigh. – Steve McLeod Oct 16 '14 at 17:41

3 Answers

1

You will need to parse it manually, because a conforming XML parser will reject a document that is not well formed. One possibility is to use a SAX parser: it will report the document's content up to the error and then stop.
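As a rough illustration of the "parse up to the error, then stop" idea, here is a minimal sketch using the JDK's built-in StAX API (the approach the asker mentions in the comments) rather than SAX. The class name `LenientXmlReader` and the method `readCompleteElements` are made up for this example, and it only records the names of elements whose closing tag was actually seen; a real application would build its own objects at that point.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, not part of any library.
final class LenientXmlReader {

    // Returns the local names of all elements that were closed
    // before the parser hit the first well-formedness error.
    static List<String> readCompleteElements(String xml) {
        List<String> completed = new ArrayList<>();
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // The files come from an external source, so DTD processing is switched off.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        try {
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.END_ELEMENT) {
                    // A closing tag (or the end of an empty element such as <c/>) was seen.
                    completed.add(reader.getLocalName());
                }
            }
        } catch (XMLStreamException e) {
            // The document is not well formed from this point on.
            // Everything collected so far is kept; the rest is ignored.
        }
        return completed;
    }
}
```

Elements that are still open when the error occurs (the second `<b>` and `<a>` in the question's sample) never produce an end event, so they are skipped.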

Rocky Pulley
0

An XML parser should not support this kind of behavior. But if you can identify what's wrong with the file, you could react, clean it up, and try again.
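One way the "clean it up and try again" step might look, assuming the only defect is missing closing tags at the end of the file: copy the document through the JDK's StAX reader into a StAX writer, and when the reader fails, let the writer close whatever is still open. The class `XmlRepair` and the method `closeOpenElements` are invented for this sketch. It ignores namespaces, comments, and processing instructions, and does not write an XML declaration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import javax.xml.stream.XMLStreamWriter;

// Hypothetical helper, not part of any library.
final class XmlRepair {

    // Re-serializes the readable part of brokenXml and appends the missing end tags.
    static String closeOpenElements(String brokenXml) {
        StringWriter out = new StringWriter();
        try {
            XMLStreamWriter writer = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            try {
                copy(brokenXml, writer);
            } catch (XMLStreamException stopHere) {
                // Copying stopped early, most likely at the first well-formedness error.
            }
            // writeEndDocument() writes end tags for every element still open.
            writer.writeEndDocument();
            writer.flush();
            writer.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not write repaired XML", e);
        }
        return out.toString();
    }

    // Copies elements, attributes, and text from the input to the writer.
    private static void copy(String xml, XMLStreamWriter writer) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
        while (reader.hasNext()) {
            switch (reader.next()) {
                case XMLStreamConstants.START_ELEMENT:
                    writer.writeStartElement(reader.getLocalName());
                    for (int i = 0; i < reader.getAttributeCount(); i++) {
                        writer.writeAttribute(reader.getAttributeLocalName(i),
                                              reader.getAttributeValue(i));
                    }
                    break;
                case XMLStreamConstants.END_ELEMENT:
                    writer.writeEndElement();
                    break;
                case XMLStreamConstants.CHARACTERS:
                case XMLStreamConstants.CDATA:
                case XMLStreamConstants.SPACE:
                    writer.writeCharacters(reader.getText());
                    break;
                default:
                    break; // comments, processing instructions, etc. are dropped
            }
        }
    }
}
```

Note the design choice: this keeps the partially read elements (the second `<b>` in the question's sample) and closes them, instead of discarding them. Whether that is acceptable depends on what the downstream system expects.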

Celeb
0

I don't know whether JSoup would work. It's supposed to be forgiving with HTML, but I'm not sure about XML.

xdhmoore