8

First off, let me say that I am new to SAX and Java.

I am trying to read information from an XML file that is not well formed.

When I try to use the SAX or DOM Parser I get the following error in response:

The markup in the document following the root element must be well-formed.

This is how I set up my XML file:

<format type="filename" t="13241">0;W650;004;AG-Erzgeb</format>
<format type="driver" t="123412">001;023</format>
   ...

Can I force the SAX or DOM parser to parse XML files even if they are not well-formed?

Thank you for your help. Much appreciated. Haythem

Sunil D.
Haythem
  • 2
    FYI: By definition... If it's not well formed it's **not** XML. http://en.wikipedia.org/wiki/XML#Well-formedness_and_error-handling – Chris Nava Mar 23 '10 at 18:25

3 Answers

20

Your best bet is to make the XML well-formed, probably by pre-processing it a bit. In this case, you can achieve that simply by putting an XML declaration on (and even that's optional) and providing a root element (which is not optional), like this:

<?xml version="1.0"?>
<wrapper>
    <format type="filename" t="13241">0;W650;004;AG-Erzgeb</format>
    <format type="driver" t="123412">001;023</format>
</wrapper>

There I've arbitrarily picked the name "wrapper" for the root element; it can be whatever you like.
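The wrapping does not have to be done on disk. Below is a minimal sketch that adds the root element on the fly by chaining three streams with `SequenceInputStream` and handing the result to a DOM parser. The class and method names (`WrapFragments`, `wrap`, `parse`) are made up for illustration. It assumes the input has no XML declaration of its own (a declaration would end up after `<wrapper>` and be rejected) and that its encoding is compatible with UTF-8.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Collections;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper: names are illustrative, not part of any library.
class WrapFragments {

    // Surrounds a stream of sibling elements with a synthetic <wrapper> root.
    static InputStream wrap(InputStream fragments) {
        InputStream open = new ByteArrayInputStream(
                "<wrapper>".getBytes(StandardCharsets.UTF_8));
        InputStream close = new ByteArrayInputStream(
                "</wrapper>".getBytes(StandardCharsets.UTF_8));
        // The three streams are read one after another, as if concatenated.
        return new SequenceInputStream(
                Collections.enumeration(Arrays.asList(open, fragments, close)));
    }

    // Parses the wrapped stream with the JDK's default DOM parser.
    static Document parse(InputStream fragments) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(wrap(fragments));
    }

    public static void main(String[] args) throws Exception {
        String data =
                "<format type=\"filename\" t=\"13241\">0;W650;004;AG-Erzgeb</format>\n"
              + "<format type=\"driver\" t=\"123412\">001;023</format>\n";
        Document doc = parse(
                new ByteArrayInputStream(data.getBytes(StandardCharsets.UTF_8)));

        // Print each <format> element's type attribute and text content.
        NodeList formats = doc.getElementsByTagName("format");
        for (int i = 0; i < formats.getLength(); i++) {
            Element e = (Element) formats.item(i);
            System.out.println(e.getAttribute("type") + ": " + e.getTextContent());
        }
    }
}
```

For a real file, a `FileInputStream` can be passed to `parse` in place of the in-memory stream used in `main`.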

T.J. Crowder
  • 7
    I'd just like to add that you don't necessarily need to do that modification on the disk, but that you could do it on the fly by providing a filtering `InputStream`/`Reader`. Especially for big files (or reading XML from a URL) this can be very useful. A `SequenceInputStream` could be useful here: http://java.sun.com/javase/6/docs/api/java/io/SequenceInputStream.html – Joachim Sauer Mar 23 '10 at 11:34
  • Good possibility. But wouldn't it be easier to change the parser itself? Can I override the parse() method so that it ignores the non-well-formed status? – Haythem Mar 23 '10 at 11:45
  • 2
    Haythem: probably not, because the parser is deep within the library and the behavior of such a parser would be undefined (the XML libraries don't know how to handle XML with more than one root element). Doing it this way instantly makes your XML well-formed and **all** XML-aware tools can suddenly handle it just fine (provided you have no other incorrect parts in there). – Joachim Sauer Mar 23 '10 at 11:58
1

Hint: using SAX or StAX you can successfully parse a document that is not well-formed up to the point where the FIRST well-formedness error is encountered.

(I know that this is not of too much help...)
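To illustrate, here is a rough sketch that feeds the question's data to the JDK's default SAX parser and keeps whatever was reported before the parser gave up. The class name `ParseUntilError` and the method `readUntilError` are invented for this example. With the sample data, the parser is expected to stop at the second `<format>` element, since that is where the document stops being well-formed.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: names are illustrative, not part of any library.
class ParseUntilError {

    // Returns "elementName=text" for every element closed before the first error.
    static List<String> readUntilError(String xml) throws Exception {
        final List<String> values = new ArrayList<>();

        DefaultHandler handler = new DefaultHandler() {
            private final StringBuilder text = new StringBuilder();

            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attrs) {
                text.setLength(0); // start collecting text for this element
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                values.add(qName + "=" + text);
            }
        };

        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (SAXParseException e) {
            // First well-formedness error: parsing stops here, but the events
            // delivered before this point have already been recorded.
            System.err.println("Stopped at line " + e.getLineNumber()
                    + ": " + e.getMessage());
        }
        return values;
    }

    public static void main(String[] args) throws Exception {
        String data =
                "<format type=\"filename\" t=\"13241\">0;W650;004;AG-Erzgeb</format>\n"
              + "<format type=\"driver\" t=\"123412\">001;023</format>\n";
        System.out.println(readUntilError(data));
    }
}
```

This only recovers the content before the error, so for the file in the question it is not a substitute for adding a root element.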

Yaneeve
0

DOM scans your XML file and builds a tree from it. The root node of that tree is the document's root element, like the wrapper element in the first answer. If the parser can't find a single root element, it can't even build the tree. So it's better to pre-process the XML file before parsing it with DOM or SAX.

jasonfungsing