How to keep whitespace before document element when parsing with Java?

Question

In my application, I alter some part of XML files, which begin like this:

<?xml version="1.0" encoding="UTF-8"?>
<!-- $Id: version control yadda-yadda $ -->

<myElement>
...

Note the blank line before <myElement>. After loading, altering and saving, the result is far from pleasing:

<?xml version="1.0" encoding="UTF-8"?>
<!-- $Id: version control yadda-yadda $ --><myElement>
...

I found out that the whitespace (one newline) between the comment and the document element is not represented in the DOM at all. The following self-contained code reproduces the issue reliably:

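import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSSerializer;

// Sample input: a comment, one newline, then the document element.
String source =
    "<?xml version=\"1.0\" encoding=\"UTF-16\"?>\n<!-- foo -->\n<empty/>";
byte[] sourceBytes = source.getBytes("UTF-16");

// Parse the bytes into a DOM tree.
DocumentBuilder builder =
    DocumentBuilderFactory.newInstance().newDocumentBuilder();
Document doc =
    builder.parse(new ByteArrayInputStream(sourceBytes));

// Serialize the DOM back to a string with the DOM Level 3 Load and Save API.
DOMImplementationLS domImplementation =
    (DOMImplementationLS) doc.getImplementation();
LSSerializer lsSerializer = domImplementation.createLSSerializer();
System.out.println(lsSerializer.writeToString(doc));

// output: <?xml version="1.0" encoding="UTF-16"?>\n<!-- foo --><empty/>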

Does anyone have an idea how to avoid this? Essentially, I want the output to be the same as the input. (I know that the XML declaration will be regenerated because it's not part of the DOM, but that's not an issue here.)

I ended up hacking this into the output using a custom OutputStream class that looks for the first occurrence of "--><" and adds two newlines; I only use this stream if the first document child node is in fact a comment. Still a hack, but at least neatly encapsulated :-) – Jens Bannmann, May 20 '09 at 09:55
I have the same issue. Could you please help? http://stackoverflow.com/questions/30940162/dom-parser-wrong-childnodes-count – user3930361, Jun 23 '15 at 15:57

score 6 · Answer 1 · answered May 15 '09 at 14:33

I had the same problem. My solution was to write my own XML parser: DecentXML

Main feature: it can 100% preserve the original input, whitespace, entities, everything. It won't bother you with the details, but if your code needs to generate XML like this:

 <element
     attr="some complex value"
     />

then you can.

Aaron Digulla


Thanks for the suggestion; DecentXML certainly looks like a nice thing to keep in mind! *bookmarksIt* Good to see that at least one of the "yet-another-parser" projects has a really good reason to exist. However, for my current problem, I'd much rather stay with the standard DOM API throughout my processing code, and simply add the line in the output stage. – Jens Bannmann May 17 '09 at 18:25
Then you need to add the text nodes manually before the root element. Look at the Document object to see how to add normal (non-element) nodes. If that's not possible, you must create a filter for the writer/output stream which hacks the newline in there. – Aaron Digulla May 18 '09 at 07:20
@AaronDigulla: Can you help me with this? http://stackoverflow.com/questions/30940162/dom-parser-wrong-childnodes-count – user3930361 Jun 23 '15 at 15:57

score 3 · Answer 2 · answered May 15 '09 at 14:15

Why do you want to avoid this?

Whitespace outside the document element is defined as insignificant by the spec. As far as the infoset represented by your DOM is concerned, it simply does not exist.

Consequently, upon serializing the DOM again, it will not be there.

If you are in the process of developing something that relies on this empty line... Don't.

Tomalak


No program relies on this format, of course. However, the files contain translation data; they're checked in to version control and maintained continuously. Thus, it would be nice for viewing diffs if the only changes my app makes are intentional ones. – Jens Bannmann May 15 '09 at 14:25
I thought so... I think the only sensible way of dealing with that is not to have this empty line in the files to start with. I don't think there is any recommendable method of retaining this line. Maybe the files should, as a rule, be passed through a tidying tool before check-in to avoid these inconsistencies. – Tomalak May 15 '09 at 14:30
@Tomalak: Can you help me with this? http://stackoverflow.com/questions/30940162/dom-parser-wrong-childnodes-count – user3930361 Jun 23 '15 at 15:59

score 3 · Accepted Answer · answered May 15 '09 at 15:43

The root cause is that the standard DOM Level 3 cannot represent Text nodes as children of a Document without breaking the spec. Whitespace will be dropped by any compliant parser.

Document -- 
    Element (maximum of one),
    ProcessingInstruction,
    Comment,
    DocumentType (maximum of one)
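
To see this constraint in practice, here is a minimal sketch that lists the direct children of a parsed Document. The class and method names (DocumentChildrenDemo, topLevelNodeNames) are made up for illustration, and the sketch assumes the JDK's built-in parser. With that parser, the whitespace in the prolog should not show up as a child node at all; only the comment and the element are listed.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class, not part of any library.
class DocumentChildrenDemo {

    // Parses the given XML and returns the node names of the Document's direct children.
    static List<String> topLevelNodeNames(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            List<String> names = new ArrayList<>();
            NodeList children = doc.getChildNodes();
            for (int i = 0; i < children.getLength(); i++) {
                names.add(children.item(i).getNodeName());
            }
            return names;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample XML", e);
        }
    }

    public static void main(String[] args) {
        // A comment, a blank line, then the document element.
        String xml = "<?xml version=\"1.0\"?>\n<!-- foo -->\n\n<empty/>";
        System.out.println(topLevelNodeNames(xml));
    }
}
```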

If you require a standards-compliant solution and the objective is readability rather than 100% reproduction, I would look for it in your output mechanism.
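
A minimal sketch of such an output-stage fix, along the lines of the workaround the asker describes in the comments on the question: post-process the serialized text and re-insert a separator where a comment runs directly into the next tag. The class and method names (PrologWhitespaceFix, separateCommentFromRoot) are hypothetical, and the sketch works on a String rather than on a stream for brevity.

```java
// Hypothetical helper, not part of any library.
class PrologWhitespaceFix {

    // Inserts `separator` after the first comment that is immediately followed by a tag.
    // The caller is assumed to apply this only when the document's first child is a
    // comment, so that the first "--><" seam is the one in front of the document element.
    static String separateCommentFromRoot(String xml, String separator) {
        int seam = xml.indexOf("--><");
        if (seam < 0) {
            return xml; // no comment directly followed by a tag: nothing to fix
        }
        int insertAt = seam + 3; // position right after "-->"
        return xml.substring(0, insertAt) + separator + xml.substring(insertAt);
    }

    public static void main(String[] args) {
        String serialized = "<?xml version=\"1.0\" encoding=\"UTF-16\"?>\n<!-- foo --><empty/>";
        System.out.println(separateCommentFromRoot(serialized, "\n\n"));
    }
}
```

In real code the call would be guarded by a check such as `doc.getFirstChild() instanceof org.w3c.dom.Comment`, mirroring the asker's condition, because a "--><" sequence can also occur deeper inside the document.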

McDowell


Good answer, but this is a stupid bug in the spec in my opinion. You can certainly output text before the document element, but you can't input it? – Archie Mar 31 '11 at 16:46
@McDowell Is there anything we can do to avoid this? Please look into my question: http://stackoverflow.com/questions/30940162/dom-parser-wrong-childnodes-count – user3930361 Jun 23 '15 at 15:57

score 1 · Answer 4 · answered May 15 '09 at 14:14

In general, whitespace is considered irrelevant in XML and is thus not preserved when an XML file is parsed. Most libraries that output XML have an option for outputting it with nice formatting and correct indentation, but it will always be fairly generic. There is no "have an extra line right here" setting.

Kris


The point is that there *was* a line in the original input, and it should be kept - as is the case for all whitespace in the remainder of the document! – Jens Bannmann May 15 '09 at 14:15

score 0 · Answer 5 · answered May 15 '09 at 14:33

I agree with Kris and Tomalak: the blank line is not relevant from the XML point of view. If your application needs to produce a blank line in the output, I would suggest reviewing whether that requirement is really needed.

Anyway, if you still want that blank line to appear, I would suggest downloading the source code of the XML parser you are using and modifying that behaviour. But keep in mind that this is not standard XML and it will not be compatible with other applications.

Jdom Source
Dom4j Source Check org.dom4j.io.DOMWriter

What about XML files that are meant to be edited by human beings? In that case the original formatting is important. XML is not only for serialization, if it was then a binary format would be better. — MarioVilas, Apr 30 '12 at 14:57
