I'm beautifying/indenting some XML in Java:

<div xml:space="default"><h1 xml:space="default">Indenting mixed content in Java</h1><p xml:space="preserve">Why does indenting mixed content (like this paragraph) add whitespace around <a href="http://www.stackoverflow.com" xml:space="preserve"><strong>this strong element</strong></a>?</p></div>

When I beautify the XML, I don't want whitespace added to the contents of the <a> element, so I've specified xml:space="preserve", expecting the transformer to preserve the whitespace therein.

However, when I transform the XML, I get this:

<div>
    <h1 xml:space="default">Indenting mixed content in Java</h1>
    <p>Why does indenting mixed content (like this paragraph) add whitespace around <a href="http://www.stackoverflow.com">
            <strong xml:space="preserve">this strong element</strong>
        </a>?</p>
</div>

... with extra whitespace between the <a> and the <strong> element. (Not only that, but the </a> close tag awkwardly doesn't line up with its open tag.)

How can I prevent the prettifier from adding that whitespace? Am I doing something wrong? Here's the Java code I'm using:

import org.w3c.dom.Element;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.DocumentBuilder;
import org.w3c.dom.Document;
import java.io.ByteArrayInputStream;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.Transformer;
import java.io.StringWriter;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.stream.StreamResult;

public class XmlExample {

    public static void main(String[] argv) {
        Document xmlDoc    = parseXml("<div xml:space=\"default\">" + 
                                          "<h1 xml:space=\"default\">Indenting mixed content in Java</h1>" + 
                                          "<p xml:space=\"preserve\">Why does indenting mixed content (like this paragraph) add whitespace around " + 
                                              "<a href=\"http://www.stackoverflow.com\" xml:space=\"preserve\"><strong>this strong element</strong></a>?" + 
                                          "</p>" + 
                                      "</div>");
        String   xmlString = xmlToString(xmlDoc.getDocumentElement());
        System.out.println(xmlString);
    }

    public static Document parseXml(String xml) {
        try {
            DocumentBuilderFactory docFactory = DocumentBuilderFactory.newInstance();
            docFactory.setNamespaceAware(true);
            DocumentBuilder docBuilder = docFactory.newDocumentBuilder();

            Document doc = docBuilder.parse(new ByteArrayInputStream(xml.getBytes("UTF-8"))); 
            return doc;
        }
        catch(Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static String xmlToString(Element el) {
        try {
            TransformerFactory tf = TransformerFactory.newInstance();
            Transformer transformer = tf.newTransformer();
            transformer.setOutputProperty(OutputKeys.INDENT, "yes");
            transformer.setOutputProperty("{http://xml.apache.org/xslt}indent-amount", "4");
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter writer = new StringWriter();
            DOMSource source = new DOMSource(el);
            transformer.transform(source, new StreamResult(writer));
            return writer.getBuffer().toString().trim();
        }
        catch(Exception e) {
            throw new RuntimeException(e);
        }
    }

}
JasonMArcher
Richard JP Le Guen
  • I can't answer your question definitively - you are processing mixed content, and the indentation rules may be processor-dependent. It might be worth trying another XSLT engine such as Saxon. Here's another SO question on whitespace: http://stackoverflow.com/questions/1384802/java-how-to-indent-xml-generated-by-transformer – peter.murray.rust Jun 11 '13 at 22:17
  • @peter.murray.rust - Ya, that question and I have become very good friends recently :P I'll look into Saxon though. – Richard JP Le Guen Jun 11 '13 at 22:20
  • I'd probably trust Saxon most. Mike Kay has helped create the specs and will be very thorough on things like this. And he'll probably give you a direct answer. – peter.murray.rust Jun 11 '13 at 22:24

1 Answer

If you use a serializer that conforms to the XSLT 1.0 or XSLT 2.0 specifications, then it should respect xml:space (that is, within the scope of xml:space="preserve", indentation should be suppressed). The XSLT 2.0 specification is much more explicit on this point than XSLT 1.0, and makes it a "MUST" rather than a "SHOULD" requirement.
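
To make "within the scope of xml:space="preserve"" concrete, here is a minimal sketch of how that scope can be computed on a DOM: the nearest xml:space declaration on the element or one of its ancestors wins. The class and method names (XmlSpaceScope, isPreserved, parse) are made up for illustration; this is not how any particular serializer implements the rule.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper, for illustration only.
class XmlSpaceScope {

    /** True if the nearest xml:space declaration on el or an ancestor is "preserve". */
    static boolean isPreserved(Element el) {
        // Walk from the element up to the root; the first declaration found decides.
        for (Node n = el; n instanceof Element; n = n.getParentNode()) {
            Attr space = ((Element) n).getAttributeNodeNS(XMLConstants.XML_NS_URI, "space");
            if (space != null) {
                return "preserve".equals(space.getValue());
            }
        }
        return false; // no declaration anywhere: default whitespace handling
    }

    /** Namespace-aware parse, so that xml:space is reported in the XML namespace. */
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<div xml:space=\"default\"><h1>title</h1>"
                + "<p xml:space=\"preserve\">text <a><strong>strong</strong></a>?</p></div>");
        for (String name : new String[] {"div", "h1", "p", "a", "strong"}) {
            Element el = (Element) doc.getElementsByTagName(name).item(0);
            System.out.println(name + " preserved? " + isPreserved(el));
        }
    }
}
```

In this sample the <a> and <strong> elements carry no declaration of their own but inherit "preserve" from <p>, so a conforming serializer should not add indentation anywhere inside the paragraph.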

You're using a JAXP identity transformation rather than an XSLT transformation; there's a reference from the JAXP specs to the XSLT 1.0 specs but it's a bit woolly.
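
Because the behaviour depends on which processor JAXP picks up, it can help to print the implementation classes actually in use. The sketch below relies only on standard JAXP calls; the class name WhichTransformer is made up for illustration, and the output will differ between JDKs and classpaths.

```java
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;

// Hypothetical helper, for illustration only.
class WhichTransformer {

    /** Returns the class names of the factory and identity transformer JAXP selected. */
    static String describe() {
        try {
            TransformerFactory tf = TransformerFactory.newInstance();
            // No stylesheet argument: this is the identity transformation from the question.
            Transformer identity = tf.newTransformer();
            return "factory=" + tf.getClass().getName()
                    + ", transformer=" + identity.getClass().getName();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

If the factory turns out to be the JDK's built-in one, the indentation you see is that implementation's behaviour rather than anything the XSLT 2.0 serialization rules promise.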

If you use Saxon you should get the desired behaviour. Saxon also allows you to suppress indentation for specific elements using the SUPPRESS_INDENTATION output parameter, so you don't even have to include xml:space in the document being serialized.
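
As a rough sketch of the Saxon route through plain JAXP: the code below asks for Saxon's factory by class name and, only if it loads, sets the suppress-indentation property. Both string constants are assumptions written from memory (the factory class name and the Clark-notation property name behind SUPPRESS_INDENTATION), so check them against the documentation for your Saxon version. The class name SaxonIndentSketch and the method prettyPrint are made up for illustration. Without Saxon on the classpath the code falls back to the default JAXP transformer and the property is not set.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.TransformerFactoryConfigurationError;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper, for illustration only.
class SaxonIndentSketch {

    // Assumed names: verify both against your Saxon release.
    static final String SAXON_FACTORY = "net.sf.saxon.TransformerFactoryImpl";
    static final String SUPPRESS_INDENTATION = "{http://saxon.sf.net/}suppress-indentation";

    /**
     * Indents xml; when Saxon is available, elements named in mixedContentElements
     * (whitespace-separated) are asked to be left unindented.
     */
    static String prettyPrint(String xml, String mixedContentElements) {
        try {
            TransformerFactory tf;
            boolean saxon = true;
            try {
                tf = TransformerFactory.newInstance(SAXON_FACTORY, null);
            } catch (TransformerFactoryConfigurationError saxonNotAvailable) {
                tf = TransformerFactory.newInstance(); // fall back to the default processor
                saxon = false;
            }
            Transformer t = tf.newTransformer();
            t.setOutputProperty(OutputKeys.INDENT, "yes");
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            if (saxon) {
                // Only meaningful to Saxon; skipped for other processors.
                t.setOutputProperty(SUPPRESS_INDENTATION, mixedContentElements);
            }
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString().trim();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<div><h1>Indenting mixed content in Java</h1>"
                + "<p>Whitespace around <a href=\"http://www.stackoverflow.com\">"
                + "<strong>this strong element</strong></a>?</p></div>";
        System.out.println(prettyPrint(xml, "p"));
    }
}
```

Naming p here reflects the idea that the whole paragraph is mixed content; with this approach the xml:space attributes would no longer be needed in the document itself.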

Michael Kay
  • +1 I remember lots of discussion on whitespace and as you indicate I suspect it wasn't consistently understood or implemented in 1.0 versions. – peter.murray.rust Jun 12 '13 at 09:28