
With Java 9 there was a change in the way javax.xml.transform.Transformer with OutputKeys.INDENT handles CDATA sections. In short, in Java 8 an element named 'test' containing a CDATA section would be serialized as:

<test><![CDATA[data]]></test>

But with Java 9 the same input results in:

<test>
    <![CDATA[data]]>
</test>

This is not the same XML, because the added whitespace becomes part of the element's text content.

I understood (from a source no longer available) that for Java 9 there was a workaround using a DocumentBuilderFactory with setIgnoringElementContentWhitespace(true), but this no longer works in Java 11.

Does anyone know a way to deal with this in Java 11? I'm either looking for a way to prevent the extra newlines (but still be able to format my XML), or be able to ignore them when parsing the XML (preferably using SAX).

Unfortunately I don't know what the CDATA tag will actually contain in my application. It might begin or end with white space or newlines so I can't just strip them when reading the XML or actually setting the value in the resulting object.

Sample program to demonstrate the issue:

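import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class CdataIndentDemo
{
    public static void main(String[] args) throws TransformerException, ParserConfigurationException, IOException, SAXException
    {
        String data = "data";

        StreamSource source = new StreamSource(new StringReader("<foo><bar><![CDATA[" + data + "]]></bar></foo>"));
        StreamResult result = new StreamResult(new StringWriter());

        // Identity transform with indentation switched on
        Transformer tform = TransformerFactory.newInstance().newTransformer();
        tform.setOutputProperty(OutputKeys.INDENT, "yes");
        tform.transform(source, result);

        String xml = result.getWriter().toString();

        System.out.println(xml); // I expect bar and CDATA to be on same line. This is true for Java 8, false for Java 11

        // Parse the serialized XML again and read back the content of <bar>
        Document document = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));

        String resultData = document.getElementsByTagName("bar")
            .item(0)
            .getTextContent();

        System.out.println(data.equals(resultData)); // True for Java 8, false for Java 11
    }
}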

EDIT: For future reference, I've submitted a bug report to Oracle, and this is fixed in Java 14: https://bugs.java.com/bugdatabase/view_bug.do?bug_id=JDK-8223291

Rick
  • You should edit your question and add sample Java code that demonstrates the problem (generate a small XML + transform). It is a lot easier to start with a working example. – Robert Apr 26 '19 at 18:01

2 Answers


As your code relies on unspecified behavior, extra explicit code seems better:

  • You want indentation like:

      tform.setOutputProperty(OutputKeys.INDENT, "yes");
      tform.setOutputProperty("{http://xml.apache.org/xslt}indent-amount", "4");
    
  • However, not for elements containing a CDATA section:

      String xml = result.getWriter().toString();
      // No indentation (whitespace) for elements with a CDATA section.
      xml = xml.replaceAll("(?s)>\\s*(<\\!\\[CDATA\\[.*?]]>)\\s*</", ">$1</");
    

The regex uses:

  • (?s) enables DOTALL mode, so that . matches any character, including newline characters.
  • .*? matches the shortest possible sequence, so the match does not run across "...]]>...]]>".
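To try the replacement in isolation, without running a Transformer, here is a minimal sketch. The class name CdataRegexFix and the method collapse are made up for illustration; the pattern is the one from this answer, precompiled and without the redundant escape before the exclamation mark.

```java
import java.util.regex.Pattern;

class CdataRegexFix {

    // (?s) = DOTALL, so .*? may also span line breaks inside the CDATA section.
    private static final Pattern CDATA_PADDING =
            Pattern.compile("(?s)>\\s*(<!\\[CDATA\\[.*?]]>)\\s*</");

    /** Removes whitespace between an element's tags and a CDATA section that is its only content. */
    static String collapse(String xml) {
        return CDATA_PADDING.matcher(xml).replaceAll(">$1</");
    }

    public static void main(String[] args) {
        // Hand-written stand-in for what an indenting Transformer might emit on Java 9 to 13.
        String indented = "<foo>\n    <bar>\n        <![CDATA[data]]>\n    </bar>\n</foo>";
        System.out.println(collapse(indented));
    }
}
```

The indentation of the surrounding elements stays as it is; only the whitespace directly around the CDATA section is dropped. Whitespace inside the CDATA section is never touched, which matters if the data may begin or end with whitespace.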

Alternatively, in a DOM tree (with CDATA sections preserved) you can retrieve all CDATA sections via XPath and remove their whitespace-only siblings through the parent element.
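A rough sketch of the DOM idea, with two caveats: it uses a plain recursive tree walk instead of XPath to find the CDATA sections, and it assumes the parser is left non-coalescing (the DocumentBuilderFactory default), so that CDATA sections show up as their own nodes. The names CdataSiblingCleaner, stripWhitespaceAroundCdata and textOf are invented for this example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class CdataSiblingCleaner {

    /** Recursively removes whitespace-only text nodes from every element that has a CDATA child. */
    static void stripWhitespaceAroundCdata(Node parent) {
        boolean hasCdata = false;
        for (Node child = parent.getFirstChild(); child != null; child = child.getNextSibling()) {
            if (child.getNodeType() == Node.CDATA_SECTION_NODE) {
                hasCdata = true;
            }
        }
        // Collect first and remove afterwards, so the sibling chain is not modified while iterating.
        List<Node> toRemove = new ArrayList<>();
        for (Node child = parent.getFirstChild(); child != null; child = child.getNextSibling()) {
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                stripWhitespaceAroundCdata(child);
            } else if (hasCdata && child.getNodeType() == Node.TEXT_NODE
                    && child.getNodeValue().trim().isEmpty()) {
                toRemove.add(child);
            }
        }
        for (Node node : toRemove) {
            parent.removeChild(node);
        }
    }

    /** Parses the XML, cleans it, and returns the text content of the first element named tagName. */
    static String textOf(String xml, String tagName) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            stripWhitespaceAroundCdata(doc.getDocumentElement());
            return doc.getElementsByTagName(tagName).item(0).getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String indented = "<foo>\n    <bar>\n        <![CDATA[data]]>\n    </bar>\n</foo>";
        System.out.println("data".equals(textOf(indented, "bar")));
    }
}
```

Note that this also drops whitespace-only text that was genuinely present next to a CDATA section in the original document, so it only fits when such whitespace carries no meaning. Non-blank text next to a CDATA section is left alone.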

Joop Eggen
  • Thanks! That's actually a pretty clean workaround. I am wondering what you mean by my code relying on unspecified behavior? – Rick Apr 29 '19 at 13:40
  • You are telling the transformer to pretty-print, i.e. to indent every element. The newer Java version does exactly that, indenting CDATA sections as well. So it looks like earlier versions made an exception for CDATA. In any case, one cannot find fault with it based on the specification. – Joop Eggen Apr 29 '19 at 13:44
  • Well, CDATA can be followed by 'normal' data. For example, this is valid: <![CDATA[data]]>foo. By adding additional whitespace, the contents of the XML change. So I do think this is an issue with the Transformer. – Rick Apr 29 '19 at 13:55
  • Then why INDENT=yes? One can restrict the allowed content in a DTD/XSD, but I do not think that plays a role here (or validation in general). Would INDENT="no" not suffice if you are reading into a DOM afterwards? – Joop Eggen Apr 29 '19 at 14:03
  • The issue with CDATA has been fixed in Java 14. I tested it with the EA version: openjdk version "14-ea" 2020-03-17 OpenJDK Runtime Environment (build 14-ea+6-171) – JuanMoreno Jul 28 '19 at 04:38
  • Verified that it indeed works with the ea version of OpenJDK 14. Thanks! – Rick Jul 29 '19 at 14:04

The solution from Joop Eggen is brilliant.

I just want to expand the solution a little bit.

xml = xml.replaceAll(">\\s*(<\\!\\[CDATA\\[(.|\\n|\\r\\n)*?]\\]>)\\s*</", ">$1</");

This regex also allows newlines inside the CDATA section: it tests for \n and for Windows-style \r\n.

XML Example:

<test>
   <![CDATA[com.foo.test]]>
</test>
<test>
 <![CDATA[1st Line   
2nd Line]]>
</test>
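For anyone choosing between the two regexes: as far as I can tell they agree for \n and \r\n, but Java's . also excludes a lone \r unless DOTALL is set, so the alternation variant would not match CDATA content containing a bare carriage return. A small comparison sketch (the class and method names are made up for the example):

```java
import java.util.regex.Pattern;

class CdataNewlineComparison {

    // Variant from this answer: newlines handled by explicit alternation.
    private static final Pattern ALTERNATION =
            Pattern.compile(">\\s*(<\\!\\[CDATA\\[(.|\\n|\\r\\n)*?]\\]>)\\s*</");

    // Variant with (?s): DOTALL lets . match every line terminator.
    private static final Pattern DOTALL =
            Pattern.compile("(?s)>\\s*(<\\!\\[CDATA\\[.*?]]>)\\s*</");

    static String withAlternation(String xml) {
        return ALTERNATION.matcher(xml).replaceAll(">$1</");
    }

    static String withDotAll(String xml) {
        return DOTALL.matcher(xml).replaceAll(">$1</");
    }

    public static void main(String[] args) {
        String crlf = "<test>\n <![CDATA[1st Line\r\n2nd Line]]>\n</test>";
        String loneCr = "<test>\n <![CDATA[1st Line\r2nd Line]]>\n</test>";
        // Same result for \r\n; compare the two variants for a lone \r.
        System.out.println(withAlternation(crlf).equals(withDotAll(crlf)));
        System.out.println(withAlternation(loneCr).equals(withDotAll(loneCr)));
    }
}
```

If the CDATA content can be large, the (?s) form may also be the safer choice, since it avoids a repeated alternation group.";
String crlfExpected = "<test><![CDATA[1st Line\r\n2nd Line]]></test>";
if (!crlfExpected.equals(CdataNewlineComparison.withAlternation(crlf))) throw new AssertionError("alternation should handle CRLF");
if (!crlfExpected.equals(CdataNewlineComparison.withDotAll(crlf))) throw new AssertionError("DOTALL should handle CRLF");
String loneCr = "<test>\n <![CDATA[1st Line\r2nd Line]]>\n</test>";
if (!loneCr.equals(CdataNewlineComparison.withAlternation(loneCr))) throw new AssertionError("alternation should not match a lone CR");
if (!"<test><![CDATA[1st Line\r2nd Line]]></test>".equals(CdataNewlineComparison.withDotAll(loneCr))) throw new AssertionError("DOTALL should handle a lone CR");
</test>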
Ralph
  • Joop Eggen mentions prefixing the regex with (?s) to make .* match newlines. While he did not actually include it in the regex in his answer, I think I used it to solve my problem at the time. – Rick Feb 26 '23 at 21:54
  • I have edited Joop Eggen's answer to include the (?s) in the regex, I'll leave it up to future readers to decide which regex they prefer to use :) – Rick Feb 27 '23 at 08:24