Note: Actual question at the very end.
I'm thoroughly confused by what I see when trying to handle newlines/line breaks in a source XML file via XSLT, comparing MSXML (IE11) with libxml2 / Firefox.
Essentially, both libxml2 and Firefox implement XML End-of-Line Handling as specified:
XML parsed entities are often stored in computer files which, for editing convenience, are organized into lines. These lines are typically separated by some combination of the characters CARRIAGE RETURN (#xD) and LINE FEED (#xA).
To simplify the tasks of applications, the XML processor MUST behave as if it normalized all line breaks in external parsed entities (including the document entity) on input, before parsing, by translating both the two-character sequence #xD #xA and any #xD that is not followed by #xA to a single #xA character.
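To make the quoted rule concrete, here is a minimal Python sketch of the translation it describes. The function name `normalize_eol` is made up for illustration and is not part of any parser's API; a conforming parser does the equivalent internally before parsing.

```python
def normalize_eol(raw: str) -> str:
    """Apply XML 1.0 end-of-line normalization to raw entity text."""
    # Translate the two-character sequence CR LF first, so that the
    # CRs remaining afterwards are exactly those not followed by LF.
    return raw.replace("\r\n", "\n").replace("\r", "\n")


# Sample input mixing CRLF, a lone CR, and a lone LF.
sample = "We would like:\r\n* Free icecream\r* Free beer\n* Free linebreaks"
normalized = normalize_eol(sample)
print(repr(normalized))
```

After this step no #xD is left in the text, so an XPath test for a CR in parsed character data should only succeed if the CR was written as a character reference.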
Now, it seems I can easily establish that IE11's MSXML does not implement this properly.
Given an XML file
<?xml version="1.0" encoding="utf-8"?>
<?xml-stylesheet type="text/xsl" href="test.xsl"?>
<root>
<text>We would like:
* Free icecream
* Free beer
* Free linebreaks</text>
</root>
that contains Windows CRLF line endings in a text node, and using this XSL:
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="html" encoding="UTF-8" indent="yes"/>
<xsl:template match="/">
<html>
<body>
<xsl:if test="contains(//text, '&#xD;&#xA;')">
<p>The text contains CR+LF (0x0D+0x0A).</p>
</xsl:if>
<xsl:if test="contains(//text, '&#xD;')">
<p>The text contains CR (0x0D).</p>
</xsl:if>
<xsl:if test="contains(//text, '&#xA;')">
<p>The text contains LF (0x0A).</p>
</xsl:if>
</body>
</html>
</xsl:template>
</xsl:stylesheet>
MSXML will print:
The text contains CR+LF (0x0D+0x0A).
The text contains CR (0x0D).
The text contains LF (0x0A).
whereas both Firefox and libxml2 (xsltproc.exe) will only print:
The text contains LF (0x0A).
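As a point of comparison, the same check can be sketched outside the browser with Python's standard `xml.etree.ElementTree`, which is backed by the Expat parser (so this says nothing about MSXML or libxml2 themselves). The shortened document and the variable names are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

# Document entity with literal CRLF line endings inside the text node.
doc = b"<root><text>We would like:\r\n* Free icecream\r\n* Free beer</text></root>"
text = ET.fromstring(doc).find("text").text

has_cr = "\r" in text
has_lf = "\n" in text

# A CR written as a character reference is not a literal line break in the
# entity, so it is not subject to end-of-line normalization.
escaped = ET.fromstring(b"<text>a&#xD;\nb</text>").text
escaped_has_cr = "\r" in escaped

print(has_cr, has_lf, escaped_has_cr)
```

A parser that follows the end-of-line rule should report no CR for the literal CRLF case and keep the CR only where it was spelled as `&#xD;`, which is also why the `&#xD;` in the stylesheet's string literals survives parsing.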
So far, so bad. The real confusion starts when I use substring-before and substring-after to isolate the newlines.
Adding this xsl:
<xsl:value-of select="'before-xA:{'"/>
<xsl:value-of select="substring-before(//text, '&#xA;')" />
<xsl:value-of select="'}='"/>
<xsl:value-of select="contains(substring-before(//text, '&#xA;'), '&#xD;')" />
<xsl:value-of select="' / after-xD:{'"/>
<xsl:value-of select="substring-after(//text, '&#xD;')" />
<xsl:value-of select="'}='"/>
<xsl:value-of select="contains(substring(substring-after(//text, '&#xD;'), 1, 2), '&#xA;')" />
IE11 prints:
before-xA:{We would like:}=false / after-xD:{* Free icecream * Free beer * Free linebreaks}=false
That is, even though MSXML sees both the CR and the LF in the source XML, after applying substring-before / substring-after the resulting substring contains neither, although it should as far as I can tell.
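To spell out that expectation, here is a small Python sketch that mimics the XPath 1.0 definitions of substring-before and substring-after on a string that still contains CR+LF. The helper functions are stand-ins written for this illustration, not MSXML or libxml2 code, and the input string is the assumed un-normalized value of the text node.

```python
def substring_before(s: str, sep: str) -> str:
    """XPath 1.0 substring-before: text before the first sep, else ''."""
    if not sep:
        return ""
    head, found, _ = s.partition(sep)
    return head if found else ""


def substring_after(s: str, sep: str) -> str:
    """XPath 1.0 substring-after: text after the first sep, else ''."""
    if not sep:
        return s
    _, found, tail = s.partition(sep)
    return tail if found else ""


# Hypothetical text node value if the parser kept CRLF unnormalized.
text = "We would like:\r\n* Free icecream\r\n* Free beer\r\n* Free linebreaks"

before_lf = substring_before(text, "\n")
after_cr = substring_after(text, "\r")

# Same checks as the two contains(...) calls in the stylesheet;
# after_cr[:2] corresponds to substring(..., 1, 2).
cr_in_before = "\r" in before_lf
lf_in_after = "\n" in after_cr[:2]
print(cr_in_before, lf_in_after)
```

Under these definitions both checks come out true for a CR+LF string, which is why the two `false` results from IE11 look inconsistent with the earlier contains() tests.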
So, what's going on here? Have I missed something about the substring-* functions? Is MSXML inconsistent?