I have written an XSLT stylesheet to process a set of XML documents.
The stylesheet works fine, but the documents use different encodings. Currently I am using the output element as shown:

<xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/>

But this forces the output encoding to UTF-8, whereas I need the same encoding as the one declared in the input XML document.
How can I achieve this?

pankaj_ar
  • There is no pure XSLT way of achieving that as the XSLT processor does not know the encoding of the input. In which environment (e.g. .NET, Java) do you use XSLT? You will need to write code to have an XML parser determine the encoding of the input and then to manipulate the output encoding of the XSLT result serialization. – Martin Honnen Aug 14 '15 at 11:44
  • @MartinHonnen: that depends on what you mean by "pure". You can actually do it in both XSLT 2.0 and 3.0 without resorting to extension functions. With XSLT 1.0 (not specified here) it becomes a different story. – Abel Aug 17 '15 at 15:34
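
The host-code approach described in the first comment could look roughly like this in Java. This is a minimal sketch, assuming the JDK's built-in StAX and JAXP implementations; the class name `PreserveEncoding` and its two methods are hypothetical names chosen for illustration. It reads the encoding from the XML declaration with a streaming parser, then overrides the stylesheet's `xsl:output` encoding before running the transformation.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamReader;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper class, not part of any library.
class PreserveEncoding {

    /** Returns the encoding named in the XML declaration, or UTF-8 if none is declared. */
    static String declaredEncoding(byte[] xml) throws Exception {
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new ByteArrayInputStream(xml));
        try {
            // null means the declaration is missing or carries no encoding
            String enc = reader.getCharacterEncodingScheme();
            return enc != null ? enc : "UTF-8";
        } finally {
            reader.close();
        }
    }

    /** Runs the stylesheet and serializes the result in the input's declared encoding. */
    static byte[] transform(byte[] xml, byte[] xslt) throws Exception {
        Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new ByteArrayInputStream(xslt)));
        // An output property set here takes precedence over xsl:output in the stylesheet.
        t.setOutputProperty(OutputKeys.ENCODING, declaredEncoding(xml));
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        t.transform(new StreamSource(new ByteArrayInputStream(xml)), new StreamResult(out));
        return out.toByteArray();
    }
}
```

Because the encoding is set on the `Transformer`, this works with XSLT 1.0 processors too; the stylesheet itself stays unchanged.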

1 Answer

But this forces the output encoding to UTF-8, whereas I need the same encoding as the one declared in the input XML document.

From the point of view of XML, it makes no difference which encoding is used, as long as characters that the encoding cannot represent are escaped (which the XSLT processor does for you). Every XML processor is required to support UTF-8 and UTF-16, and US-ASCII, being a subset of UTF-8, can be read everywhere as well. The latter can be useful if your XML must be transferred using old techniques that would otherwise mangle the UTF encodings (some older FTP systems, for instance).
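
To illustrate the escaping, here is a small Java sketch, assuming the JDK's built-in JAXP identity transformer; the class name `AsciiSerialization` and the method `serialize` are hypothetical. It serializes the same document once as UTF-8 and once as US-ASCII. In the ASCII output a non-ASCII character such as `é` should appear as a numeric character reference rather than as a raw byte sequence, so both outputs parse to the same document.

```java
import java.io.ByteArrayOutputStream;
import java.io.StringReader;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper class, not part of any library.
class AsciiSerialization {

    /** Copies the document unchanged and serializes it in the requested encoding. */
    static String serialize(String xml, String encoding) throws Exception {
        // newTransformer() without a stylesheet yields an identity transformation
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        identity.setOutputProperty(OutputKeys.ENCODING, encoding);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        identity.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        // Decode the bytes with the same encoding so the result can be inspected as text
        return out.toString(encoding);
    }

    public static void main(String[] args) throws Exception {
        String doc = "<a>caf\u00e9</a>";
        System.out.println(serialize(doc, "UTF-8"));
        System.out.println(serialize(doc, "US-ASCII"));
    }
}
```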

That said, XSLT 2.0 and 3.0 let you do this dynamically with xsl:result-document, combined with a trick: load the XML as unparsed text and read the encoding from its XML declaration:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:f="http://example.com/functions">

    <xsl:template match="/">
        <xsl:result-document href="output-filename" encoding="{f:get-encoding(.)}">
            <!-- your code -->
        </xsl:result-document>
    </xsl:template>

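    <xsl:function name="f:get-encoding">
        <xsl:param name="node" />
        <!-- captures the encoding name from the XML declaration on the first line -->
        <xsl:variable name="regex">^.*encoding=['"]([a-zA-Z0-9-]+)["'].*$</xsl:variable>
        <!-- split on \r?\n so that a Windows line ending leaves no trailing CR, which ".*" would not match -->
        <xsl:value-of select="replace(tokenize(unparsed-text($node/base-uri()), '\r?\n')[1], $regex, '$1')"/>
    </xsl:function>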

 </xsl:stylesheet>

Or even on xsl:output in XSLT 3.0, using a static parameter, a static variable holding an inline function, and a shadow attribute. In short, just a few lines of code that show quite a few new concepts of XSLT, XPath and XDM:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

    <xsl:param name="input-url" static="yes" select="'yourinput.xml'" />

    <xsl:variable name="get-encoding" static="yes" select='
        let $regex := "^.*encoding=[&apos;""]([a-zA-Z0-9-]+)[&apos;""].*$"
        return function($n) {
            replace(tokenize(unparsed-text($n), "\r?\n")[1], $regex, "$1")
        }' />

    <!-- a shadow attribute is replaced with the actual attribute by the same name -->
    <xsl:output _encoding="{$get-encoding($input-url)}" />

    <xsl:template match="/">
        <!-- your code here -->
        <result />
    </xsl:template>

</xsl:stylesheet>

This code runs correctly with Exselt. My version of Saxon did not (yet) support it, because it does not allow the use of unparsed-text in a static expression, but I expect that will either come soon or turn out to be configurable. I have not tested other XSLT processors.

Abel
  • Does the regular expression based encoding parsing with `^.*encoding=['"]([a-zA-Z0-9-]+)["'].*$` work for you with Saxon? I tried your sample with Saxon 9.6 HE and Java 1.8 on Windows but it threw errors like `SESU0007: Invalid encoding name: ` on the `xsl:result-document` meaning the encoding name was not extracted. It looks like there is a trailing white space `0xD` character in the string that the `.*` does not match. – Martin Honnen Aug 17 '15 at 17:26