I have a long list of XML files which may have different encodings. I would like to go through all the files and print their encodings. Printing the encoding attribute in the XML header is just a first step. (The next step, once I find out how to access the encoding attribute, would be to use it to test whether this is the expected encoding.)

This is what the input XML files may look like:

<?xml version="1.0" encoding="iso-8859-1"?>
<Resource Name="text1" Language="de">
    <Text>
    </Text>
</Resource>


<?xml version="1.0" encoding="utf-8"?>
<Resource Name="file2" Language="ko">
    <Text>
    </Text>
</Resource>

Here is the XSLT, cut down to a minimum, but still without any success. I think I fail to match the XML header by writing it this way. But how can I match something in the XML header?

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="html"/>

    <xsl:template match="/">
     <html>
        <body>   
            <xsl:value-of select="@encoding"/>
        </body>
     </html>
    </xsl:template>
</xsl:stylesheet>
Gunilla

1 Answer

The encoding pseudo-attribute of the XML prolog is no longer relevant once you have read the XML with an XML-capable processor, unless the encoding declared in the prolog does not match the encoding actually used and the file contains characters that cannot be represented in that encoding.

The only way I know of to get at the encoding with XSLT is to read the file as raw text with the functions unparsed-text (XSLT 2.0) or unparsed-text-lines (XSLT 3.0) and then use regular expressions (replace or xsl:analyze-string, both XSLT 2.0) to parse the prolog by hand.
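A minimal sketch of that approach, assuming an XSLT 2.0 processor such as Saxon (the file name `text1.xml` is just an example; note that unparsed-text itself has to guess or be told the file's encoding, so it may fail on a badly mislabelled file):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only: reads the file as raw text and extracts the value of
     the encoding pseudo-attribute from the prolog with a regex. -->
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="text"/>

    <xsl:template match="/">
        <xsl:variable name="raw" select="unparsed-text('text1.xml')"/>
        <xsl:analyze-string select="$raw"
            regex="encoding=[&quot;']([^&quot;']+)[&quot;']">
            <xsl:matching-substring>
                <xsl:value-of select="regex-group(1)"/>
            </xsl:matching-substring>
        </xsl:analyze-string>
    </xsl:template>
</xsl:stylesheet>
```

The regex matches every `encoding="..."` occurrence in the text; in practice the first match is the one from the prolog, so for robustness you could restrict the search to the first line of the file.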

Since XSLT (and most XML-capable tools and processors) sees XML not as a text file but as a set of nodes with streams of characters, not streams of bytes, the requirement to read the encoding is hardly ever needed.

If you want to know the encoding for functions like document, doc or unparsed-text: those functions are defined such that they will read the encoding from the prolog and use it. In XSLT 3.0 you can use try/catch to find out whether parsing a file succeeded. In XSLT 2.0 you have doc-available, which will return false if the declared encoding does not match the bytes used.
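For the XSLT 2.0 route, a sketch using doc-available (the file names are examples; in practice the list of files would come from a collection or an input document):

```xml
<!-- Sketch only: iterate over a sequence of URIs and report which
     files fail to parse, e.g. because the declared encoding does not
     match the actual bytes. -->
<xsl:template name="check-files">
    <xsl:for-each select="('text1.xml', 'file2.xml')">
        <xsl:choose>
            <xsl:when test="doc-available(.)">
                <xsl:value-of select="concat(., ': parses OK&#10;')"/>
            </xsl:when>
            <xsl:otherwise>
                <xsl:value-of select="concat(., ': cannot be parsed&#10;')"/>
            </xsl:otherwise>
        </xsl:choose>
    </xsl:for-each>
</xsl:template>
```

This only tells you whether a file is readable as XML, not which encoding it has; but combined with the unparsed-text approach above you can both report the declared encoding and flag files where the declaration is wrong.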

Abel
  • Oh, I see... I suspected the XML prolog was special in some way. I am using the document function to process a long list of files, in many languages. The "normal" encoding for most of the files in the list should be "iso-8859-1", but sometimes a file in, for instance, Polish or Romanian also gets this encoding. Such a file then needs an extra check because some characters may have been ruined in the process. So for these files it is preferred that they remain in UTF-8 as long as possible. – Gunilla Sep 04 '15 at 14:34
  • @Gunilla: do you mean to say that the files are encoded as UTF-8, but have a prolog with a different encoding? The best you can do is pre-process your list of files and simply remove the prolog (just remove the first line), as the default encoding when the prolog is absent is UTF-8 or UTF-16 (the difference can be determined automatically from the BOM). – Abel Sep 04 '15 at 15:10
  • Good idea! If all files default to UTF-8 when the prolog is absent, no characters will ever be ruined. And for those files/languages that I know will go to the target system that requires iso-8859-1 encoding, a prolog with encoding iso-8859-1 can be added using the transform. I think this will simplify the maintenance of these files and reduce the risk of errors. Thank you very much! – Gunilla Sep 07 '15 at 12:24