
I have the following XML:

<root>
   <child value="&#xFF;&#xEF;&#x99;&#xE0;"/>
</root>

When I do a transform I want the character hex code values to be preserved. So if my transform was just a simple xsl:copy and the input was the above XML, then the output should be identical to the input.
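(For reference, by "a simple xsl:copy" I mean the usual identity transform, roughly:)

<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <!-- Identity transform: copy every attribute and node through unchanged -->
    <xsl:template match="@*|node()">
        <xsl:copy>
            <xsl:apply-templates select="@*|node()"/>
        </xsl:copy>
    </xsl:template>
</xsl:stylesheet>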

I have read about the saxon:character-representation serialization attribute, but right now I'm using Saxon-HE 9.4, so it's not available to me, and I'm not even 100% sure it would do what I want.

I also read about use-character-maps. This seems to solve my problem, but I would rather not add a giant map to my transform to catch every possible character hex code.

<xsl:character-map name="characterMap">
    <xsl:output-character character="&#xA0;" string="&amp;#xA0;"/>
    <xsl:output-character character="&#xA1;" string="&amp;#xA1;"/>
    <!-- 93 more entries... &#xA2; through &#xFE; -->
    <xsl:output-character character="&#xFF;" string="&amp;#xFF;"/>
</xsl:character-map>
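
(To wire the map in, it would then be referenced from xsl:output, something like:)

<!-- Hook the map into serialization via use-character-maps -->
<xsl:output method="xml" use-character-maps="characterMap"/>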

Are there any other ways to preserve character hex codes?

ubiquibacon
  • Relates to/is a duplicate of http://stackoverflow.com/questions/5985615/preserving-entity-references-when-transforming-xml-with-xslt – Daniel Haley Oct 01 '13 at 23:20

1 Answer


The XSLT processor doesn't know how the character was represented in the input - that's all handled by the XML parser. So it can't reproduce the original.

If you want to output all non-ASCII characters using numeric character references, regardless of how they were represented in the input, try using xsl:output encoding="us-ascii".
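For example, an output declaration along these lines (the indent attribute is just illustrative):

<!-- With us-ascii encoding, the serializer must escape every non-ASCII
     character, typically as a numeric character reference such as &#xFF; -->
<xsl:output method="xml" encoding="us-ascii" indent="yes"/>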

If you really need to retain the original representation - and I can't see any defensible reason why anyone would need to do that - then try Andrew Welch's lexev, which converts all the entity and character references to processing instructions on the way in, and back to entity/character references on the way out.

Michael Kay
  • The reason I need the original representation is for diffing purposes and because the XML in question is being used as configuration for an embedded system which understands hex and little else. I'll give `encoding="us-ascii"` a try though. – ubiquibacon Oct 01 '13 at 22:53
  • For diffing XML, you should convert both documents to canonical form (or parse both and compare the trees) rather than trying to reproduce the lexical representation of one document in the other. Otherwise you'll get caught by lots of trivial differences such as order of attribute and namespace declarations. – Michael Kay Oct 02 '13 at 09:47
  • And if your "embedded system" is reading XML without using a proper XML parser, then I can only sympathize. – Michael Kay Oct 02 '13 at 09:48
  • Yes, I first sort and canonicalize the documents. The diffing process is automated, that is why a lexical representation is required. If I read a document in I need to be able to read the same document out with the same hash code. If the hash codes don't match then I know something went wrong. I don't know what XML parser our systems use. It is probably something our organization made to adhere to the regulations we have to follow. I'm just doing my part :-) – ubiquibacon Oct 02 '13 at 12:19
  • Using `encoding="us-ascii"` works, so long as my output method is XML, which it is not. Saxon chokes on illegal characters if the output method is HTML and encoding is us-ascii as described [here](http://stackoverflow.com/questions/4430823/allow-invalid-html-characters-in-xslt-transformation). Even though I am dealing with XML I use the HTML output method because the indentation is a bit more sane (no line wrapping). I'm not prepared to give up the better indentation of the HTML method so I guess I'll have to continue using the character map. – ubiquibacon Oct 09 '13 at 15:04