
I'm encountering an issue with emojis when generating HTML output with an XSL transformation under certain circumstances.

For instance, I tested the following XSL with different transformation engines:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="html" encoding="UTF-8"/>
  <xsl:template match="/">
    <xsl:text disable-output-escaping="yes">&lt;!doctype html&gt;</xsl:text>
    <html>
      <head>
        <meta charset="UTF-8"/>
        <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
      </head>
      <body>
        <textarea></textarea><br/>
        <input type="text" value=""/>
      </body>
    </html>
  </xsl:template>
</xsl:stylesheet>

I tested with exactly the same code (based on JAXP) for all transformers. I only changed the transformer factory class name.

Saxon gives the correct result.

The JDK's internal repackaged transformer based on Xalan (com.sun.org.apache.xalan.internal.xsltc.trax.TransformerFactoryImpl) is correct when the emoji is placed as text in the textarea body, but generates a wrong result for the <input> field: the emoji appears to be wrongly encoded when placed in the value attribute.

Xalan 2.7.2 gives an even worse result.

For various reasons (mainly licensing), I would prefer to use the Xalan transformer. Any idea how I can make Xalan handle emojis correctly?

EDIT

The transformation is performed with the following code:

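import java.io.OutputStream;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.dom4j.io.DocumentSource;

// Select the transformer implementation explicitly; only this class name changes between tests.
TransformerFactory factory = TransformerFactory.newInstance(
        "com.sun.org.apache.xalan.internal.xsltc.trax.TransformerFactoryImpl",
        null);
Transformer transformer = factory.newTransformer(new StreamSource(xsl));
DocumentSource domSource = new DocumentSource(doc);
OutputStream stream = response.getOutputStream();

// Serialize the result straight into the servlet response stream.
transformer.transform(domSource, new StreamResult(stream));

stream.flush();
stream.close();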

where doc is a dom4j document, xsl is the InputStream containing the stylesheet above, and response is the HttpServletResponse that receives the transformation result.

Heiko Theißen
morbac
  • If you have `<xsl:output method="html" encoding="UTF-8"/>` in your stylesheet, why do you also hard code `<meta charset="UTF-8"/>` and `<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>` in the HTML the stylesheet creates? With proper use of the Transformer, it should create the right `meta` when serializing to HTML, based on the `xsl:output` directive. And what does your Java JAXP code look like exactly, does it use a StreamResult to create an HTML file? – Martin Honnen Oct 28 '22 at 08:44
  • Hi, I added the code of the JAXP transformation. I agree the `meta` declarations are messy, but IMHO the issue is not linked to them, since the behaviour obviously depends on the transformer, not on the stylesheet or the browser. – morbac Oct 28 '22 at 09:05
  • As you are sending results directly to a browser, I suppose, with that servlet response, what does the network console show as the content type and perhaps charset for the response in the case of the messed up Xalan rendering? – Martin Honnen Oct 28 '22 at 09:28
  • I tried removing the `` from the XSL and adding both doctype-system versions in xsl:output, but it does not change anything. Wireshark shows me that the transformation returns `
    `
    – morbac Oct 28 '22 at 09:33
  • It might be the known issue https://issues.apache.org/jira/browse/XALANJ-2419 though I can't tell for sure and I am not sure why my test with Xalan 2.7.1 at xsltransform.net seemed to work out. – Martin Honnen Oct 28 '22 at 09:42
  • I guess the xsltransform.net test might have worked out as there the result is not serialized to a stream and encoded as UTF-8 but rather passed around as an UTF-16 encoded string. – Martin Honnen Oct 28 '22 at 09:48
  • Xalan 2.7.1 gives me the same wrong result as 2.7.2, but Xalan 2.6.0 gives me the correct result! I have already had encoding trouble with Xalan 2.7.*. That was with iso-8859-1, and after a lot of digging in the Xalan source I discovered that the xsl:output encoding had to be declared as "iso8859-1" instead of "iso-8859-1". I'm not sure this is the same issue, since I'm using UTF-8 here, but I'll dive into the Xalan source again to find out how attributes are handled. Thanks a lot for your help. I'll report back if I find something. – morbac Oct 28 '22 at 10:05

3 Answers


I have tried

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="html" encoding="UTF-8" doctype-system="about:legacy-compat"/>
  <xsl:template match="/">
    <html>
      <head>
      </head>
      <body>
        <textarea></textarea><br/>
        <input type="text" value=""/>
      </body>
    </html>
  </xsl:template>
</xsl:stylesheet>

with Xalan 2.7.1 at http://xsltransform.net/ and both thumbs seem to be shown fine, i.e. the serialized HTML is

<!DOCTYPE HTML SYSTEM "about:legacy-compat">
<html>
<head>
<META http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
<textarea></textarea>
<br>
<input value="" type="text">
</body>
</html>

which renders both thumbs correctly.

Martin Honnen

I finally decided to fork the xalan-java project and patch the serializer myself. With the patched build, emojis come out correctly in both attributes and text with UTF-8 XSL output.

The patch is in commit https://github.com/morbac/xalan-java/commit/a685171e1b621e9b63c8507f467a395fd1fc96a4. It fixes the issue for both input and textarea. A jar with the fixed classes is also available.

morbac

After a day of research, I have come to the conclusion that this is a bug in the Xalan HTML serializer (line 1440 and following) in its handling of surrogate characters (chars between \ud800 and \udbff, i.e. the high surrogates that start a UTF-16 surrogate pair). As mentioned in the comments, Xalan 2.6.0 transforms correctly, but Xalan 2.7.* does not.
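To make the surrogate point concrete, here is a minimal Java sketch of how one emoji is represented. It assumes the emoji in question is the thumbs-up sign U+1F44D; the class name `SurrogateDemo` and the helper `utf8Hex` are made up for illustration and have nothing to do with Xalan's code. It only shows why a serializer has to treat the two chars as one unit: they form a single code point that becomes one 4-byte UTF-8 sequence.

```java
import java.nio.charset.StandardCharsets;

public class SurrogateDemo {
    // Thumbs-up sign U+1F44D, written as its UTF-16 surrogate pair (assumed example emoji).
    public static final String THUMBS_UP = "\uD83D\uDC4D";

    /** Returns the UTF-8 bytes of a string as space-separated upper-case hex. */
    public static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        char high = THUMBS_UP.charAt(0);
        char low = THUMBS_UP.charAt(1);

        System.out.println(THUMBS_UP.length());                              // 2 chars
        System.out.println(THUMBS_UP.codePointCount(0, THUMBS_UP.length())); // 1 code point
        System.out.println(Character.isHighSurrogate(high));                 // true
        System.out.println(Character.isLowSurrogate(low));                   // true

        // Both halves must be combined before the character can be encoded.
        int codePoint = Character.toCodePoint(high, low);
        System.out.println(Integer.toHexString(codePoint).toUpperCase());    // 1F44D
        System.out.println(utf8Hex(THUMBS_UP));                              // F0 9F 91 8D
    }
}
```

A serializer that handles the high and low surrogate independently cannot produce that 4-byte sequence, which is consistent with the broken output described above.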

Martin Honnen mentioned XALANJ-2419. I also found other tickets related to this issue (XALANJ-2617, https://github.com/apache/xalan-j/pull/4, etc.) and tried to implement some of the fixes. For instance, one suggested version does fix the issue for my <input> field, but the issue with textarea remains.

I'll try to fork Xalan and fix the issue for both attributes and text. Meanwhile, the easiest way to work around the issue is to replace the "UTF-8" encoding with "UTF-16" in xsl:output. This fixes both issues.
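If editing the stylesheet is not convenient, the same encoding switch can probably be applied from the JAXP side, since `Transformer.setOutputProperty` overrides what `xsl:output` declares. The following is only a sketch under that assumption: the class name `Utf16Workaround`, the method `transformAsUtf16`, and the inline stylesheet are hypothetical, and it uses the default `TransformerFactory` rather than a specific Xalan build, so it has not been verified against Xalan 2.7.*.

```java
import java.io.ByteArrayOutputStream;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class Utf16Workaround {

    /** Runs the stylesheet and serializes the result as UTF-16, whatever xsl:output says. */
    public static byte[] transformAsUtf16(String xsl, String xml) throws TransformerException {
        TransformerFactory factory = TransformerFactory.newInstance();
        Transformer transformer = factory.newTransformer(new StreamSource(new StringReader(xsl)));

        // Output properties set here take precedence over the stylesheet's xsl:output.
        transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-16");

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toByteArray();
    }

    public static void main(String[] args) throws TransformerException {
        // Hypothetical stylesheet: a thumbs-up emoji (U+1F44D) in an attribute and in text.
        String xsl =
              "<xsl:stylesheet version=\"1.0\" xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
            + "<xsl:output method=\"html\" encoding=\"UTF-8\"/>"
            + "<xsl:template match=\"/\">"
            + "<html><body>"
            + "<textarea>\uD83D\uDC4D</textarea>"
            + "<input type=\"text\" value=\"\uD83D\uDC4D\"/>"
            + "</body></html>"
            + "</xsl:template>"
            + "</xsl:stylesheet>";

        byte[] bytes = transformAsUtf16(xsl, "<doc/>");

        // Decode with the same charset to inspect what was actually written.
        System.out.println(new String(bytes, StandardCharsets.UTF_16));
    }
}
```

When the result goes to a servlet response, as in the question, the response charset has to match, for example by calling `response.setCharacterEncoding("UTF-16")` before obtaining the output stream. Otherwise the browser will decode the bytes with the wrong charset.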


morbac