I'm attempting to clean up a batch of XML I've been provided. There are three situations I need to account for:

  1. some elements contain plain text, e.g. <item>some text</item>, which needs to be wrapped in another tag, e.g. <item><p>some text</p></item>
  2. some elements contain escaped XML, e.g. <item>&lt;p>some text&lt;/p></item>, which needs to be output without escaping: <item><p>some text</p></item>
  3. some elements contain escaped XML which also needs to be wrapped, e.g. <item>some &lt;em>text&lt;/em></item> needs to become <item><p>some <em>text</em></p></item>

<item> is used as a container in all three cases.

I can satisfy condition 1 relatively easily, and I can satisfy condition 2 with disable-output-escaping, but I can't satisfy condition 3 with this approach.

I think I can satisfy condition 2 (and possibly 3) if I can test whether the text within <item> is escaped, but a test using contains(., '&amp;lt;') doesn't match. So...

How can I test whether text within a node is escaped XML?

Phillip B Oldham
  • `contains(., '<')` works? – Max Toro Jul 09 '13 at 17:29
  • Which XSLT 1.0 processor exactly do you use? Have you checked whether an extension function is available or can be easily implemented that parses the content of the `item` elements into a tree fragment which could then be processed with normal templates as needed? – Martin Honnen Jul 09 '13 at 17:56
  • @MaxToro No, since that's essentially searching for `<` escaped for use within the test attribute of XSL. – Phillip B Oldham Jul 10 '13 at 07:12
  • @MartinHonnen updated the tags with relevant information, but essentially I'm using `libxml2` and `libxslt`. – Phillip B Oldham Jul 10 '13 at 07:58

1 Answer

Cases 1 and 3 both need wrapping, and disable-output-escaping won't hurt in case 1, so I think you can treat them together with the same template.

I don't see a reliable way, using pure XSLT 1.0, to check whether an element's content contains escaped element markup, so I simply tried

<xsl:stylesheet
  version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<xsl:template match="/root">
  <html>
    <body>
      <xsl:apply-templates/>
    </body>
  </html>
</xsl:template>

<xsl:template match="@* | node()">
  <xsl:copy>
    <xsl:apply-templates select="@* | node()"/>
  </xsl:copy>
</xsl:template>

<!-- Cases 1 and 3: text that does not both start with '<' and end with '>' is wrapped in <p>. -->
<xsl:template match="item[not(*) and not(starts-with(., '&lt;') and substring(., string-length(.)) = '&gt;')]">
  <xsl:copy>
    <p>
      <xsl:value-of select="." disable-output-escaping="yes"/>
    </p>
  </xsl:copy>
</xsl:template>

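<!-- Case 2: text that starts with '<' and ends with '>' is output unescaped, without wrapping. -->
<xsl:template match="item[not(*)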
                          and starts-with(., '&lt;') and substring(., string-length(.)) = '&gt;']">
  <xsl:copy>
    <xsl:value-of select="." disable-output-escaping="yes"/>
  </xsl:copy>
</xsl:template>

</xsl:stylesheet>

which transforms

<root>
<item>some text</item>
<item>&lt;p>some text&lt;/p></item>
<item>some &lt;em>text&lt;/em></item>
</root>

into

<html><body>
<item><p>some text</p></item>
<item><p>some text</p></item>
<item><p>some <em>text</em></p></item>
</body></html>

Obviously, it would also transform malformed content such as <item>&lt;...></item> into <item><...></item>, which is not well-formed. You could add more string checks, but without a complete parser for the escaped XML fragment it is always possible to construct input samples where the string checks fail.
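If the check can be done outside XSLT, one option is to let a real parser decide whether the text is well-formed markup. The following is only a minimal sketch of that idea, not a definitive implementation: it assumes Python's standard-library xml.etree.ElementTree rather than lxml (to stay self-contained), and the names parse_fragment and clean_item are hypothetical. It also assumes that "case 2" means the content is exactly one <p> element.

```python
import xml.etree.ElementTree as ET


def parse_fragment(text):
    """Parse text as an XML fragment; return a dummy wrapper element,
    or None if the text is not well-formed."""
    try:
        # A fragment may have several top-level nodes, so give it a temporary root.
        return ET.fromstring("<wrapper>" + text + "</wrapper>")
    except ET.ParseError:
        return None


def clean_item(item):
    """Rewrite one <item> in place so its content ends up inside a single <p>."""
    text = item.text or ""
    if len(item) or not text.strip():
        return  # already has child elements, or is empty: leave untouched
    wrapper = parse_fragment(text)
    if wrapper is None:
        # Not well-formed (e.g. "<...>"): keep it as literal text.
        wrapper = ET.Element("wrapper")
        wrapper.text = text
    children = list(wrapper)
    # Case 2 (assumption): the content is exactly one <p> with no text around it.
    is_single_p = (
        len(children) == 1
        and children[0].tag == "p"
        and not (wrapper.text or "").strip()
        and not (children[0].tail or "").strip()
    )
    item.text = None
    if is_single_p:
        item.append(children[0])
    else:
        # Cases 1 and 3: wrap the text and any inline elements in a new <p>.
        p = ET.SubElement(item, "p")
        p.text = wrapper.text
        p.extend(children)


source = """<root>
<item>some text</item>
<item>&lt;p>some text&lt;/p></item>
<item>some &lt;em>text&lt;/em></item>
</root>"""

root = ET.fromstring(source)
for item in list(root.iter("item")):
    clean_item(item)
print(ET.tostring(root, encoding="unicode"))
```

Because the fragment is actually parsed, content that only looks like markup (such as <...> or a stray ampersand) falls back to being treated as plain text and is re-escaped on output, instead of producing a broken document.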

Martin Honnen
  • Thanks Martin. I've come to the same conclusion that I'll need to parse the escaped content to ensure it is well-formed, so I'm going to add a callback to the language I'm using (Python+lxml) to handle this. – Phillip B Oldham Jul 10 '13 at 12:26