Mixed Content and String Manipulation Clean Up

Question

I am in the middle of a very painful process of transforming a Word-based document into XML. I have run into the following problem:

<?xml version="1.0" encoding="UTF-8"?>
<root>
    <p>
        <element>This one is taken care of.</element> Some more text. „<hi rend="italics">Is this a
            quote</hi>?” (Source). </p>

    <p>
        <element>This one is taken care of.</element> Some more text. „<hi rend="italics">This is a
            quote</hi>” (Source). </p>

    <p>
        <element>This one is taken care of.</element> Some more text. „<hi rend="italics">This is
            definitely a quote</hi>!” (Source). </p>

    <p>
        <element>This one is taken care of.</element> Some more text.„<hi rend="italics">This is a
            first quote</hi>” (Source). „<hi rend="italics">Sometimes there is a second quote as
            well</hi>!?” (Source). </p>

</root>

The <p> nodes have mixed content. I took care of <element> in a previous iteration, but now the problem is with the quotes and sources, which appear partly inside <hi rend="italics"/> and partly as text nodes.

How can I use XSLT 2.0 to:

match all <hi rend="italics"> nodes that are immediately preceded by a text node whose last character is "„"?
output the contents of <hi rend="italics"> as <quote>...</quote>, drop the quotation marks ("„" and "”"), but include within <quote/> any question and exclamation marks that immediately follow <hi rend="italics">?
convert the text between "(" and ")" that follows the <hi rend="italics"> node into <source>...</source>, without the brackets?
keep the final full stop?

In other words, my output should look like this:

<root>
<p>
<element>This one is taken care of.</element> Some more text. <quote>Is this a quote?</quote> <source>Source</source>.
</p>

<p>
<element>This one is taken care of.</element> Some more text. <quote>This is a quote</quote> <source>Source</source>.
</p>

<p>
<element>This one is taken care of.</element> Some more text. <quote>This is definitely a quote!</quote> <source>Source</source>.
</p>

<p>
<element>This one is taken care of.</element> Some more text. <quote>This is a first quote</quote> <source>Source</source>. <quote>Sometimes there is a second quote as well!?</quote> <source>Source</source>. 
</p>

</root>

I have never dealt with mixed content and string manipulations like this and the whole thing is really throwing me off. I will be incredibly grateful for your tips.

The question marks and exclamation marks in your input document are outside of the `hi` element, but in the expected output they are inside the `quote` element. This seems odd. Is this right? Please confirm. — Sean B. Durkin, Oct 02 '12 at 13:06

Dimitre Novatchev · Answer 1 · 2012-10-02T14:29:27.590

This transformation:

<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output omit-xml-declaration="yes"/>

 <xsl:template match="node()|@*">
     <xsl:copy>
       <xsl:apply-templates select="node()|@*"/>
     </xsl:copy>
 </xsl:template>

 <xsl:template match=
  "hi[@rend='italics'
     and
      preceding-sibling::node()[1][self::text()[ends-with(., '„')]]
      ]">

  <quote>
    <xsl:value-of select=
     "concat(.,
             if(matches(following-sibling::text()[1], '^[?!]+'))
              then replace(following-sibling::text()[1], '^([?!]+).*$', '$1')
              else()
             )
      "/>
  </quote>
 </xsl:template>

 <xsl:template match="text()[true()]">
  <xsl:variable name="vThis" select="."/>
  <xsl:variable name="vThis2" select="translate($vThis, '„”?!', '')"/>

  <xsl:value-of select="substring-before(concat($vThis2, '('), '(')"/>
  <xsl:if test="contains($vThis2, '(')">
   <source>
    <xsl:value-of select=
      "substring-before(substring-after($vThis2, '('), ')')"/>
   </source>
   <xsl:value-of select="substring-after($vThis2, ')')"/>
  </xsl:if>
 </xsl:template>
</xsl:stylesheet>

when applied on the provided XML document:

<root>
        <p>
            <element>This one is taken care of.</element> Some more text. „<hi rend="italics">Is this a
                quote</hi>?” (Source). </p>

        <p>
            <element>This one is taken care of.</element> Some more text. „<hi rend="italics">This is a
                quote</hi>” (Source). </p>

        <p>
            <element>This one is taken care of.</element> Some more text. „<hi rend="italics">This is
                definitely a quote</hi>!” (Source). </p>

        <p>
            <element>This one is taken care of.</element> Some more text.„<hi rend="italics">This is a
                first quote</hi>” (Source). „<hi rend="italics">Sometimes there is a second quote as
                well</hi>!?” (Source). </p>

</root>

produces the wanted, correct result:

<root>
        <p>
            <element>This one is taken care of.</element> Some more text. <quote>Is this a
                quote?</quote> <source>Source</source>. </p>

        <p>
            <element>This one is taken care of.</element> Some more text. <quote>This is a
                quote</quote> <source>Source</source>. </p>

        <p>
            <element>This one is taken care of.</element> Some more text. <quote>This is
                definitely a quote!</quote> <source>Source</source>. </p>

        <p>
            <element>This one is taken care of.</element> Some more text.<quote>This is a
                first quote</quote> <source>Source</source>. <quote>Sometimes there is a second quote as
                well!?</quote> <source>Source</source>. </p>

</root>

+1. Given that this is XSLT 2.0, I would prefer to use `xsl:analyze-string` for replacing `(Source)` with `Source` in the text nodes. I also wonder whether all quote characters and punctuation marks in text nodes can simply be removed, as you do, or whether they should be removed only when they occur right before, right after, or between those `hi` elements. — Martin Honnen, Oct 02 '12 at 13:36
On Saxon, this throws a lot of recoverable errors: ambiguous rule match for text(). — Sean B. Durkin, Oct 02 '12 at 14:14
@MartinHonnen, Yes, `xsl:analyze-string` is nice and I would use it if the problem was more complicated. As for the location of the characters to be removed, this is unclear from the current question -- can easily be done in any case. My purpose was to come up with a short solution -- which, I think, I did. — Dimitre Novatchev, Oct 02 '12 at 14:31
@SeanB.Durkin, Thanks for noticing this -- fixed now. Why do I always think that `text()` is more specific than `node()`? — Dimitre Novatchev, Oct 02 '12 at 14:33
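
For context on the ambiguity discussed in these comments: in XSLT 2.0 the bare patterns `node()` and `text()` both get the default priority -0.5, so a text node matches both rules equally well. The answer's `text()[true()]` works because a pattern with a predicate gets the default priority 0.5. A minimal sketch of an alternative, assuming an explicit `priority` attribute is acceptable (the `translate` body is only a placeholder for illustration, not the answer's full logic):

```xml
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Identity rule: node() and @* each have the default priority -0.5 -->
  <xsl:template match="node()|@*">
    <xsl:copy>
      <xsl:apply-templates select="node()|@*"/>
    </xsl:copy>
  </xsl:template>

  <!-- A bare text() pattern would also have priority -0.5 and clash with
       node(); the explicit priority makes this rule win for text nodes.
       The body is a placeholder that only strips the quotation marks. -->
  <xsl:template match="text()" priority="1">
    <xsl:value-of select="translate(., '„”', '')"/>
  </xsl:template>

</xsl:stylesheet>
```

Either approach should silence the recoverable "ambiguous rule match" errors; the explicit attribute just states the intent more directly than a dummy predicate.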

Sean B. Durkin · Accepted Answer · 2012-10-02T16:11:26.903

Here is an alternative solution. It allows for a more narrative-style input document (quotes within quotes, multiple (Source) fragments within one text node, and '„' as data when not followed by a hi element).

<xsl:stylesheet version="2.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:so="http://stackoverflow.com/questions/12690177"
  xmlns:xs="http://www.w3.org/2001/XMLSchema"
  exclude-result-prefixes="xsl xs so">
<xsl:output omit-xml-declaration="yes" indent="yes" />
<xsl:strip-space elements="*" />  

<xsl:template match="@*|comment()|processing-instruction()">
  <xsl:copy />
</xsl:template>

<xsl:template match="*">
  <xsl:copy>
    <xsl:apply-templates select="@*|node()" />
  </xsl:copy>
</xsl:template>

<xsl:function name="so:clip-start" as="xs:string">
  <xsl:param name="in-text" as="xs:string" />
  <xsl:value-of select="substring($in-text,1,string-length($in-text)-1)" />
</xsl:function>

<xsl:function name="so:clip-end" as="xs:string">
  <xsl:param name="in-text" as="xs:string" />
  <xsl:value-of select="substring-after($in-text,'”')" />
</xsl:function>

<xsl:function name="so:matches-start" as="xs:boolean">
  <xsl:param name="text-node" as="text()" />
  <xsl:value-of select="$text-node/following-sibling::node()/self::hi[@rend='italics'] and
                        ends-with($text-node, '„')" />
</xsl:function>

<xsl:template match="text()[so:matches-start(.)]"    priority="2">
  <xsl:call-template name="parse-text">
   <xsl:with-param name="text" select="so:clip-start(.)" />
  </xsl:call-template>
</xsl:template>

<xsl:function name="so:matches-end" as="xs:boolean">
  <xsl:param name="text-node" as="text()" />
  <xsl:value-of select="$text-node/preceding-sibling::node()/self::hi[@rend='italics'] and
                        matches($text-node,'^[!?]*”')" />
</xsl:function>

<xsl:template match="text()[so:matches-end(.)]"   priority="2">
  <xsl:call-template name="parse-text">
   <xsl:with-param name="text" select="so:clip-end(.)" />
  </xsl:call-template>
</xsl:template>

<xsl:template match="text()[so:matches-start(.)][so:matches-end(.)]" priority="3">
  <xsl:call-template name="parse-text">
   <xsl:with-param name="text" select="so:clip-end(so:clip-start(.))" />
  </xsl:call-template>
</xsl:template>

<xsl:template match="text()" name="parse-text" priority="1">
  <xsl:param name="text" select="." />
  <xsl:analyze-string select="$text" regex="\(([^)]*)\)">
    <xsl:matching-substring>
      <source>
        <xsl:value-of select="regex-group(1)" />
      </source>
    </xsl:matching-substring>
    <xsl:non-matching-substring>
      <xsl:value-of select="." />
    </xsl:non-matching-substring>
  </xsl:analyze-string>
</xsl:template>

<xsl:template match="hi[@rend='italics']">
  <quote>
    <xsl:apply-templates select="(@* except @rend) | node()" />
    <xsl:for-each select="following-sibling::node()[1]/self::text()[matches(.,'^[!?]')]">
      <xsl:value-of select="replace(., '^([!?]+).*$', '$1')" />
    </xsl:for-each>   
  </quote>
</xsl:template>

</xsl:stylesheet>
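
One caveat that seems worth noting for both stylesheets: each extracts the trailing punctuation with a `replace` call whose pattern ends in `.*$`. By default `.` does not match a newline in XPath 2.0 regular expressions, so if the text node after `hi` contains a line break, the pattern does not match and `replace` returns the whole string unchanged. A minimal sketch of a more tolerant call, assuming the `s` (dot-all) flag is acceptable here; the variable name `tail` and its sample value are made up for illustration:

```xml
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <xsl:template match="/">
    <!-- Hypothetical text node content with a line break (&#10;) in it -->
    <xsl:variable name="tail" select="'!?” (Source).&#10;Next line.'"/>
    <!-- The 's' flag lets '.' match the newline as well, so the whole
         string matches and only the leading punctuation is kept -->
    <xsl:value-of select="replace($tail, '^([!?]+).*$', '$1', 's')"/>
  </xsl:template>

</xsl:stylesheet>
```

Run against any input document, this should output just `!?`. The sample input in the question never puts a line break in those text nodes, so the original calls behave correctly there.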
