I am in the middle of a very painful process of transforming a Word-based document into XML. I have run into the following problem:
<?xml version="1.0" encoding="UTF-8"?>
<root>
<p>
<element>This one is taken care of.</element> Some more text. „<hi rend="italics">Is this a
quote</hi>?” (Source). </p>
<p>
<element>This one is taken care of.</element> Some more text. „<hi rend="italics">This is a
quote</hi>” (Source). </p>
<p>
<element>This one is taken care of.</element> Some more text. „<hi rend="italics">This is
definitely a quote</hi>!” (Source). </p>
<p>
<element>This one is taken care of.</element> Some more text.„<hi rend="italics">This is a
first quote</hi>” (Source). „<hi rend="italics">Sometimes there is a second quote as
well</hi>!?” (Source). </p>
</root>
<p> nodes have mixed content. <element> I have taken care of in a previous iteration. But now the problem is with quotes and sources that partially appear within <hi rend="italics"/> and partially as text nodes.
How can I use XSLT 2.0 to:
- match all <hi rend="italics"> nodes that are immediately preceded by a text node whose last character is "„"?
- output the contents of <hi rend="italics"> as <quote>...</quote>, get rid of the quotation marks ("„" and "”"), but include within <quote/> any question and exclamation marks that immediately follow <hi rend="italics">?
- convert the text between "(" and ")" following the <hi rend="italics"> node to <source>...</source>, without the brackets.
- keep the final full stop.
In other words, my output should look like this:
<root>
<p>
<element>This one is taken care of.</element> Some more text. <quote>Is this a quote?</quote> <source>Source</source>.
</p>
<p>
<element>This one is taken care of.</element> Some more text. <quote>This is a quote</quote> <source>Source</source>.
</p>
<p>
<element>This one is taken care of.</element> Some more text. <quote>This is definitely a quote!</quote> <source>Source</source>.
</p>
<p>
<element>This one is taken care of.</element> Some more text. <quote>This is a first quote</quote> <source>Source</source>. <quote>Sometimes there is a second quote as well!?</quote> <source>Source</source>.
</p>
</root>
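
To make the requirements concrete, here is a minimal XSLT 2.0 sketch of one possible approach, not a verified solution. It relies on several assumptions that go beyond the sample above: every quote is a single <hi rend="italics"> directly inside <p>; the opening „ is the very last character of the preceding text node; any "?" or "!" and the closing ” sit at the very start of the following text node; and every parenthesised run in a text node directly after a quote is a source. The variable names ($next, $after-quote, $before-quote, $head-trimmed, $trimmed) are made up for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="xml" encoding="UTF-8" indent="no"/>

  <!-- Identity template: copy everything that is not handled below. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- An italic <hi> whose immediately preceding sibling is a text node
       ending in „ becomes a <quote>. -->
  <xsl:template match="hi[@rend='italics']
                         [preceding-sibling::node()[1][self::text()][ends-with(., '„')]]">
    <!-- The text node directly after the <hi>, if there is one. -->
    <xsl:variable name="next"
        select="following-sibling::node()[1][self::text()]"/>
    <quote>
      <!-- Assumption: the quote holds plain text only; line breaks are collapsed. -->
      <xsl:value-of select="normalize-space(.)"/>
      <!-- Pull in any ? or ! that precede the closing ”. -->
      <xsl:if test="matches($next, '^[?!]+”')">
        <xsl:value-of select="substring-before($next, '”')"/>
      </xsl:if>
    </quote>
  </xsl:template>

  <!-- Text nodes directly inside <p>: drop the quotation marks and any
       punctuation already moved into <quote>, and wrap (…) in <source>. -->
  <xsl:template match="p/text()">
    <xsl:variable name="after-quote"
        select="exists(preceding-sibling::node()[1][self::hi[@rend='italics']])"/>
    <xsl:variable name="before-quote"
        select="exists(following-sibling::node()[1][self::hi[@rend='italics']])"/>
    <!-- Remove leading punctuation plus ” when the text follows a quote. -->
    <xsl:variable name="head-trimmed"
        select="if ($after-quote) then replace(., '^[?!]*”', '') else string(.)"/>
    <!-- Remove a trailing „ when the text precedes a quote. -->
    <xsl:variable name="trimmed"
        select="if ($before-quote) then replace($head-trimmed, '„$', '')
                else $head-trimmed"/>
    <xsl:choose>
      <xsl:when test="$after-quote">
        <!-- Turn "(Source)" into <source>Source</source>; other text,
             including the final full stop, is passed through unchanged. -->
        <xsl:analyze-string select="$trimmed" regex="\(([^)]*)\)">
          <xsl:matching-substring>
            <source><xsl:value-of select="regex-group(1)"/></source>
          </xsl:matching-substring>
          <xsl:non-matching-substring>
            <xsl:value-of select="."/>
          </xsl:non-matching-substring>
        </xsl:analyze-string>
      </xsl:when>
      <xsl:otherwise>
        <xsl:value-of select="$trimmed"/>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>

</xsl:stylesheet>
```

The sketch splits the work between two templates because the pieces of each quote live in different nodes: the <hi> template builds <quote> and looks ahead for trailing punctuation, while the text() template removes what has already been consumed. Whitespace in the result follows the input, so the fourth paragraph, which has no space before „, would also have none before <quote>.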
I have never dealt with mixed content and string manipulations like this and the whole thing is really throwing me off. I will be incredibly grateful for your tips.