I have a working process in XSLT 2.0/3.0, running on the latest Saxon-HE release that supports XSLT 3.0, that takes the text of a CDATA section in an XML file and marks it up as HTML. The text contains no HTML tags, only minimal plain-text markup that relies on newlines and on markup in square brackets at the beginning of lines. The working process (not shown here) goes through multiple stages of storing the text in a variable and applying replace functions with patterns and tons of &lt;, &gt;, &apos;, etc. to gradually reach the final result. Not only is it difficult to read, it is also hard to extend if I want to add another change to the markup. I've started trying to build a better markup process, shown below, but I'm getting stuck.
Here is a small sample of my trimmed XML file structure:
<?xml version="1.0" encoding="UTF-8"?>
<project>
---------------------------------------------------
<document>
<docText><![CDATA[
[page 001] 1
[margin] Person1 to Person2
This Indenture made this x''th Day of y in the year z Between person1, grantor, of place1 to person2, grantee, of place2 for 5 dollars ... the s''d person1 to s''d person2 ... signed under my hand.
Witnesses present
[signed] Mrs. Jane Doe (seal)
[witness] Mr. Witness1
[witness] Ms. Witness1
Court office month x''th year
I do hereby certify that ... and is thereon truly admitted to Record
[clerk] John G. Reynolds DCCC
]]></docText>
<persons>
<person role="grantor">Jane Doe</person>
<person role="grantee">Bob Jones</person>
</persons>
</document>
---------------------------------------------------
<document>
<docText><![CDATA[
[page 002] 2
[margin] Person3 to Person4
This Indenture made this x''th Day of y in the year z Between person1, grantor, of place1 to person2, grantee, of place2 for 5 dollars ... the s''d person1 to s''d person2 ... signed under my hand.
Witnesses present
[signed] Mr. John Doe (seal)
[witness] Mr. Witness1
[witness] Ms. Witness1
[page 003] 3
Court office month x''th year
I do hereby certify that ... and is thereon truly admitted to Record
[clerk] John G. Reynolds DCCC
]]></docText>
<persons>
<person role="grantor">John Doe</person>
<person role="grantee">Bob Jones</person>
</persons>
</document>
</project>
These are some of the steps I want to take with the text in the CDATA section:
1. Tokenize all lines using the \n newline.
2. Lines which begin with a word in square brackets (e.g., [witness]) are tagged with <div>, using the word in brackets as the class (e.g., <div class="witness">rest of line</div>).
3. Remaining lines are tagged with <p> tags.
4. All blank lines are eliminated.
5. Scan the text in the <div> and <p> text nodes above for further processing:
   6. Find any pair of single quotes (i.e., a paired apostrophe) followed by 1 to 4 upper- or lower-case letters and place the letters in <sup></sup> (e.g., 25''th becomes 25<sup>th</sup>).
   7. Group adjacent <div> elements of the same class name into an outer <div> of a certain name, e.g.
      <div class="a">b</div> <div class="a">b</div>
      becomes
      <div class="a-outer"> <div class="a">b</div> <div class="a">b</div> </div>
8. Additional markup as needed.
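One direction I have been considering for the grouping of adjacent <div> elements is a two-pass approach: first tag every line and hold the result in a variable, then run xsl:for-each-group with group-adjacent over that variable. The following is only a minimal sketch and makes several assumptions: any single bracketed word at the start of a line becomes the class (so "[page 001] 1" falls through to <p> because of the space inside the brackets), only runs of two or more same-class <div> elements get the "-outer" wrapper, and the <sup> replacement is applied to <div> text as well as <p> text. The variable name vTagged is made up for illustration; tReplaceDblApos is the same named template used in my stylesheet further down.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns="http://www.w3.org/1999/xhtml">

  <xsl:output method="html" html-version="5.0" encoding="utf-8" indent="yes"/>

  <xsl:template match="/">
    <html>
      <head><title>Test Title</title></head>
      <body>
        <xsl:apply-templates select="project/document/docText"/>
      </body>
    </html>
  </xsl:template>

  <xsl:template match="docText">
    <!-- Pass 1: one <div> or <p> per non-blank line, kept in a variable.
         The predicate [normalize-space()] drops blank lines. -->
    <xsl:variable name="vTagged" as="element()*">
      <xsl:for-each select="tokenize(., '\r?\n')[normalize-space()]">
        <xsl:choose>
          <!-- Assumption: the class is a single bracketed word at line start -->
          <xsl:when test="matches(., '^\s*\[\w+\]')">
            <div class="{replace(., '^\s*\[(\w+)\].*$', '$1')}">
              <xsl:call-template name="tReplaceDblApos">
                <xsl:with-param name="pText"
                    select="normalize-space(replace(., '^\s*\[\w+\]', ''))"/>
              </xsl:call-template>
            </div>
          </xsl:when>
          <xsl:otherwise>
            <p>
              <xsl:call-template name="tReplaceDblApos">
                <xsl:with-param name="pText" select="normalize-space(.)"/>
              </xsl:call-template>
            </p>
          </xsl:otherwise>
        </xsl:choose>
      </xsl:for-each>
    </xsl:variable>

    <!-- Pass 2: adjacent elements with the same class form one group.
         <p> elements have no class, so their grouping key is ''. -->
    <xsl:for-each-group select="$vTagged" group-adjacent="string(@class)">
      <xsl:choose>
        <!-- Assumption: wrap only runs of two or more classed <div>s -->
        <xsl:when test="current-grouping-key() ne '' and count(current-group()) gt 1">
          <div class="{current-grouping-key()}-outer">
            <xsl:sequence select="current-group()"/>
          </div>
        </xsl:when>
        <xsl:otherwise>
          <xsl:sequence select="current-group()"/>
        </xsl:otherwise>
      </xsl:choose>
    </xsl:for-each-group>
  </xsl:template>

  <!-- Two single quotes followed by 1 to 4 letters: letters go in <sup>,
       quotes are dropped. Braces are doubled because regex is an AVT. -->
  <xsl:template name="tReplaceDblApos">
    <xsl:param name="pText"/>
    <xsl:analyze-string select="$pText" regex="''([a-zA-Z]{{1,4}})">
      <xsl:matching-substring>
        <sup><xsl:value-of select="regex-group(1)"/></sup>
      </xsl:matching-substring>
      <xsl:non-matching-substring>
        <xsl:value-of select="."/>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>
</xsl:stylesheet>
```

With the sample XML above, I would expect the two adjacent [witness] lines to end up inside a single <div class="witness-outer">, while the lone [margin], [signed] and [clerk] lines stay unwrapped. Keeping the tagged lines in a variable is what would let further passes be added later without touching the first one, but I have not verified that this scales to the additional markup I have in mind.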
I have what I want through step 6 (half of 5), though the structure is likely poor. This stylesheet works and gives me most of what I had with the much longer previous stylesheet and templates.
Here is a shortened version of my XSLT 3.0 stylesheet and templates:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="3.0"
xmlns="http://www.w3.org/1999/xhtml"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
xmlns:my="my:functions"
xmlns:fn="http://www.w3.org/2005/xpath-functions"
exclude-result-prefixes="xsl xs my fn" expand-text="yes">
<xsl:output method="html" html-version="5.0" encoding="utf-8" indent="yes"/>
<xsl:template match="/">
<html>
<head>
<title>Test Title</title>
<style>
div {{background-color: pink;}}
p {{background-color: ; clear: right; margin-bottom: 0;}}
.clerk, .signed {{float:right;}}
.margin::before {{content: "[margin note:] ";}}
.clear {{clear: right;}}
</style>
</head>
<body>
<h2>Records</h2>
<xsl:apply-templates select="project/document"/>
</body>
</html>
</xsl:template>
<xsl:template match="document">
<article>
<h3><xsl:value-of select="persons/person[@role='grantor']"/> to
<xsl:value-of select="persons/person[@role='grantee']"/></h3>
<xsl:apply-templates select="docText"/> <!-- docText contains text inside CDATA section -->
<div class="clear"/>
</article><hr />
</xsl:template>
<!-- all lines of text are parsed here and tagged with either <p> or <div> and blank lines discarded-->
<xsl:template match="docText">
<xsl:variable name="vLines" select="fn:analyze-string(., '\n')" />
<xsl:for-each select="$vLines/fn:non-match">
<xsl:choose>
<xsl:when test="starts-with(.,'[')">
<xsl:variable name="v2" select="fn:analyze-string(.,'\[(witness|signed|clerk|margin)\]')"/>
<div class="{fn:replace($v2/fn:match , '\[(.*?)\]' , '$1')}">{$v2/fn:non-match}</div>
</xsl:when>
<xsl:otherwise>
<p>
<xsl:call-template name="tReplaceDblApos">
<xsl:with-param name="pText" select="."/>
</xsl:call-template>
</p>
</xsl:otherwise>
</xsl:choose>
</xsl:for-each>
</xsl:template>
<!-- any 1 to 4 characters following two adjacent single quotes is tagged with <sup> without quotes-->
<xsl:template name="tReplaceDblApos">
<xsl:param name="pText"/>
<xsl:analyze-string select="$pText" regex="''([a-zA-Z]{{1,4}})">
<xsl:matching-substring>
<sup><xsl:value-of select="regex-group(1)"/></sup>
</xsl:matching-substring>
<xsl:non-matching-substring>
<xsl:value-of select="."/>
</xsl:non-matching-substring>
</xsl:analyze-string>
</xsl:template>
</xsl:stylesheet>
I would appreciate any suggestions for better ways of accomplishing this type of markup, for making it extensible, and for accomplishing, for example, the last step listed. I've tried off and on over the last several months to make the process simpler, and this is the closest I've gotten so far. Apologies for any misuse of terminology, the long example, and the novice state of the code.
Michael