Parsing plain text in CDATA to html with XSLT 2.0/3.0 using multiple steps. Part way there

Question

I have a working process, using XSLT 2.0/3.0 with the latest version of Saxon-HE that supports XSLT 3.0, that takes the text of a CDATA section in an XML file and marks it up as HTML. The text has no HTML tags, only minimal plain-text markup that relies on new lines and on markup in square brackets at the beginning of lines. The working process (not shown here) sets the text in a variable and runs it through multiple stages of replace functions with patterns and tons of <, >, ' etc. to gradually get to the final result. Not only is it difficult to read, it's also not very extensible if I want to add another change to the markup. I've started trying to build a better markup process below, but I'm getting stuck.

Here is a small sample of my trimmed XML file structure:

<?xml version="1.0" encoding="UTF-8"?>
<project>
---------------------------------------------------
<document>
<docText><![CDATA[
[page 001] 1
[margin] Person1 to Person2
This Indenture made this x''th Day of y in the year z Between person1,     grantor, of place1 to person2, grantee, of place2 for 5 dollars ... the s''d person1 to s''d person2 ... signed under my hand.

Witnesses present
[signed] Mrs. Jane Doe (seal)
[witness] Mr. Witness1
[witness] Ms. Witness1

Court office month x''th year
I do hereby certify that ... and is thereon truly admitted to Record
[clerk] John G. Reynolds DCCC
]]></docText>
<persons>
<person role="grantor">Jane Doe</person>
<person role="grantee">Bob Jones</person>
</persons>
</document>
---------------------------------------------------
<document>
<docText><![CDATA[
[page 002] 2
[margin] Person3 to Person4
This Indenture made this x''th Day of y in the year z Between person1, grantor, of place1 to person2, grantee, of place2 for 5 dollars ... the s''d person1 to s''d person2 ... signed under my hand.

Witnesses present
[signed] Mr. John Doe (seal)
[witness] Mr. Witness1
[witness] Ms. Witness1

[page 003] 3

Court office month x''th year
I do hereby certify that ... and is thereon truly admitted to Record
[clerk] John G. Reynolds DCCC
]]></docText>
<persons>
<person role="grantor">John Doe</person>
<person role="grantee">Bob Jones</person>
</persons>
</document>
</project>

These are some of the steps I want to take with the text in CDATA:

1. tokenize all lines using \n new line
2. lines which begin with a word in square brackets (e.g., [witness]) are tagged with <div> using class in brackets (e.g., <div class="witness">rest of line</div>)
3. remaining lines are tagged with <p> tags
4. all blank lines are eliminated
5. scan text in the <div> and <p> text nodes above for further processing:
6. find any pair of single quotes (i.e. paired apostrophe) followed by 1 to 4 upper or lower case letters and place in <sup></sup> (e.g., 25''th becomes 25<sup>th</sup>)
7. group adjacent <div> of same class name into outer <div> of a certain name e.g.

<div class="a">b</div>
<div class="a">b</div>
becomes
<div class="a-outer">
<div class="a">b</div>
<div class="a">b</div>
</div>

8. additional markup as needed.

I have what I want through step 6 (half of 5), though the structure is likely poor. This stylesheet works and gives me most of what I had in the much longer previous stylesheet and templates.

Here is a shortened version of my XSLT 3.0 stylesheet and templates:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="3.0" 
xmlns="http://www.w3.org/1999/xhtml"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform" 
xmlns:xs="http://www.w3.org/2001/XMLSchema" 
xmlns:my="my:functions" 
xmlns:fn="http://www.w3.org/2005/xpath-functions"
exclude-result-prefixes="xsl xs my fn" expand-text="yes">

<xsl:output method="html" html-version="5.0" encoding="utf-8"  indent="yes"/>
<xsl:template match="/">
<html>
      <head>
      <title>Test Title</title>
      <style>
    div {{background-color: pink;}}
    p {{background-color: ; clear: right; margin-bottom: 0;}}
    .clerk, .signed {{float:right;}}
    .margin::before {{content: "[margin note:] ";}}
    .clear {{clear: right;}}
      </style>
      </head>
      <body>
           <h2>Records</h2>
           <xsl:apply-templates select="project/document"/>
      </body>
 </html>
 </xsl:template>

 <xsl:template match="document">
      <article>
      <h3><xsl:value-of select="persons/person[@role='grantor']"/> to 
      <xsl:value-of select="persons/person[@role='grantee']"/></h3>
      <xsl:apply-templates select="docText"/> <!-- docText contains text inside CDATA section -->
      <div class="clear"/>
      </article><hr />
 </xsl:template>

 <!-- all lines of text are parsed here and tagged with either <p> or  <div> and blank lines discarded-->
<xsl:template match="docText">
<xsl:variable name="vLines" select="fn:analyze-string(., '\n')" />
<xsl:for-each select="$vLines/fn:non-match">
<xsl:choose>
<xsl:when test="starts-with(.,'[')">
    <xsl:variable name="v2" select="fn:analyze-string(.,'\[(witness|signed|clerk|margin)\]')"/>
    <div class="{fn:replace($v2/fn:match , '\[(.*?)\]' , '$1')}">{$v2/fn:non-match}</div>
</xsl:when>
<xsl:otherwise>
    <p>
    <xsl:call-template name="tReplaceDblApos">
    <xsl:with-param name="pText" select="."/>
    </xsl:call-template>
    </p>
</xsl:otherwise>
</xsl:choose>
</xsl:for-each>
</xsl:template>

 <!-- any 1 to 4 characters following two adjacent single quotes is tagged with <sup> without quotes-->
 <xsl:template name="tReplaceDblApos">
 <xsl:param name="pText"/>
 <xsl:analyze-string select="$pText" regex="''([a-zA-Z]{{1,4}})">
 <xsl:matching-substring>
      <sup><xsl:value-of select="regex-group(1)"/></sup>
 </xsl:matching-substring>
 <xsl:non-matching-substring>
      <xsl:value-of select="."/>
 </xsl:non-matching-substring>
 </xsl:analyze-string>
 </xsl:template>

 </xsl:stylesheet>

I would appreciate any suggestion for better ways of accomplishing this type of markup and how to make it extensible and accomplish the last step listed for example. I've tried off and on the last several months to make the process simpler, and this is the closest I've gotten so far. Apologies for any misuse of terminology, the long example, and the novice state of the code.

Michael

score 2 · Accepted Answer · answered Jul 31 '17 at 09:18

Here is an attempt to do the grouping directly on the lines tokenized with the tokenize function:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:math="http://www.w3.org/2005/xpath-functions/math"
    xmlns:mf="http://example.com/mf"
    exclude-result-prefixes="xs math mf"
    version="3.0">

    <xsl:output method="html" html-version="5.0" encoding="utf-8"  indent="yes"/>
    <xsl:template match="/">
        <html>
            <head>
                <title>Test Title</title>
                <style>
                    div {{background-color: pink;}}
                    p {{background-color: ; clear: right; margin-bottom: 0;}}
                    .clerk, .signed {{float:right;}}
                    .margin::before {{content: "[margin note:] ";}}
                    .clear {{clear: right;}}
                </style>
            </head>
            <body>
                <h2>Records</h2>
                <xsl:apply-templates select="project/document"/>
            </body>
        </html>
    </xsl:template>

    <xsl:template match="document">
        <article>
            <h3><xsl:value-of select="persons/person[@role='grantor']"/> to 
                <xsl:value-of select="persons/person[@role='grantee']"/></h3>
            <xsl:apply-templates select="docText"/> <!-- docText contains text inside CDATA section -->
            <div class="clear"/>
        </article><hr />
    </xsl:template>

    <!-- all lines of text are parsed here and tagged with either <p> or  <div> and blank lines discarded-->
    <xsl:template match="docText">
        <xsl:for-each-group select="tokenize(., '\n')[normalize-space()]" group-adjacent="string(analyze-string(., '^\[(witness|signed|clerk|margin)\]')//*:match/*:group)">
            <xsl:choose>
                <xsl:when test="current-grouping-key() and current-group()[2]">
                    <div class="{current-grouping-key()}-outer">
                        <xsl:apply-templates select="current-group()" mode="wrap-div">
                            <xsl:with-param name="class" select="current-grouping-key()"/>
                        </xsl:apply-templates>
                    </div>
                </xsl:when>
                <xsl:when test="current-grouping-key()">
                    <xsl:apply-templates select="current-group()" mode="wrap-div">
                        <xsl:with-param name="class" select="current-grouping-key()"/>
                    </xsl:apply-templates>                    
                </xsl:when>
                <xsl:otherwise>
                    <xsl:apply-templates select="current-group()" mode="wrap-p"/>
                </xsl:otherwise>
            </xsl:choose>
        </xsl:for-each-group>

    </xsl:template>

    <xsl:template match=".[. instance of xs:string]" mode="wrap-div">
        <xsl:param name="class"/>
        <div class="{$class}">
            <xsl:value-of select="replace(., '^\[.*?\]', '')"/>
        </div>
    </xsl:template>

    <xsl:template match=".[. instance of xs:string]" mode="wrap-p">
        <p>
            <xsl:sequence select="mf:rep-quotes(.)"/>
        </p>
    </xsl:template>

    <xsl:function name="mf:rep-quotes">
        <xsl:param name="input" as="xs:string"/>
        <xsl:analyze-string select="$input" regex="''([a-zA-Z]{{1,4}})">
            <xsl:matching-substring>
                <sup><xsl:value-of select="regex-group(1)"/></sup>
            </xsl:matching-substring>
            <xsl:non-matching-substring>
                <xsl:value-of select="."/>
            </xsl:non-matching-substring>
        </xsl:analyze-string>
    </xsl:function>

</xsl:stylesheet>

Output I get is

<!DOCTYPE HTML>
<html>
   <head>
      <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
      <title>Test Title</title><style>
                    div {{background-color: pink;}}
                    p {{background-color: ; clear: right; margin-bottom: 0;}}
                    .clerk, .signed {{float:right;}}
                    .margin::before {{content: "[margin note:] ";}}
                    .clear {{clear: right;}}
                </style></head>
   <body>
      <h2>Records</h2>
      <article>
         <h3>Jane Doe to 
            Bob Jones
         </h3>
         <p>[page 001] 1</p>
         <div class="margin"> Person1 to Person2</div>
         <p>This Indenture made this x<sup>th</sup> Day of y in the year z Between person1,     grantor, of place1 to person2, grantee,
            of place2 for 5 dollars ... the s<sup>d</sup> person1 to s<sup>d</sup> person2 ... signed under my hand.
         </p>
         <p>Witnesses present</p>
         <div class="signed"> Mrs. Jane Doe (seal)</div>
         <div class="witness-outer">
            <div class="witness"> Mr. Witness1</div>
            <div class="witness"> Ms. Witness1</div>
         </div>
         <p>Court office month x<sup>th</sup> year
         </p>
         <p>I do hereby certify that ... and is thereon truly admitted to Record</p>
         <div class="clerk"> John G. Reynolds DCCC</div>
         <div class="clear"></div>
      </article>
      <hr>
      <article>
         <h3>John Doe to 
            Bob Jones
         </h3>
         <p>[page 002] 2</p>
         <div class="margin"> Person3 to Person4</div>
         <p>This Indenture made this x<sup>th</sup> Day of y in the year z Between person1, grantor, of place1 to person2, grantee, of
            place2 for 5 dollars ... the s<sup>d</sup> person1 to s<sup>d</sup> person2 ... signed under my hand.
         </p>
         <p>Witnesses present</p>
         <div class="signed"> Mr. John Doe (seal)</div>
         <div class="witness-outer">
            <div class="witness"> Mr. Witness1</div>
            <div class="witness"> Ms. Witness1</div>
         </div>
         <p>[page 003] 3</p>
         <p>Court office month x<sup>th</sup> year
         </p>
         <p>I do hereby certify that ... and is thereon truly admitted to Record</p>
         <div class="clerk"> John G. Reynolds DCCC</div>
         <div class="clear"></div>
      </article>
      <hr>
   </body>
</html>

Martin, thanks for the time and effort in providing a complete working solution. I will study your solution tonight and test it with a more complete version of my .xml file and see if I can add other steps. I just started adding the XSLT3 components a few days ago, so I'm unfamiliar with a few things in the solution that might be specific to that version. Specifically, I need to study what the `.[. instance of xs:string]` code does. I hadn't thought of using the `group-adjacent` option, seems like a great solution. Thanks, Michael — Sawtooth67, Jul 31 '17 at 16:48
With XSLT 3.0, you can match on atomic values too, not only on nodes. The pattern `.[. instance of xs:string]` matches string values, so this is a template to process the strings that e.g. `xsl:apply-templates select="current-group()"` passes on from the `for-each-group`. The syntax is called a predicate pattern: https://www.w3.org/TR/xslt-30/#doc-xslt30-patterns-PredicatePattern. It takes some time to get used to that approach, but just as XSLT breaks up XML processing into templates, it can also help to break up plain string processing into templates. — Martin Honnen, Jul 31 '17 at 16:58
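To illustrate the predicate pattern described in the comment above, here is a minimal sketch that is not taken from the answer: a stylesheet that applies templates to a sequence of atomic values and dispatches on their type. The element names `items`, `text`, and `number` and the sample values are made up for the illustration. It assumes an XSLT 3.0 processor such as Saxon-HE, started at the template named `xsl:initial-template` (for example with the `-it` option on the Saxon command line), so no source document is needed.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs">

    <xsl:output method="xml" indent="yes"/>

    <!-- Entry point: no input document, just a sequence of atomic values -->
    <xsl:template name="xsl:initial-template">
        <items>
            <!-- The selected items are two strings and one integer, not nodes -->
            <xsl:apply-templates select="'alpha', 42, 'beta'"/>
        </items>
    </xsl:template>

    <!-- Predicate pattern: matches any context item that is an xs:string -->
    <xsl:template match=".[. instance of xs:string]">
        <text><xsl:value-of select="."/></text>
    </xsl:template>

    <!-- Predicate pattern: matches any context item that is an xs:integer -->
    <xsl:template match=".[. instance of xs:integer]">
        <number><xsl:value-of select="."/></number>
    </xsl:template>

</xsl:stylesheet>
```

Each value is routed to the template whose type test it satisfies, so the result should contain a `text`, a `number`, and a `text` element in that order. The `wrap-div` and `wrap-p` templates in the answer use the same mechanism, with modes to choose between two treatments of the same string type.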
I've applied your solution to my .xml document and I'm getting most of the results I need. I changed the `` to `` so that text in the `<div>` also received the markup. I'll need to add some additional conditional statements to handle the bracketed formatting instructions that get converted to `<div>` for the more complex needs of my real document. Would you comment on what is happening with the `//*:match/*:group` after the group-adjacent instruction? Thank you for the help. M. — Sawtooth67, Aug 01 '17 at 01:09
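The exact change mentioned in the comment above is not preserved in the thread. One plausible way to get the `<sup>` markup into the `<div>` text, offered only as a sketch under that assumption, is to pass the line (with its bracketed marker stripped) through the answer's `mf:rep-quotes` function inside the `wrap-div` template. The sample line and the initial template are made up for the illustration; the function body is copied from the answer.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:mf="http://example.com/mf"
    exclude-result-prefixes="xs mf">

    <xsl:output method="xml" indent="yes"/>

    <!-- Hypothetical entry point with one made-up sample line -->
    <xsl:template name="xsl:initial-template">
        <xsl:variable name="line" as="xs:string">[witness] Mr. Witness1, sworn the 3''d day</xsl:variable>
        <xsl:apply-templates select="$line" mode="wrap-div">
            <xsl:with-param name="class" select="'witness'"/>
        </xsl:apply-templates>
    </xsl:template>

    <!-- Assumed variant of the answer's wrap-div template:
         strip the leading [marker], then apply the sup markup -->
    <xsl:template match=".[. instance of xs:string]" mode="wrap-div">
        <xsl:param name="class"/>
        <div class="{$class}">
            <xsl:sequence select="mf:rep-quotes(replace(., '^\[.*?\]', ''))"/>
        </div>
    </xsl:template>

    <!-- Copied from the answer: two apostrophes plus 1 to 4 letters become <sup> -->
    <xsl:function name="mf:rep-quotes">
        <xsl:param name="input" as="xs:string"/>
        <xsl:analyze-string select="$input" regex="''([a-zA-Z]{{1,4}})">
            <xsl:matching-substring>
                <sup><xsl:value-of select="regex-group(1)"/></sup>
            </xsl:matching-substring>
            <xsl:non-matching-substring>
                <xsl:value-of select="."/>
            </xsl:non-matching-substring>
        </xsl:analyze-string>
    </xsl:function>

</xsl:stylesheet>
```

Because `xsl:sequence` returns the text nodes and `sup` elements produced by the function unchanged, the `div` receives mixed content rather than a flattened string, which is what `xsl:value-of` would give.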
The `analyze-string` function returns some XML containing `match` and `group` elements. The regular expression `^\[(witness|signed|clerk|margin)\]` includes the square brackets, but later on only the word inside those brackets is needed for the class. So I decided to use only that word, captured as the first and only group in the regular expression, as the grouping key for the `group-adjacent`. The use of e.g. `*:match` is just a namespace-agnostic selection of the `match` element(s) returned by the analyze-string function. — Martin Honnen, Aug 01 '17 at 08:06
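To make the shape of that result visible, the following sketch (not part of the answer) runs the same regular expression over three sample lines and records both the XML returned by `analyze-string` and the key computed from it. The wrapper elements `keys` and `line` and the `key` attribute are made up for the illustration; it assumes an XSLT 3.0 processor started at `xsl:initial-template`.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

    <xsl:output method="xml" indent="yes"/>

    <xsl:template name="xsl:initial-template">
        <!-- Same regular expression as in the answer's group-adjacent attribute -->
        <xsl:variable name="regex" select="'^\[(witness|signed|clerk|margin)\]'"/>
        <keys>
            <xsl:for-each select="'[witness] Mr. Witness1', 'Witnesses present', '[page 001] 1'">
                <!-- key: string value of the captured group, or '' when nothing matches -->
                <line key="{string(analyze-string(., $regex)//*:match/*:group)}">
                    <!-- Copy the raw result so its match, group and non-match elements can be inspected -->
                    <xsl:copy-of select="analyze-string(., $regex)"/>
                </line>
            </xsl:for-each>
        </keys>
    </xsl:template>

</xsl:stylesheet>
```

For the first line the result contains a `match` element wrapping a `group` element with the text `witness`, so the key is `witness`. The other two lines produce only a `non-match` element, the path selects nothing, and `string()` of an empty sequence is the empty string. That empty key is why unmarked lines (including `[page 001]`, which is not in the list of recognized words) fall through to the `xsl:otherwise` branch and become `<p>` elements.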
Okay, that helps. I've not seen or used the namespace-agnostic reference before. So this is like where I was using `fn:match` in the original template. I've just looked through the reference at [https://www.w3.org/TR/xpath-functions-31/#func-analyze-string](https://www.w3.org/TR/xpath-functions-31/#func-analyze-string) and see there are 3 types of elements returned: `fn:match`, `fn:non-match`, and `fn:group`. I've learned a lot from your help. Thanks much - Michael — Sawtooth67, Aug 01 '17 at 11:20
