I'm trying to parse element contents and split them on a delimiter, while keeping all elements in the parent. I don't need -- don't want -- to find the delimiter inside the child elements.

<data>
  <parse-field>Some text <an-element /> more text; cheap win? ;
    <another-element>with delimiter;!</another-element>; final text</parse-field>
</data>

Should become

<data>
  <parsed-field>
    <field>Some text <an-element /> more text</field>
    <field>cheap win?</field>
    <field><another-element>with delimiter;!</another-element></field>
    <field>final text</field>
  </parsed-field>
</data>

I've got a hacked-together solution that examines all "parse-field/text()" nodes and replaces each delimiter with <token />, then makes a second pass to pick out the pieces around the <token>s, but it's... hacked. And unpleasant. I'm wondering if there's a better way.

I'm using XSLT 2.0 with the Saxon processor, but I'm open to XSLT 1.0 solutions.

Etheryte
Keith Davies

2 Answers

This is not (yet?) a complete answer, just an outline of a possible approach. If you make your first pass something like:

<xsl:template match="@*|node()">
    <xsl:copy>
        <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
</xsl:template>

<xsl:template match="parse-field/text()">
    <xsl:call-template name="tokenize">
        <xsl:with-param name="text" select="."/>
    </xsl:call-template>
</xsl:template>

<xsl:template name="tokenize">
    <xsl:param name="text"/>
    <xsl:param name="delimiter" select="';'"/>
    <xsl:choose>
        <xsl:when test="contains($text, $delimiter)">
            <field>
                <xsl:value-of select="substring-before($text, $delimiter)"/>
            </field>
            <!-- recursive call -->
            <xsl:call-template name="tokenize">
                <xsl:with-param name="text" select="substring-after($text, $delimiter)"/>
            </xsl:call-template>
        </xsl:when>
        <xsl:when test="position()=last()">
            <field><xsl:value-of select="$text"/></field>
        </xsl:when>
        <xsl:when test="$text">
            <text><xsl:value-of select="$text"/></text>
        </xsl:when>
    </xsl:choose>
</xsl:template>

you would obtain:

<?xml version="1.0" encoding="UTF-8"?>
<data>
   <parse-field>
      <text>Some text </text>
      <an-element/>
      <field> more text</field>
      <field> cheap win? </field>
      <another-element>with delimiter;!</another-element>
      <field/>
      <field> final text</field>
   </parse-field>
</data>

This is now a grouping problem, where elements of <parse-field> need to be grouped, with each group ending with <field>.
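
To make the grouping step concrete, here is a minimal sketch of what a second pass could look like in XSLT 2.0. It assumes the input is exactly the intermediate document shown above, that the <text> and <field> marker elements do not clash with real element names in your data, and that whitespace is not significant. The "unwrap" mode name is made up for this illustration, and <parsed-field> is taken from the desired output in the question.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Identity template: copies everything not handled below. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Second pass over the intermediate <parse-field>:
       each group runs up to and including the next <field> marker. -->
  <xsl:template match="parse-field">
    <parsed-field>
      <xsl:for-each-group select="*" group-ending-with="field">
        <field>
          <xsl:apply-templates select="current-group()" mode="unwrap"/>
        </field>
      </xsl:for-each-group>
    </parsed-field>
  </xsl:template>

  <!-- Marker elements from the first pass contribute only their text.
       Note: "text" here is an element name test, not the text() node test. -->
  <xsl:template match="text | field" mode="unwrap">
    <xsl:value-of select="."/>
  </xsl:template>

  <!-- Any other child element is copied unchanged, so delimiters
       inside it are never examined. -->
  <xsl:template match="*" mode="unwrap">
    <xsl:copy-of select="."/>
  </xsl:template>

</xsl:stylesheet>
```

Leading and trailing spaces inside each resulting <field> are kept as they are; wrapping the value-of in normalize-space() would be one way to trim them, if that suits your data.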

michael.hor257k
  • That's quite close to what I have, so it's a form of validation :) I used tokenize() instead of recursion, which leaves me with 'delimiting tokens' instead of end-of-field tokens. I was kind of hoping there was a bit of magic somewhere, something like Muenchian grouping, that I just wasn't aware of... – Keith Davies Jun 02 '14 at 22:30
  • @KeithDavies I don't think you can use tokenize() here, because you need to generate two **types** of tokens. At least that's how I envisioned it. Once you have that, you can use grouping (in XSLT 2.0 you don't need to do Muenchian grouping, though you could). The main point here, IMHO, is that substrings are not nodes - so to do anything worthwhile you need to start with turning them into such. – michael.hor257k Jun 02 '14 at 22:43
  • oops, my mistake. An earlier iteration used tokenize(), unsatisfactorily. – Keith Davies Jun 03 '14 at 00:07

Best approach I've had so far, in simple form:

<xsl:variable name="delimiter" select="';'" />

<xsl:template match="foo">
  <xsl:copy>
    <xsl:apply-templates select="@*" />
    <xsl:call-template name="tokenize" />
  </xsl:copy>
</xsl:template>

<xsl:template name="tokenize">
  <xsl:variable name="rough">
    <xsl:apply-templates mode="tokenize" />
  </xsl:variable>
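  <!-- The calling template already emits the element via xsl:copy,
       so it is not copied a second time here. -->
  <!-- Group all nodes (text included), each group ending at a <delimiter/>. -->
  <xsl:for-each-group select="$rough/node()" group-ending-with="delimiter">
    <field><xsl:apply-templates select="current-group()[not(self::delimiter)]" /></field>
  </xsl:for-each-group>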
</xsl:template>

<xsl:template match="*" mode="tokenize">
  <xsl:copy>
    <xsl:apply-templates select="@*|node()" />
  </xsl:copy>
</xsl:template>

<xsl:template match="text()" mode="tokenize">
  <xsl:analyze-string select="." regex="([^{$delimiter}]*){$delimiter}">
    <xsl:matching-substring>
      <xsl:value-of select="regex-group(1)" /><delimiter/>
    </xsl:matching-substring>
    <xsl:non-matching-substring>
      <xsl:value-of select="." />
    </xsl:non-matching-substring>
  </xsl:analyze-string>
</xsl:template>
Keith Davies