Flatten child elements nested within text nodes

Question

There are a number of flattening questions here, but none deal with this level of complexity.

I have an XML document that looks something like:

<document>
<div class='target-one'>
    maybe some text node, maybe not...1
    <randomElement>
        maybe some text node, maybe not...2
    </randomElement>

    <div class='target-one'>
        <randomElement>
            maybe some text node, maybe not...3
        </randomElement>
    </div>
    maybe some text node, maybe not...4
    <randomElement>
        maybe some text node, maybe not...5
    </randomElement>

    <div class='target-two'>
        maybe some text node, maybe not...6
        <randomElement>
            maybe some text node, maybe not...7
        </randomElement>
    </div>
    maybe some text node, maybe not...8
    <randomElement>
        maybe some text node, maybe not...9
    </randomElement>
</div>
<div class='target-two'>
    maybe some text node, maybe not...10
    <randomElement>
        maybe some text node, maybe not...11
    </randomElement>

    <div class='target-one'>
        <randomElement>
            maybe some text node, maybe not...12
        </randomElement>
    </div>
    maybe some text node, maybe not...13
    <randomElement>
        maybe some text node, maybe not...14
    </randomElement>

    <div class='target-two'>
        maybe some text node, maybe not...15
        <randomElement>
            maybe some text node, maybe not...16
        </randomElement>
    </div>
    maybe some text node, maybe not...17
    <randomElement>
        maybe some text node, maybe not...18
    </randomElement>
</div>

</document>

So there is a set of target elements (here, divs with class target-one or target-two) that can be nested in any order. Whenever they are nested, I would like to flatten them: each nested target div becomes a sibling of its parent, and the runs of text and randomElement nodes before and after it are each wrapped in their own copy of the parent element. What I mean is that the output should look like:

<document>
<div class='target-one'>
    maybe some text node, maybe not...1
    <randomElement>
        maybe some text node, maybe not...2
    </randomElement>
</div>
<div class='target-one'>
    <randomElement>
        maybe some text node, maybe not...3
    </randomElement>
</div>
<div class='target-one'>
    maybe some text node, maybe not...4
    <randomElement>
        maybe some text node, maybe not...5
    </randomElement>
</div>
<div class='target-two'>
    maybe some text node, maybe not...6
    <randomElement>
        maybe some text node, maybe not...7
    </randomElement>
</div>
<div class='target-one'>
    maybe some text node, maybe not...8
    <randomElement>
        maybe some text node, maybe not...9
    </randomElement>
</div>
<div class='target-two'>
    maybe some text node, maybe not...10
    <randomElement>
        maybe some text node, maybe not...11
    </randomElement>
</div>
<div class='target-one'>
    <randomElement>
        maybe some text node, maybe not...12
    </randomElement>
</div>
<div class='target-two'>
    maybe some text node, maybe not...13
    <randomElement>
        maybe some text node, maybe not...14
    </randomElement>
</div>
<div class='target-two'>
    maybe some text node, maybe not...15
    <randomElement>
        maybe some text node, maybe not...16
    </randomElement>
</div>
<div class='target-two'>
    maybe some text node, maybe not...17
    <randomElement>
        maybe some text node, maybe not...18
    </randomElement>
</div>

</document>

So I wind up with many more of the parent divs, but all the text and the other nodes are in the right place. Please note that randomElement might be a div that is not a target class...

This is for reformatting ebooks for paging in an online library, so there might be an enormous number of elements before we actually hit a problem div. Thus we need some way to select all the elements and text nodes between problem child divs as a group, because if each of them is wrapped in its own div, it does no good: we would wind up with every p, em or span as its own page.

At the same time, most parent divs have no problem children. As long as the solution passes them through, I can clean up any empty divs with another run, but I do need this to work at least on a rudimentary level with text that has no child elements as well.

This is my first question on StackOverflow because I just don't get the recursion that would be necessary for this.

Thanks!

EDIT BASED ON THE ANSWER BY user52889. This never worked out but I am leaving it here for readability:

XSL that I can fire off in saxon:

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
<xsl:output method="html"
        indent="yes"
        encoding="utf-8"/>
<xsl:strip-space elements="*"/>
<xsl:template match="@*|node()">
    <xsl:copy>
        <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
</xsl:template>
<xsl:template match="/"> 
    <xsl:apply-templates />  
</xsl:template>
<xsl:template match="div[matches(@class,'target-one|target-two','i')]">
    <xsl:for-each select="node()">
        <xsl:choose>
            <xsl:when test="self::*[matches(@class,'target-one|target-two','i')]">
                <xsl:apply-templates select="."/>
            </xsl:when>
            <xsl:when test="preceding-sibling::node()[0][not(self::*[matches(@class,'target-one|target-two','i')])]">
                <!-- do nothing, it will be handled by the next case -->
            </xsl:when>
            <xsl:otherwise>
                <!--
      create a copy of the element matched by the template, with its attrs
      add to it the current node and all nodes which follow it, up to the next SIGNIFICANT node
      or, put another way, all following siblings which either
      a) do not have a preceding significant node, or
      b) whose nearest preceding significant node is the same as the nearest preceding significant node of the current node, i.e. its following sibling node is the current node.
    -->
                <xsl:element name="{../name()}">
                    <xsl:apply-templates select="../@*"/>
                    <xsl:apply-templates select="following-sibling::node()[
          not(preceding-sibling::*[matches(@class,'target-one|target-two','i')])
          or 
          count(preceding-sibling::*[matches(@class,'target-one|target-two','i')][0]/following-sibling::node()[0] | current()) = 1
        ]" />
                </xsl:element>
            </xsl:otherwise>
        </xsl:choose>
    </xsl:for-each>
</xsl:template>
</xsl:stylesheet>

Current output from this file with children and duplicates:

<document>
<div class="target-one">
    <randomElement>
        maybe some text node, maybe not...2

    </randomElement>
    <div class="target-one"></div>
    maybe some text node, maybe not...4

    <randomElement>
        maybe some text node, maybe not...5

    </randomElement>
    <div class="target-two">
        <randomElement>
            maybe some text node, maybe not...7

        </randomElement>
    </div>
    <div class="target-two"></div>
    maybe some text node, maybe not...8

    <randomElement>
        maybe some text node, maybe not...9

    </randomElement>
</div>
<div class="target-one">
    <div class="target-one"></div>
    maybe some text node, maybe not...4

    <randomElement>
        maybe some text node, maybe not...5

    </randomElement>
    <div class="target-two">
        <randomElement>
            maybe some text node, maybe not...7

        </randomElement>
    </div>
    <div class="target-two"></div>
    maybe some text node, maybe not...8

    <randomElement>
        maybe some text node, maybe not...9

    </randomElement>
</div>
<div class="target-one"></div>
<div class="target-one">
    <randomElement>
        maybe some text node, maybe not...5

    </randomElement>
    <div class="target-two">
        <randomElement>
            maybe some text node, maybe not...7

        </randomElement>
    </div>
    <div class="target-two"></div>
    maybe some text node, maybe not...8

    <randomElement>
        maybe some text node, maybe not...9

    </randomElement>
</div>
<div class="target-one">
    <div class="target-two">
        <randomElement>
            maybe some text node, maybe not...7

        </randomElement>
    </div>
    <div class="target-two"></div>
    maybe some text node, maybe not...8

    <randomElement>
        maybe some text node, maybe not...9

    </randomElement>
</div>
<div class="target-two">
    <randomElement>
        maybe some text node, maybe not...7

    </randomElement>
</div>
<div class="target-two"></div>
<div class="target-one">
    <randomElement>
        maybe some text node, maybe not...9

    </randomElement>
</div>
<div class="target-one"></div>
<div class="target-two">
    <randomElement>
        maybe some text node, maybe not...11

    </randomElement>
    <div class="target-one"></div>
    maybe some text node, maybe not...13

    <randomElement>
        maybe some text node, maybe not...14

    </randomElement>
    <div class="target-two">
        <randomElement>
            maybe some text node, maybe not...16

        </randomElement>
    </div>
    <div class="target-two"></div>
    maybe some text node, maybe not...17

    <randomElement>
        maybe some text node, maybe not...18

    </randomElement>
</div>
<div class="target-two">
    <div class="target-one"></div>
    maybe some text node, maybe not...13

    <randomElement>
        maybe some text node, maybe not...14

    </randomElement>
    <div class="target-two">
        <randomElement>
            maybe some text node, maybe not...16

        </randomElement>
    </div>
    <div class="target-two"></div>
    maybe some text node, maybe not...17

    <randomElement>
        maybe some text node, maybe not...18

    </randomElement>
</div>
<div class="target-one"></div>
<div class="target-two">
    <randomElement>
        maybe some text node, maybe not...14

    </randomElement>
    <div class="target-two">
        <randomElement>
            maybe some text node, maybe not...16

        </randomElement>
    </div>
    <div class="target-two"></div>
    maybe some text node, maybe not...17

    <randomElement>
        maybe some text node, maybe not...18

    </randomElement>
</div>
<div class="target-two">
    <div class="target-two">
        <randomElement>
            maybe some text node, maybe not...16

        </randomElement>
    </div>
    <div class="target-two"></div>
    maybe some text node, maybe not...17

    <randomElement>
        maybe some text node, maybe not...18

    </randomElement>
</div>
<div class="target-two">
    <randomElement>
        maybe some text node, maybe not...16

    </randomElement>
</div>
<div class="target-two"></div>
<div class="target-two">
    <randomElement>
        maybe some text node, maybe not...18

    </randomElement>
</div>
<div class="target-two"></div>
</document>

score 2 · Accepted Answer · answered Jan 04 '15 at 10:27

Treating it as a grouping problem, I came up with:

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">

<xsl:param name="prefix" select="'target-'"/>

<xsl:output indent="yes"/>

<xsl:template match="document">
  <xsl:copy>
    <xsl:for-each-group select="descendant::text()[normalize-space()]"
      group-adjacent="generate-id(ancestor::div[starts-with(@class, $prefix)][1])">
      <xsl:apply-templates select="ancestor::div[starts-with(@class, $prefix)][1]" mode="g">
        <xsl:with-param name="group" select="current-group()"/>
      </xsl:apply-templates>
    </xsl:for-each-group>
  </xsl:copy>
</xsl:template>

<xsl:template match="*" mode="g">
  <xsl:param name="group"/>
  <xsl:if test=". intersect $group/ancestor::*">
    <xsl:copy>
      <xsl:copy-of select="@*"/>
      <xsl:apply-templates select="node()" mode="g">
        <xsl:with-param name="group" select="$group"/>
      </xsl:apply-templates>
    </xsl:copy>
  </xsl:if>
</xsl:template>

<xsl:template match="text()" mode="g">
  <xsl:param name="group"/>
  <xsl:if test=". intersect $group">
    <xsl:copy/>
  </xsl:if>
</xsl:template>

</xsl:stylesheet>

That groups all non-whitespace descendant text nodes by their nearest ancestor div with the class you are looking for and then, for each group, recreates the subtree of that ancestor containing only the text nodes in the group.

Works 100% even with random numbers of tags and nodes in between the divs. Even works with multiple levels of nesting. Accepted answer. — sgc, Jan 04 '15 at 11:36
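
To see what the grouping step in the accepted stylesheet above produces, a diagnostic variant can print one line per group instead of rebuilding the tree. This is only a sketch for inspection, not part of the answer: it reuses the answer's prefix parameter and grouping key, while the "group N [class]: ..." output format is made up for illustration.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">

<xsl:param name="prefix" select="'target-'"/>

<xsl:output method="text"/>

<!-- Diagnostic only: list the groups rather than recreating any elements. -->
<xsl:template match="/">
  <xsl:for-each-group select="//text()[normalize-space()]"
    group-adjacent="generate-id(ancestor::div[starts-with(@class, $prefix)][1])">
    <!-- The context node is the first text node of the group;
         its nearest target ancestor is the div that owns the whole group. -->
    <xsl:variable name="owner" select="ancestor::div[starts-with(@class, $prefix)][1]"/>
    <xsl:value-of separator=""
      select="'group ', position(), ' [', string($owner/@class), ']: ',
              string-join(for $t in current-group() return normalize-space($t), ' | '),
              '&#10;'"/>
  </xsl:for-each-group>
</xsl:template>

</xsl:stylesheet>
```

Run against the sample document in the question, this should report ten groups, one for each div in the desired output; for example, texts 1 and 2 share a group owned by the outer target-one div, and text 3 forms a group of its own.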

score 1 · Answer 2 · answered Jan 03 '15 at 22:53

It's difficult to understand what in your example is a rule and what's just an example. The following stylesheet will produce the required result - perhaps that's what you're looking for. If not, edit your question and explain the logic behind the requested transformation.

XSLT 2.0 (or 1.0)

<xsl:stylesheet version="2.0" 
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/>
<xsl:strip-space elements="*"/>

<xsl:template match="/document">
    <document>
        <xsl:for-each select="//randomElement">
            <div class='{../@class}'>
                <xsl:copy-of select=". | preceding-sibling::text()[1]"/>
            </div>
        </xsl:for-each>
    </document>
</xsl:template>

</xsl:stylesheet>

michael.hor257k

Interesting solution. I would never have thought that would output siblings! I have updated with more context right before the "Thanks!" in my question, because it is not possible to target each randomElement this way... Many thanks! – sgc Jan 04 '15 at 08:37
I am afraid you're still not being clear. If "*it is not possible to target each randomElement this way*" then how *is* it possible to target them? We only have your example to go by. – michael.hor257k Jan 04 '15 at 11:10
You can't because the content is random. This is why it is a hard case. The only way to do it is something like: run through the DOM until you hit one of the target divs, select everything before it, wrap all that in a new div, output the div that was found, check if there are more nodes afterwards, rinse and repeat until everything is processed. Sorry for the example, but there is a limit to the amount of text I can include here. The target divs are chapters, while there are a hundred plus other tags in the document. We must target the specific divs and not the hundreds of random elements. – sgc Jan 04 '15 at 11:28
No content is random (if it were, you couldn't write an algorithm to handle it). – michael.hor257k Jan 04 '15 at 11:46
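
The loop sgc describes in the comments above (walk the children until a target div turns up, wrap everything before it in a new div, output the target div, repeat) can be sketched in XSLT 2.0 with adjacent grouping. This is only a sketch under assumptions that are not in the thread: target divs are recognised by a class starting with 'target-', elements are in no namespace, and nested target divs are direct children of their parent target div. A target div buried inside some other element is not lifted out by this version; text-node grouping as in the accepted answer covers that case.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">

<xsl:output indent="yes"/>
<!-- Drop whitespace-only text so it cannot form empty wrapper groups. -->
<xsl:strip-space elements="*"/>

<!-- Identity template: everything that is not a target div passes through. -->
<xsl:template match="@*|node()">
  <xsl:copy>
    <xsl:apply-templates select="@*|node()"/>
  </xsl:copy>
</xsl:template>

<xsl:template match="div[starts-with(@class, 'target-')]">
  <xsl:variable name="parent" select="."/>
  <!-- Split the children into alternating runs: target divs / everything else. -->
  <xsl:for-each-group select="node()"
    group-adjacent="boolean(self::div[starts-with(@class, 'target-')])">
    <xsl:choose>
      <xsl:when test="current-grouping-key()">
        <!-- A run of nested target divs: recurse, so they come out as siblings. -->
        <xsl:apply-templates select="current-group()"/>
      </xsl:when>
      <xsl:otherwise>
        <!-- A run of other nodes: wrap it in a fresh copy of the parent div. -->
        <div>
          <xsl:copy-of select="$parent/@*"/>
          <xsl:apply-templates select="current-group()"/>
        </div>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:for-each-group>
</xsl:template>

</xsl:stylesheet>
```

On the sample document this should give the same structure as the desired output, with each run between nested target divs kept together in one wrapper. A target div with no content at all disappears, since there is nothing to group.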

score 0 · Answer 3 · answered Jan 03 '15 at 17:27

Sounds like you want something like the following, where SIGNIFICANT is some expression describing all those and only those elements you want to be your new list items (e.g. something like div[substring(@class,1,6)='target'])...

<xsl:template match="SIGNIFICANT">
  <xsl:for-each select="node()">
    <xsl:choose>
      <xsl:when test="self::SIGNIFICANT">
        <xsl:apply-templates select="."/>
      </xsl:when>
      <xsl:when test="preceding-sibling::node()[0][not(self::SIGNIFICANT)]">
        <!-- do nothing, it will be handled by the next case -->
      </xsl:when>
      <xsl:otherwise>
        <!--
          create a copy of the element matched by the template, with its attrs
          add to it the current node and all nodes which follow it, up to the next SIGNIFICANT node
          or, put another way, all following siblings which either
          a) do not have a preceding significant node, or
          b) whose nearest preceding significant node is the same as the nearest preceding significant node of the current node, i.e. its following sibling node is the current node.
        -->
        <xsl:element name="../name()">
          <xsl:apply-templates select="../@*"/>
          <xsl:apply-templates select="following-sibling::node()[
              not(preceding-sibling::SIGNIFICANT)
              or 
              count(preceding-sibling::SIGNIFICANT[0]/following-sibling::node()[0] | current()) = 1
            ]">
        </xsl:element>
      </xsl:otherwise>
  </xsl:for-each>
</xsl:template>

Note: this means a top-level div with no child nodes will be removed entirely. You could trivially wrap in a choose/when if you don't want that behaviour.

Note also: There may be a more performant way to do this recursively for extremely long lists.

I found a few typos (unclosed choose tag, needed to add in a * after self it seems, etc.). I did manage to run it on my xml above, but it does not quite work. I still get children divs and I am not quite sure why but there are duplicates. Things are not getting wrapped in the new tags, they are just inserted empty. I have no idea how to include all that code in comments, so I will edit above with appended code... — sgc, Jan 03 '15 at 18:27
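
One likely contributor to the duplicates reported in the comment above is the [0] predicate. XPath positions start at 1, so node()[0] means node()[position() = 0] and never selects anything. In the template above, that makes the second xsl:when always false, so every non-target child starts a wrapper of its own. The count(...) = 1 test is then always true, so each wrapper receives all following siblings. The nearest preceding sibling is [1], because preceding-sibling is a reverse axis. The stylesheet below is a minimal sketch to confirm the predicate behaviour, not a corrected version of the answer, and can be run against any XML document whose root element has at least two child nodes.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">

<xsl:output method="text"/>

<xsl:template match="/">
  <!-- Pick the last child node of the root element as the reference node. -->
  <xsl:variable name="last" select="*/node()[last()]"/>
  <!-- [0] is shorthand for [position() = 0], which no node satisfies. -->
  <xsl:value-of separator=""
    select="'with [0]: ', count($last/preceding-sibling::node()[0]), '&#10;'"/>
  <!-- [1] on a reverse axis is the nearest preceding sibling. -->
  <xsl:value-of separator=""
    select="'with [1]: ', count($last/preceding-sibling::node()[1]), '&#10;'"/>
</xsl:template>

</xsl:stylesheet>
```

This prints "with [0]: 0" followed by "with [1]: 1". Replacing [0] with [1] in the answer removes this particular problem, though the rest of the logic may still need adjusting.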
