Finding the lowest common ancestor of an XML node-set

Question

I have a node set constructed using the xsl:key structure in XSLT. I would like to find the lowest common ancestor (LCA) of all of the nodes in this node-set - any ideas?

I know about Kaysian intersects and XPath's intersect function, but these seem to be geared towards finding the LCA of just a pair of elements: I don't know in advance how many items will be in each node-set.

I was wondering if there might be a solution using a combination of the 'every' and 'intersect' expressions, but I haven't been able to think of one yet!

Thanks in advance, Tom

If anyone wants to know the bigger picture here, I'm moving footnotes in a book from one lump at the end to the lowest level from which they're referenced in the text. — Tom Hillman, Jan 05 '12 at 12:30

score 1 · Answer 1 · answered Jan 05 '12 at 13:22

I tried the following:

<xsl:stylesheet
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:xs="http://www.w3.org/2001/XMLSchema"
  xmlns:mf="http://example.com/mf"
  exclude-result-prefixes="xs mf"
  version="2.0">

  <xsl:output method="html" indent="yes"/>

  <xsl:function name="mf:lca" as="node()?">
    <xsl:param name="nodes" as="node()*"/>
    <xsl:variable name="all-ancestors" select="$nodes/ancestor::node()"/>
    <xsl:sequence
      select="$all-ancestors[every $n in $nodes satisfies exists($n/ancestor::node() intersect .)][last()]"/>
  </xsl:function>

  <xsl:template match="/">
    <xsl:sequence select="mf:lca(//foo)"/>
  </xsl:template>

</xsl:stylesheet>

Tested with the sample

<root>
  <anc1>
    <anc2>
      <foo/>
      <bar>
        <foo/>
      </bar>
      <bar>
        <baz>
          <foo/>
        </baz>
      </bar>
    </anc2>
  </anc1>
</root>

I get the anc2 element but I haven't tested with more complex settings and don't have the time now. Maybe you can try with your sample data and report back whether you get the results you want.

This looks great, although I think I've yet to satisfy myself as to why it's [last()] rather than [1] - possibly it would be different if you'd directly used $nodes/ancestor::* rather than $all-ancestors? — Tom Hillman, Jan 05 '12 at 14:26
The nice thing about this answer is that it's pure XPath - may come in handy for QA testing, even if I'm using Dimitre's solution in XSLT. — Tom Hillman, Jan 05 '12 at 15:33
Martin, You may be interested in a faster algorithm -- I updated my answer with what I believe to be an optimal algorithm for LCA. — Dimitre Novatchev, Jan 06 '12 at 03:50
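
On the `[last()]` question above: `$all-ancestors` is a variable holding nodes in document order, so the outermost ancestor comes first and the deepest common ancestor is the last node to survive the filter. A minimal sketch of an alternative, assuming the same `mf` namespace as the stylesheet above (the function name `mf:lca-nearest` is hypothetical and not part of Martin's answer): when the predicates are attached directly to the reverse-axis step `ancestor::node()`, positions count from the nearest ancestor outwards, so `[1]` should select the same node.

```xml
<xsl:stylesheet version="2.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:mf="http://example.com/mf"
  exclude-result-prefixes="mf">

  <!-- Hypothetical variant of mf:lca: same test, different position rule. -->
  <xsl:function name="mf:lca-nearest" as="node()?">
    <xsl:param name="nodes" as="node()*"/>
    <!-- The predicates sit on the reverse-axis step itself, so positions
         are counted from the nearest ancestor outwards and [1] is the
         deepest ancestor shared by every node in $nodes. -->
    <xsl:sequence
      select="$nodes[1]/ancestor::node()
                [every $n in $nodes satisfies exists($n/ancestor::node() intersect .)]
                [1]"/>
  </xsl:function>

  <xsl:template match="/">
    <xsl:value-of select="name(mf:lca-nearest(//foo))"/>
  </xsl:template>

</xsl:stylesheet>
```

Applied to the sample document above, this should select the anc2 element as well. Like the original, it looks only at proper ancestors, so for a single input node it returns that node's parent.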

Dimitre Novatchev · Accepted Answer · 2012-01-06T03:45:35.133

Here is a bottom-up approach:

 <xsl:function name="my:lca" as="node()?">
  <xsl:param name="pSet" as="node()*"/>

  <xsl:sequence select=
   "if(not($pSet))
      then ()
      else
       if(not($pSet[2]))
         then $pSet[1]
         else
           if($pSet intersect $pSet/ancestor::node())
             then
               my:lca($pSet[not($pSet intersect ancestor::node())])
             else
               my:lca($pSet/..)
   "/>
 </xsl:function>

A test:

<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:my="my:my">
    <xsl:output omit-xml-declaration="yes" indent="yes"/>

    <xsl:variable name="vSet1" select=
      "//*[self::A.1.1 or self::A.2.1]"/>

    <xsl:variable name="vSet2" select=
      "//*[self::B.2.2.1 or self::B.1]"/>

    <xsl:variable name="vSet3" select=
      "$vSet1 | //B.2.2.2"/>

 <xsl:template match="/">
     <xsl:sequence select="my:lca($vSet1)/name()"/>
     =========

     <xsl:sequence select="my:lca($vSet2)/name()"/>
     =========

     <xsl:sequence select="my:lca($vSet3)/name()"/>

 </xsl:template>

 <xsl:function name="my:lca" as="node()?">
  <xsl:param name="pSet" as="node()*"/>

  <xsl:sequence select=
   "if(not($pSet))
      then ()
      else
       if(not($pSet[2]))
         then $pSet[1]
         else
           if($pSet intersect $pSet/ancestor::node())
             then
               my:lca($pSet[not($pSet intersect ancestor::node())])
             else
               my:lca($pSet/..)
   "/>
 </xsl:function>
</xsl:stylesheet>

When this transformation is applied on the following XML document:

<t>
    <A>
        <A.1>
            <A.1.1/>
            <A.1.2/>
        </A.1>
        <A.2>
            <A.2.1/>
        </A.2>
        <A.3/>
    </A>
    <B>
        <B.1/>
        <B.2>
            <B.2.1/>
            <B.2.2>
                <B.2.2.1/>
                <B.2.2.2/>
            </B.2.2>
        </B.2>
    </B>
</t>

the wanted, correct result is produced for all three cases:

     A
     =========

     B
     =========

     t

Update: I have what I think is probably the most efficient algorithm.

The idea is that the LCA of a node-set is the same as the LCA of just two nodes of this node-set: the "leftmost" and the "rightmost" ones. The proof that this is correct is left as an exercise for the reader :)

Here is a complete XSLT 2.0 implementation:

<xsl:stylesheet version="2.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:my="my:my">
        <xsl:output omit-xml-declaration="yes" indent="yes"/>

        <xsl:variable name="vSet1" select=
          "//*[self::A.1.1 or self::A.2.1]"/>

        <xsl:variable name="vSet2" select=
          "//*[self::B.2.2.1 or self::B.1]"/>

        <xsl:variable name="vSet3" select=
          "$vSet1 | //B.2.2.2"/>

     <xsl:template match="/">
         <xsl:sequence select="my:lca($vSet1)/name()"/>
         =========

         <xsl:sequence select="my:lca($vSet2)/name()"/>
         =========

         <xsl:sequence select="my:lca($vSet3)/name()"/>

     </xsl:template>

     <xsl:function name="my:lca" as="node()?">
      <xsl:param name="pSet" as="node()*"/>

      <xsl:sequence select=
       "if(not($pSet))
          then ()
          else
           if(not($pSet[2]))
             then $pSet[1]
             else
              for $n1 in $pSet[1],
                  $n2 in $pSet[last()]
               return my:lca2nodes($n1, $n2)
       "/>
     </xsl:function>

     <xsl:function name="my:lca2nodes" as="node()?">
      <xsl:param name="pN1" as="node()"/>
      <xsl:param name="pN2" as="node()"/>

      <xsl:variable name="n1" select=
       "($pN1 | $pN2)
                    [count(ancestor-or-self::node())
                    eq
                     min(($pN1 | $pN2)/count(ancestor-or-self::node()))
                    ]
                     [1]"/>

      <xsl:variable name="n2" select="($pN1 | $pN2) except $n1"/>

      <xsl:sequence select=
       "$n1/ancestor-or-self::node()
                 [exists(. intersect $n2/ancestor-or-self::node())]
                     [1]"/>
     </xsl:function>
</xsl:stylesheet>

When this transformation is performed on the same XML document (above), the same correct result is produced, but much faster, especially if the node-set is large:

 A
 =========

 B
 =========

 t

Brilliant. It looks to me like Martin's code will also work, but that this will scale better, and will be more easily read by future colleagues. Thanks very much, will go and test it now! – Tom Hillman, Jan 05 '12 at 14:29
@yamahito: You are welcome. I edited my answer with a slightly changed solution (the `descendant::` axis is no longer used) that might be more efficient, because the set of ancestors is "linear", while the set of descendants may be "quadratic". – Dimitre Novatchev, Jan 05 '12 at 14:43
@yamahito: I updated my answer with what I think is probably one of the fastest algorithms: only two nodes are compared. With a large number of nodes it executes much faster than my previous algorithm and Martin's algorithm. – Dimitre Novatchev, Jan 06 '12 at 03:48
The observation that the LCA of a set of nodes is the same as the LCA of the two nodes that come first and last in document order is indeed very powerful (I wish I knew how to prove it...). However, I think my function for computing the LCA of two nodes may be better than Dimitre's on many implementations - though the only way to find out is to measure it. I think Dimitre's code also assumes that $pSet is in document order: to force this, one should probably form `./$pSet` . — Michael Kay, Jan 06 '12 at 11:05
hmm... will that still force document order even if I'm producing my node-set from a key? — Tom Hillman, Jan 06 '12 at 11:11
(I think the proof must be along the lines: if A is an ancestor of P and Q, then it is also an ancestor of every node between P and Q in document order. That's good enough a proof for me, though I doubt it would satisfy a mathematician.) — Michael Kay, Jan 06 '12 at 11:11
yamahito, the result of an expression containing a '/' is always in document order. — Michael Kay, Jan 06 '12 at 11:12
But if I'm forming the node-set from a key, the context node isn't well defined, I think. — Tom Hillman, Jan 06 '12 at 11:37
@yamahito: Yes. the W3C XSLT 2.0 Spec says: "The result of the function is a sequence of nodes, in document order and with duplicates removed" -- http://www.w3.org/TR/2007/REC-xslt20-20070123/#keys — Dimitre Novatchev, Jan 06 '12 at 13:11
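
If the node-set might reach the function in some other order (for example after `reverse()` or a sort), a thin wrapper can normalise it first. This is a sketch meant to be added to the two-node stylesheet above; the name `my:lca-ordered` is hypothetical. It uses `$pSet/.` because the body of a stylesheet function has no context item, so an expression that starts from `.` cannot be evaluated there.

```xml
<!-- Hypothetical wrapper: add inside the stylesheet that defines my:lca. -->
<xsl:function name="my:lca-ordered" as="node()?">
  <xsl:param name="pSet" as="node()*"/>
  <!-- "$pSet/." is a path expression, so its result is in document
       order with duplicates removed before my:lca picks first and last. -->
  <xsl:sequence select="my:lca($pSet/.)"/>
</xsl:function>
```

For a node-set that comes straight from `key()`, the wrapper is redundant, since that result is already in document order.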

score 0 · Answer 3 · answered Jan 05 '12 at 13:55

Martin's solution will work, but I think it could be quite expensive in some situations, with a lot of elimination of duplicates. I'd be inclined to use an approach that finds the LCA of two nodes, and then use this recursively, on the theory that LCA(x,y,z) = LCA(LCA(x,y),z) [a theory which I leave the reader to prove...].

Now LCA(x,y) can be found fairly efficiently by looking at the sequences x/ancestor-or-self::node() and y/ancestor-or-self::node(), truncating both sequences to the length of the shorter, and then finding the last position at which both hold the same node. In XQuery notation:

( let $ax := $x/ancestor-or-self::node()
  let $ay := $y/ancestor-or-self::node()
  let $len := min((count($ax), count($ay)))
  for $i in reverse(1 to $len)
  where $ax[$i] is $ay[$i]
  return $ax[$i]
)[1]
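
To apply this when the size of the node-set is not known in advance, the pairwise function can be folded over the sequence, following LCA(x,y,z) = LCA(LCA(x,y),z). What follows is a minimal XSLT 2.0 sketch, assuming all nodes belong to one document. The `ex` prefix, its namespace URI and the function names `ex:lca2` and `ex:lca` are made up for illustration.

```xml
<xsl:stylesheet version="2.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:ex="http://example.com/ex"
  exclude-result-prefixes="ex">

  <!-- LCA of exactly two nodes: the deepest position at which the two
       root-first ancestor chains hold the same node. -->
  <xsl:function name="ex:lca2" as="node()?">
    <xsl:param name="x" as="node()"/>
    <xsl:param name="y" as="node()"/>
    <xsl:variable name="ax" select="$x/ancestor-or-self::node()"/>
    <xsl:variable name="ay" select="$y/ancestor-or-self::node()"/>
    <xsl:variable name="len" select="min((count($ax), count($ay)))"/>
    <!-- Walk from the deepest shared depth upwards; keep the first match. -->
    <xsl:sequence
      select="(for $i in reverse(1 to $len)
               return $ax[$i][. is $ay[$i]])[1]"/>
  </xsl:function>

  <!-- Fold ex:lca2 over the sequence: replace the first two nodes by
       their LCA and recurse until at most one node is left. -->
  <xsl:function name="ex:lca" as="node()?">
    <xsl:param name="nodes" as="node()*"/>
    <xsl:sequence
      select="if (count($nodes) le 1)
              then $nodes
              else ex:lca((ex:lca2($nodes[1], $nodes[2]),
                           subsequence($nodes, 3)))"/>
  </xsl:function>

</xsl:stylesheet>
```

A call such as `ex:lca(key('refs', @id))` (the key name is hypothetical) returns the node itself when the key yields a single node, and the empty sequence when it yields none. Each recursive step shortens the sequence by one node, so the number of pairwise comparisons grows linearly with the size of the node-set.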

Michael Kay

Hi Michael, thanks for taking the time to look at this. I'm not sure how I could apply your answer in this scenario, though, as I don't know how many nodes there will be in the node-set (actually, in the vast majority of cases there will be just one), and I'm therefore unsure how I would recurse over pairs of nodes within that node-set (if there are any). Also, apologies for misspelling Kaysian in the question! – Tom Hillman Jan 05 '12 at 14:24
@Michael Kay: You may be interested in a faster algorithm -- I updated my answer with what I believe to be an optimal algorithm for LCA. – Dimitre Novatchev Jan 06 '12 at 03:51
