2

Here's a sample XML:

<?xml version="1.0" ?>
<someparent>
    <somechild>
        <description>I want this</description>
        <id>98</id>
    </somechild>
    <somechild>
        <description>I don't want this</description>
        <id>98</id>
    </somechild>
    <somechild>
        <description>I want this too</description>
        <id>2</id>
    </somechild>
    <somechild>
        <description>Nope, not that one</description>
        <id>2</id>
    </somechild>
    <somechild>
        <description>Not that one either</description>
        <id>2</id>
    </somechild>
    <somechild>
        <description>Yep, I want this</description>
        <id>41</id>
    </somechild>
</someparent>

The <id> elements are always grouped: all elements with the same <id> value follow each other in the document. I may have thousands of different <id>s in a single file. What I want is to find each <somechild> element that is the first occurrence of its corresponding <id> group. So my expected result would be:

    <somechild>
        <description>I want this</description>
        <id>98</id>
    </somechild>
    <somechild>
        <description>I want this too</description>
        <id>2</id>
    </somechild>
    <somechild>
        <description>Yep, I want this</description>
        <id>41</id>
    </somechild>

I need a single XPath expression to select all of these "first items in a group". I have tried various combinations of the following-sibling and preceding-sibling axes, but I can't get it quite right. I have come very close to what I want to achieve with the following expression:

//someparent/somechild/id[text()=parent::somechild/preceding-sibling::somechild/id[text()]]/parent::somechild

This actually returns all the nodes I don't want, as it selects all the items that are not the first in their group (so it's essentially a perfect negative of what I want!). But for the life of me, I haven't been able to figure out how to reverse the results.

Any help would be greatly appreciated.

Filipus

3 Answers

3

This O(n^2) XPath 1.0 expression,

//someparent/somechild[not(id = preceding-sibling::somechild/id)]

will select all somechild elements that have no preceding sibling with the same id value,

    <somechild>
        <description>I want this</description>
        <id>98</id>
    </somechild>
    <somechild>
        <description>I want this too</description>
        <id>2</id>
    </somechild>
    <somechild>
        <description>Yep, I want this</description>
        <id>41</id>
    </somechild>

as requested.
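
To make the predicate's behaviour concrete, here is a minimal Python sketch of what it computes. It is an illustration only, not the XPath engine's actual implementation: Python's standard xml.etree.ElementTree supports only a limited XPath subset without the preceding-sibling axis, so the comparison is written as a plain loop. The function name first_in_group_quadratic is made up for this example.

```python
import xml.etree.ElementTree as ET

# The sample document from the question, condensed.
XML = """<someparent>
    <somechild><description>I want this</description><id>98</id></somechild>
    <somechild><description>I don't want this</description><id>98</id></somechild>
    <somechild><description>I want this too</description><id>2</id></somechild>
    <somechild><description>Nope, not that one</description><id>2</id></somechild>
    <somechild><description>Not that one either</description><id>2</id></somechild>
    <somechild><description>Yep, I want this</description><id>41</id></somechild>
</someparent>"""


def first_in_group_quadratic(parent):
    """Mimic somechild[not(id = preceding-sibling::somechild/id)] (hypothetical helper)."""
    children = parent.findall("somechild")
    selected = []
    for index, child in enumerate(children):
        current_id = child.findtext("id")
        # Every preceding sibling is scanned; this inner pass is the O(n^2) part.
        seen_before = any(
            earlier.findtext("id") == current_id for earlier in children[:index]
        )
        if not seen_before:
            selected.append(child)
    return selected


root = ET.fromstring(XML)
descriptions = [c.findtext("description") for c in first_in_group_quadratic(root)]
print(descriptions)
```

On the sample document this prints the three "wanted" descriptions, matching the result above.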


Update

Michael Kay helpfully noted that the above XPath has an algorithmic complexity of O(n^2), because each sibling is compared against all of its preceding siblings. This won't matter for small numbers of siblings, but OP mentioned thousands, so scalability becomes a concern.

See his XPath 3.1 solution, which has a much better O(n) complexity.

He further observed that an O(n) XPath 1.0 expression is possible as long as only the immediately preceding sibling has to be checked:

//someparent/somechild[not(id = preceding-sibling::somechild[1]/id)]
                                                            ^^^

This lower complexity XPath will yield the same results for OP's sample case.

A differentiating case would involve later siblings with id values that repeat earlier clusters of id values. For example, adding another cluster of id siblings with 98 values:

<someparent>
  <somechild>
    <description>I want this</description>
    <id>98</id>
  </somechild>
  <somechild>
    <description>I don't want this</description>
    <id>98</id>
  </somechild>
  <somechild>
    <description>I want this too</description>
    <id>2</id>
  </somechild>
  <somechild>
    <description>Nope, not that one</description>
    <id>2</id>
  </somechild>
  <somechild>
    <description>Not that one either</description>
    <id>2</id>
  </somechild>
  <somechild>
    <description>Yep, I want this</description>
    <id>41</id>
  </somechild>
  <somechild>
    <description>REPEAT CASE 1</description>
    <id>98</id>
  </somechild>  
  <somechild>
    <description>REPEAT CASE 2</description>
    <id>98</id>
  </somechild>
</someparent>

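The difference is that the O(n^2) XPath will not include the REPEAT CASE 1 somechild element, because an earlier sibling already has id 98, but the O(n) XPath will include the distantly repeated REPEAT CASE 1, because its immediately preceding sibling has id 41: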

<somechild>
    <description>I want this</description>
    <id>98</id>
</somechild>
<somechild>
    <description>I want this too</description>
    <id>2</id>
</somechild>
<somechild>
    <description>Yep, I want this</description>
    <id>41</id>
</somechild>
<somechild>
    <description>REPEAT CASE 1</description>
    <id>98</id>
</somechild>

As long as non-immediate comparisons are not required, use the more efficient O(n) XPath.
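
The contrast between the two expressions on the repeated-cluster document can be sketched in Python. This is a hedged illustration of the two predicates' semantics, written as plain list logic because the standard library's ElementTree does not support the preceding-sibling axis. The helper names select_quadratic and select_linear are invented for this example.

```python
import xml.etree.ElementTree as ET

# (description, id) pairs from the repeated-cluster example.
ROWS = [
    ("I want this", "98"),
    ("I don't want this", "98"),
    ("I want this too", "2"),
    ("Nope, not that one", "2"),
    ("Not that one either", "2"),
    ("Yep, I want this", "41"),
    ("REPEAT CASE 1", "98"),
    ("REPEAT CASE 2", "98"),
]
XML = "<someparent>" + "".join(
    f"<somechild><description>{d}</description><id>{i}</id></somechild>"
    for d, i in ROWS
) + "</someparent>"


def select_quadratic(children):
    # not(id = preceding-sibling::somechild/id): compare with ALL earlier siblings.
    return [
        c for k, c in enumerate(children)
        if all(p.findtext("id") != c.findtext("id") for p in children[:k])
    ]


def select_linear(children):
    # not(id = preceding-sibling::somechild[1]/id): compare with the nearest
    # earlier sibling only; the first child has none, so it is always kept.
    return [
        c for k, c in enumerate(children)
        if k == 0 or children[k - 1].findtext("id") != c.findtext("id")
    ]


children = ET.fromstring(XML).findall("somechild")
quadratic = [c.findtext("description") for c in select_quadratic(children)]
linear = [c.findtext("description") for c in select_linear(children)]
print(quadratic)
print(linear)
```

Only the linear variant keeps REPEAT CASE 1, since it never looks further back than one sibling.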

kjhughes
  • Absolutely brilliant. Kicking myself in the proverbial backside for not figuring it out by myself (now that I see the solution, obviously). I was looking way too deep into this. Thanks @kjhughes! – Filipus Jul 02 '20 at 14:03
  • Do be aware that it's likely to have O(n^2) performance: don't use it if you have thousands of items. For a scalable solution, use XSLT 2.0 with ``. – Michael Kay Jul 02 '20 at 17:46
  • @MichaelKay: If OP has XPath only (possible, since no XSLT is listed or tagged in the question), is there a different XPath you'd recommend to improve on the O(n^2) complexity? – kjhughes Jul 02 '20 at 18:09
  • XPath 3.1 solution offered below – Michael Kay Jul 02 '20 at 18:27
  • @MichaelKay: Thank you! Now if only we could get all devs to upgrade to the cool functional features of XPath 3.1... – kjhughes Jul 02 '20 at 19:05
  • Actually, I think we both missed something here. The Q states "all elements with the same value follow each other in the document". Therefore if an element has a different ID from the immediately preceding sibling, it is different from all preceding siblings. Therefore in your solution, `preceding-sibling::somechild` can be replaced with `preceding-sibling::somechild[1]`, which makes the solution O(n). – Michael Kay Jul 03 '20 at 08:28
  • @MichaelKay: Good insight! I've added an explanation to the answer. Thank you. – kjhughes Jul 03 '20 at 14:27
  • @MichaelKay & kjhughes: excellent suggestions, both of you. You are right in noting that the Q states "all elements with the same value follow each other". It is, in fact, an ordered list. One thing I don't understand though is your use of the *preceding-sibling::somechild[1]* statements. It works, but it seems to me this would always compare with the very first *somechild* element. I would have thought that *preceding-sibling::somechild[position()-1]* would have been more appropriate, to only compare with the last element, but that doesn't work and I wonder why. – Filipus Jul 06 '20 at 16:45
  • Think of the index in `preceding-sibling::somechild[n]` as being `n` away from the current node, not as being in document order. – kjhughes Jul 06 '20 at 16:52
  • Numeric predicates in axis steps refer to positions in axis order, not in document order. One of those little quirks... I'm afraid that with XPath "I would have thought" reasoning can often let you down. – Michael Kay Jul 06 '20 at 17:52
  • But don't be frightened of reading the official spec: it's quite approachable. https://www.w3.org/TR/xpath-31/#id-predicate – Michael Kay Jul 06 '20 at 17:55
1

In XPath 3.1:

fold-left(//somechild, (), function($z, $i) {
    if ($i/id = $z[last()]/id) then $z else ($z, $i)
})

Unlike the accepted solution, this should have O(n) complexity (assuming that X[last()] executes in constant time).
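
For readers without an XPath 3.1 processor, the same left-fold idea can be sketched in Python with functools.reduce. This is an analogy under stated assumptions, not a translation any XPath engine performs: the helper keep_if_new_group is invented for this example, and it appends to the accumulator in place (where the XPath builds a new sequence) so that each step stays constant time.

```python
from functools import reduce
import xml.etree.ElementTree as ET

# The sample document from the question, condensed.
XML = """<someparent>
    <somechild><description>I want this</description><id>98</id></somechild>
    <somechild><description>I don't want this</description><id>98</id></somechild>
    <somechild><description>I want this too</description><id>2</id></somechild>
    <somechild><description>Nope, not that one</description><id>2</id></somechild>
    <somechild><description>Not that one either</description><id>2</id></somechild>
    <somechild><description>Yep, I want this</description><id>41</id></somechild>
</someparent>"""


def keep_if_new_group(kept, item):
    """Fold step (hypothetical helper): same test as `$i/id = $z[last()]/id`."""
    # Only the last kept element is inspected, so each step is O(1).
    if kept and kept[-1].findtext("id") == item.findtext("id"):
        return kept
    kept.append(item)  # in-place append avoids copying the accumulator
    return kept


root = ET.fromstring(XML)
# root.iter("somechild") plays the role of //somechild (document order).
firsts = reduce(keep_if_new_group, root.iter("somechild"), [])
fold_descriptions = [e.findtext("description") for e in firsts]
print(fold_descriptions)
```

Because the accumulator only ever holds group leaders, comparing against its last element is equivalent to comparing against the previous group's id, which is what keeps the whole pass linear.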

Michael Kay
0

Another syntax, similar to the solution proposed by @kjhughes:

//id[not(text()=preceding::id/text())]/..

Another solution:

//id[text()!=preceding::id[1]/text() or count(preceding::id)=0]/..

This selects each id whose nearest preceding id has a different value, then selects its parent. The count() test is needed for the first id in the document: it has no preceding id, so the != comparison against an empty node-set would be false.

Of course, // could be replaced with an absolute path to improve efficiency.

E.Wiest