XPath: finding nodes duplicated n times with a single path expression query

Question

I am practising writing some XPath queries and am stuck on one in particular. Below is the sample document I am using:

<dept-db>
  <dept>
    <name>HR</name>
    <emp>
      <name>John</name>
      <country>USA</country>
    </emp>
    <emp>
      <name>Chris</name>
      <country>USA</country>
    </emp>
  </dept>
  <dept>
    <name>Technology</name>
    <emp>
      <name>Oliver</name>
      <country>UK</country>
    </emp>
    <emp>
      <name>Emily</name>
      <country>USA</country>
    </emp>
  </dept>
</dept-db>

What I want to achieve is to retrieve all employees whose country appears more than twice in the document. I started with a simpler query, namely one which is supposed to find duplicates:

doc("emp.xml")//emp[preceding::emp/country=./country or following::emp/country=./country]

However, it returns all the employees, although Oliver should obviously not be among the results.

I'm new to XPath and am not quite sure if I get the concept of the dot '.' specifier right. I expect the aforementioned query to behave like this: iterate over the set of emp nodes and for each check if there's an employee with the same country among the nodes that appear above and below the current one in the document.

I'd be thankful for an explanation (the application of the dot specifier to perform GROUP BY kind of queries) and help with getting the query to work (unless it is not possible with a single path expression?). If it matters, I'm using eXide (part of eXist-db 2.1) with XQuery 3.0 to perform queries.

score 5 · Answer 1 · answered Feb 09 '14 at 22:42

In XPath 2.0, you can do

//emp[count(index-of(//country/text(), country/text())) > 2]

index-of returns the positions at which the employee's country/text() occurs among all country text nodes in the document; all we then need to do is count those positions and check that there are more than 2.
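To see what index-of contributes, here is a rough Python sketch of the same logic, not a replacement for the XPath. The helper `index_of` and the constant `SAMPLE` are names made up for this illustration; `SAMPLE` is the document from the question, and the standard-library ElementTree parser stands in for an XPath 2.0 engine.

```python
import xml.etree.ElementTree as ET

# Sample document from the question (whitespace condensed).
SAMPLE = """<dept-db>
  <dept><name>HR</name>
    <emp><name>John</name><country>USA</country></emp>
    <emp><name>Chris</name><country>USA</country></emp>
  </dept>
  <dept><name>Technology</name>
    <emp><name>Oliver</name><country>UK</country></emp>
    <emp><name>Emily</name><country>USA</country></emp>
  </dept>
</dept-db>"""


def index_of(seq, target):
    """Hypothetical stand-in for fn:index-of: 1-based positions of target in seq."""
    return [pos for pos, value in enumerate(seq, start=1) if value == target]


root = ET.fromstring(SAMPLE)

# Counterpart of //country/text(): every country value in document order.
all_countries = [c.text for c in root.iter("country")]

# Counterpart of the predicate count(index-of(...)) > 2, evaluated per emp.
frequent = [
    emp.findtext("name")
    for emp in root.iter("emp")
    if len(index_of(all_countries, emp.findtext("country"))) > 2
]

print(index_of(all_countries, "USA"))  # positions of "USA" among all countries
print(frequent)
```

For the sample, "USA" sits at positions 1, 2 and 4, so the count is 3 and the three USA employees pass the test, while Oliver (one position only) does not.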

Robin

Apparently the problem was not in my query itself, as I re-ran it in another environment and it worked (see my comment to Jens Erat's answer). Thank you for letting me know about this alternative solution though :) – Quintofron Feb 10 '14 at 01:07

score 3 · Answer 2 · answered Feb 10 '14 at 12:20

If you are stuck with XQuery 1.0, you can do it in a single expression, but you need to bind the source document to a variable. I have used $src. This works because you effectively access the source document twice and join in the predicate:

$src//emp[let $emp-country := country return count($src//data(country)[. = $emp-country]) > 2]

You could also rewrite this, to make it a little clearer:

let $all-countries := $src//data(country)
return
    $src//emp[let $emp-country := country return count($all-countries[. = $emp-country]) > 2]

score 2 · Answer 3 · Jens Erat · 2014-02-09T22:44:51.250

As you're able to use XQuery 3.0's group by clauses, I'd go for that. This query groups the employees by country and only returns those from countries that occur more than two times:

for $employee in //emp
let $country := $employee/country
group by $country
where count($employee) > 2
return $employee
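If the grouping semantics are unfamiliar, the following Python sketch mimics them: build one bucket per country, keep only the buckets with more than two members, and return their employees. It only illustrates the idea and is not how an XQuery processor implements group by; `SAMPLE` and `groups` are names chosen for this example.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Sample document from the question (whitespace condensed).
SAMPLE = """<dept-db>
  <dept><name>HR</name>
    <emp><name>John</name><country>USA</country></emp>
    <emp><name>Chris</name><country>USA</country></emp>
  </dept>
  <dept><name>Technology</name>
    <emp><name>Oliver</name><country>UK</country></emp>
    <emp><name>Emily</name><country>USA</country></emp>
  </dept>
</dept-db>"""

root = ET.fromstring(SAMPLE)

# "group by $country": one bucket of employee names per distinct country.
groups = defaultdict(list)
for emp in root.iter("emp"):
    groups[emp.findtext("country")].append(emp.findtext("name"))

# "where count($employee) > 2": keep only the large groups, then flatten.
result = [name for names in groups.values() if len(names) > 2 for name in names]

print(dict(groups))
print(result)
```

After grouping, the employee variable refers to the whole bucket rather than a single node, which is why count($employee) in the query measures the group size.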

Regarding your approach:

I cannot reproduce any issues with your query. Using eXist-db's online demo, I'm not getting any "Oliver" in the results. It also works fine using BaseX and Zorba. Are you sure there is no second UK employee in your document?
You wrote "whose country appears more than twice": this is what I implemented above. Looking at your query, you might have wanted "at least twice"? If so, change the 'where' clause to fit your requirements. If not, the problem in your query is that you might want to use 'and' instead of 'or', but this will omit the first and last employee for that country.
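The difference between 'or' and 'and' in the original predicate can be checked with a small Python sketch. It assumes that emp elements are never nested, so that "preceding" and "following" simply mean earlier and later in document order; `duplicates` and `SAMPLE` are hypothetical names for this illustration.

```python
import xml.etree.ElementTree as ET

# Sample document from the question (whitespace condensed).
SAMPLE = """<dept-db>
  <dept><name>HR</name>
    <emp><name>John</name><country>USA</country></emp>
    <emp><name>Chris</name><country>USA</country></emp>
  </dept>
  <dept><name>Technology</name>
    <emp><name>Oliver</name><country>UK</country></emp>
    <emp><name>Emily</name><country>USA</country></emp>
  </dept>
</dept-db>"""


def duplicates(xml_text, combine):
    """Names of employees for which combine(before, after) is true."""
    emps = ET.fromstring(xml_text).findall(".//emp")
    countries = [emp.findtext("country") for emp in emps]
    selected = []
    for i, emp in enumerate(emps):
        before = countries[i] in countries[:i]      # preceding::emp/country = ./country
        after = countries[i] in countries[i + 1:]   # following::emp/country = ./country
        if combine(before, after):
            selected.append(emp.findtext("name"))
    return selected


with_or = duplicates(SAMPLE, lambda before, after: before or after)
with_and = duplicates(SAMPLE, lambda before, after: before and after)

print(with_or)
print(with_and)
```

With 'or', every employee whose country occurs at least twice is selected (John, Chris and Emily). With 'and', only Chris remains, because John has no earlier match and Emily has no later one.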

edited Feb 09 '14 at 22:44

answered Feb 09 '14 at 22:38

Jens Erat

Thanks for mentioning BaseX, I ran my query there and it worked indeed. Previously I used a local version of eXist-db and apparently it produces different results in this case (and a few other queries that I've checked), no idea why. As for the query you propose, is group by .. where .. a correct expression? The BaseX XQuery processor returns [XPST0003] Expecting valid expression after 'group by'. – Quintofron Feb 10 '14 at 01:01
This code is definitely a correct expression. Actually, I wrote that code inside BaseX. What version do you have problems in, and what's your exact query? Have you created and opened the file as a database, or how do you access it? – Jens Erat Feb 10 '14 at 01:08
I've tried both the Live demo and BaseX 7.0.2 (current version in the aptitude repository). I've created a database including the .xml file and opened it via GUI. – Quintofron Feb 10 '14 at 01:25
Just realized that the XQuery expression also uses extended FLWOR expressions. BaseX 7.0.2 is really old (from 2011); if you want to use BaseX, consider manually loading a newer jar file, which comes with lots of improvements. – Jens Erat Feb 10 '14 at 10:31