XQUERY: optimizing a query that touches large collections

Question

I have the following XQUERY running in ExistDB (against XML documents that follow the TEI schema):

xquery version "3.1";

declare namespace tei="http://www.tei-c.org/ns/1.0";

let $data-collection := "/db/apps/deheresi/resources/documents"
let $people-collection := "/db/apps/deheresi/resources/documents/codes_people.xml"

for $msdoc in collection($data-collection)/tei:TEI[contains(@xml:id,'ms609')]

for $ordinal in $msdoc/tei:text/tei:front//tei:div[@type='registry_ordinal']/replace(@n, '#', '')

for $doctype in $msdoc/tei:text/tei:front//tei:div[@type='doc_type']/replace(@subtype, '#', '')

for $folio in $msdoc/tei:text/tei:front//tei:div[@type='folio']/replace(@n, '#', '')

for $nameref in $msdoc/tei:text/tei:body[1]/tei:p[1]/tei:seg[1]/tei:persName[@role = 'dep']/replace(@nymRef, '#', '') 

for $persname in normalize-space(string-join(doc($people-collection)//tei:person[@xml:id = $nameref]))

return concat('<td>',$ordinal,'</td><td>',$folio,'</td><td>',$doctype,'</td><td>',$persname,'</td>')

Organization of XML documents:

There are 700+ TEI documents, each with <TEI xml:id="foo_1.xml"> as the root node (document identifier increments foo_1.xml, foo_2.xml, foo_3.xml, etc.) (always in the same place)
Each TEI document contains a single unique element identifying a person <persName role="dep" nymRef="#unique_foo_name"> (not always in the same place in a document)
A separate XML document codes_people.xml that contains 1500+ xml:ids of distinct people

The function does the following:

get the identifying tei:TEI/@xml:id and the tei:persName[@role="dep"]/@nymRef from each xml document
With the tei:persName[@role="dep"]/@nymRef I look up the the name in codes_people.xml/tei:person/xml:id="unique_foo_name"

This all returns the expected results...except it's really, really slow (4 seconds). Obviously I'm testing on a local computer and not a server, but I would like to optimize the queries before testing on more powerful servers.

ADDED PER REQUEST:

ExistDB version : 3.3.0

Sample output (the eventual target is an HTML table)

<td>0001</td><td>1r</td><td>Deposition</td><td>Arnald Garnier</td> 
<td>0002</td><td>1r</td><td>Deposition</td><td>Guilhem de Rosengue</td> 
<td>0003</td><td>1r</td><td>Deposition</td><td>Hugo de Mamiros</td> 
<td>0004</td><td>1r</td><td>Deposition</td><td>P Lapassa senior</td>

Many thanks in advance.

EDIT: I've added more information in a self-response below, and a link to all the files in Dropbox in the comments.

I optimised one piece of code, changing `$nameref` to refer to `in $msdoc/` instead of going through the collection again. It's a bit fast but still not ideal. — jbrehr, Oct 08 '18 at 20:16
Two questions to help me understand your environment: Which version of eXist? Have you defined any indexes on the collections containing this TEI data? Lastly, can you add one example of what an expected result is for this query? — Joe Wicentowski, Oct 08 '18 at 20:55
I've added the exist DB version and some sample output. I've also update the code to reflect the fastest version I've tested - and perhaps easier for you to read. Finally, I haven't added indexes yet... — jbrehr, Oct 08 '18 at 21:21
I added `` and `` to `` in the config file. Re-indexed and reran the Xquery with the Query Prolifer on. It tells me that it used `range`but `optimization = no index`. Introducing these indexes did not improve the query speed. — jbrehr, Oct 08 '18 at 21:45
Also there are so many `for` loops, that it is hard to understand the cardinality of your join relationships without seeing representative data. However, I doubt you really need that many loops! — adamretter, Oct 09 '18 at 05:21
I've put the xml files into a .zip available here https://www.dropbox.com/s/2b8kw8b3pvpks4j/stackoverflowdocs.zip?dl=0 — jbrehr, Oct 09 '18 at 06:07
The first 5 `for` loops traverse the same document getting a variety of attribute values, the 6th then does a look-up and retrieval from another document (getting a printable name from text() values). If I remove the 6th `for` loop the query executes in less than 2 seconds. — jbrehr, Oct 09 '18 at 06:19
I've added a bounty to this question...following up my posting the data files requested. Hopefully someone identifies a solution (taking into account the simplifications I posted in response below). Solutions involving Xquery and eXist indexing are welcome. — jbrehr, Oct 10 '18 at 21:20

duncdrum · Accepted Answer · 2018-10-14T23:40:42.297

So there are a few problems with your code, that impact performance. The first is the fact that you process strings instead of xml paths. e.g. when using replace() instead of of some/@path[. = 'xyz']. Simply by using fn:id() instead of replace() will cut execution times to under 1 second.

The second is a missing xmlschema namespace declaration in your index config file, not that those indexes are used, since you are forcing exist to process strings instead of xml.

The third is the fact that your xquery code does not return a well-formed xml fragment, always a bad idea for performance reasons.

xquery version "3.1";
declare namespace tei="http://www.tei-c.org/ns/1.0";

declare variable $data-collection := "/db/apps/so-52709411/data";

(:always return a well-formed fragment:)
 <table>{

 let $people-collection := doc($data-collection || "/codes_people.xml")

 let $msdoc := collection($data-collection)//tei:TEI[contains(@xml:id,'ms609')]

 for $n in $msdoc
 let $registry := $n//tei:div[@type='registry_ordinal']/data(@n)
 let $type := $n//tei:div[@type='doc_type']/data(@subtype)
 let $folio := $n//tei:div[@type='folio']/data(@n)
 let $nym := substring-after($n//tei:persName[@role = 'dep']/data(@nymRef), '#') 
 let $persName := $people-collection//id($nym)/tei:persName

 return
<tr>
<td>{$registry}</td>
<td>{$type}</td>
<td>{$folio}</td>
<td>{$persName/string()
}</td>
</tr>

}
 </table>

combined with

<collection xmlns="http://exist-db.org/collection-config/1.0">
<index xmlns:tei="http://www.tei-c.org/ns/1.0" xmlns:xs="http://www.w3.org/2001/XMLSchema">
    <range>
       <create qname="tei:persName" type="xs:string"/>
       <create qname="tei:person" type="xs:string"/>
       <create qname="@type" type="xs:string"/>
       <create qname="@role" type="xs:string"/>
       <create qname="@nymRef" type="xs:string"/>
    </range>
</index>
<triggers>
    <trigger class="org.exist.extensions.exquery.restxq.impl.RestXqTrigger"/>
</triggers>
 </collection>

results in indexes actually being usable

but the data sample isn't really big enough to make much of a performance impact beyond the rewritten xquery. So even without indexes you should be running in the less then 1s range (depending on memory, hardware etc, YMMV)

you can download a working app running with your code here

This is a really helpful answer, thank you for spending time not only to help along the code but also to explain some fundamental errors. Thank you. — jbrehr, Oct 15 '18 at 06:05

jbrehr · Answer 2 · 2018-10-09T07:42:12.920

I've tried to simplify the Xquery by replacing certain for loops with let and concat():

xquery version "3.1";

declare namespace tei="http://www.tei-c.org/ns/1.0";

declare variable $people-collection := doc("/db/apps/deheresi/resources/documents/codes_people.xml");

let $data-collection := "/db/apps/deheresi/resources/documents"

for $msdoc in collection($data-collection)/tei:TEI[contains(@xml:id,'ms609')]

    let $concat1 := concat('<td>',
               $msdoc//tei:div[@type='registry_ordinal']/replace(@n, '#', ''), 
               '</td><td>', 
               $msdoc//tei:div[@type='doc_type']/replace(@subtype, '#', ''), 
               '</td><td>',
               $msdoc//tei:div[@type='folio']/replace(@n, '#', ''),
               '</td><td>')

    (:  obtain the attribute value of persName[@role = 'dep']/@nymRef   :)
    let $nameref := $msdoc//tei:persName[@role = 'dep']/replace(@nymRef, '#', '') 

    (:  now use the attribute value to lookup a printable name using xml:id in document codes_people.xml :)
    let $persname := normalize-space(string-join($people-collection//tei:person[@xml:id = $nameref]))

return concat($concat1,$persname,'</td>')

These adjustments removed .5 seconds from the query execution time (now 3.5 seconds).

If I strip out the final lookup ($persname) the query executes in .17 seconds. The lookup to the file codes_people.xml appears to be the bottle neck.

EDIT: I've added the following indexes affecting the relevant elements and they have not produced any optimisations

<collection xmlns="http://exist-db.org/collection-config/1.0">
    <index xmlns:tei="http://www.tei-c.org/ns/1.0">
        <range>
           <create qname="tei:persName" type="xs:string"/>
           <create qname="tei:person" type="xs:string"/>
        </range>
    </index>
    <triggers>
        <trigger class="org.exist.extensions.exquery.restxq.impl.RestXqTrigger"/>
    </triggers>
</collection>

View from Query Profiler:

Depending on the size of codes_people.xml, it might be worth simply loading that as your $people-collection variable and save repeatedly loading the document. — chrisis, Oct 09 '18 at 07:07
I've updated the example using @chrisis 's suggestion, but it does not affect performance. — jbrehr, Oct 09 '18 at 07:21

XQUERY: optimizing a query that touches large collections

2 Answers2