xquery optimizing for multiple parameters on documents in hierarchy

Question

This may be a difficult or overly complex question to answer, but I pose it in the hope of speeding up a query that applies criteria to two sets of documents linked in a hierarchy. I am using XQuery 3.1 under eXist-db 4.7.

The database contains TEI XML documents which are divided into two categories: header and exemplum, where each header is the "master" document to a subset of exemplum. For example, the header document contains the author of all the exemplum under it. There are about 100 headers, and each header has between 50 and 300 exemplum under it.

The query receives criteria from an HTML form that lets users search exemplum using criteria from both header and exemplum. I indicate the structure of the documents, with the applicable fields, below.

Example of header:

<bibl xml:id="TC0001" type="header" subtype="published">
    <title type="long">some long title</title>
    <title type="short">some short title</title>
    <author nymRef="stephanus_de_borbone"/>
    <affiliation corresp="dominican"/>
    <date notBefore="1267" notAfter="1290"/>
    ...
</bibl>

Example of exemplum (linked to header through @corresp):

<TEI xml:id="TE003679" corresp="TC0001" type="exemplum" subtype="published">
    <text>
       <front>
          <div type="source-text">
              <p>Nulla a mauris urna. Suspendisse urna felis, suscipit consectetur aliquam ac, fermentum sit amet eros. Morbi semper, nisl ac tincidunt laoreet, nulla dui interdum magna, quis rhoncus arcu lectus sit amet ex. Nulla ut malesuada augue, vel hendrerit quam. </p>
          </div>
          <div type="allegory" n="y"/>
          <div type="keywords">
              <list>
                <item type="keyword" corresp="KW0003"/>
                <item type="keyword" corresp="KW0078"/>
                <item type="keyword" corresp="KW0537"/>
                <item type="keyword" corresp="KW1972"/>
              </list>
          </div>
       </front>
       <body>
          <p xml:lang="fr">As main soit tu elle. Fenetres jet feu quarante galopent but. Souvenirs corbeille chambrees vif demeurons gaillards oui. Son les noircir eau murmure entiere abattit puisque lettres. Cime la soir ai arcs sons. Remarquent petitement ah on diplomates cathedrale. </p>
          <p xml:lang="it">Nervi cigli di farai oblio buone le ti veste. Fanciullo lavorando ha ho melagrani osservava rivederci si strappato da. Punge tardi verra al in passa ed te. Comprendi ch po distrutta statuario. Col ascoltami rammarico oltremare ama. Forse sta bel campo andro sapro. Salvata su seconda divieto ritrovi ai. </p>
          <p xml:lang="en">Can curiosity may end shameless explained. True high on said mr on come. An do mr design at little myself wholly entire though. Attended of on stronger or mr pleasure. Rich four like real yet west get. Felicity in dwelling to drawings. His pleasure new steepest for reserved formerly disposed jennings. </p>
       </body>
    </text>
</TEI>

The request comes in the form of parameters that are scrubbed and set into sequences of strings to apply in the query. The user is unlikely to submit all of these at once, but I list them all here to illustrate the possible combinations of parameters acting on the header and the exemplum at different stages of the query process:

 let $paramHeader := ("TC0003", "TC0019")
 let $paramAuthor := ("stephanus_de_borbone", "johannes_gobi")
 let $paramAffil := "dominican"
 let $paramBegDate := "1245"
 let $paramEndDate := "1300"
 let $paramAlleg := "y"
 let $paramKeyword := ("KW0002", "KW0034")
 let $paramTerms := "sta*"

All of the elements/attributes affected by parameters have been indexed in eXist. The first step is to apply the relevant parameters to the header:

  let $headers := 
        (: an empty-string parameter means "not submitted" and matches every header :)
        for $h in $mydb//bibl[$paramHeader = ("", @xml:id) and @subtype="published"]
        where  $h/author[$paramAuthor = ("", @nymRef)] and 
               $h/affiliation[$paramAffil = ("",@corresp)] and
               $h/date[(@notBefore lt $paramBegDate and @notAfter gt $paramBegDate) or (@notBefore lt $paramEndDate and @notAfter gt $paramEndDate)]    
        return $h
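
To make the optional-parameter idiom used above easier to follow, here is a minimal, self-contained sketch. The inline `$sample` data and the variables `$noAuthor` and `$oneAuthor` are made up for illustration and are not part of my real query; the sketch only shows how comparing a parameter against `("", @attr)` behaves when the parameter is empty versus set.

```xquery
xquery version "3.1";

(: Hypothetical inline sample data standing in for the real header documents :)
let $sample :=
    <headers>
        <bibl xml:id="TC0001" subtype="published"><author nymRef="stephanus_de_borbone"/></bibl>
        <bibl xml:id="TC0002" subtype="published"><author nymRef="johannes_gobi"/></bibl>
    </headers>

(: An empty string stands for "parameter not submitted" :)
let $noAuthor := ""
let $oneAuthor := "johannes_gobi"

return (
    (: "" = ("", @nymRef) is always true, so every header passes :)
    count($sample/bibl[author[$noAuthor = ("", @nymRef)]]),
    (: only headers whose @nymRef equals the submitted value pass :)
    $sample/bibl[author[$oneAuthor = ("", @nymRef)]]/@xml:id/string()
)
```

The first expression counts both headers because the empty parameter matches everything; the second returns only the id of the header whose author matches.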

From this result I can extract the header/@xml:id to apply as criteria to exemplum:

let $headids := distinct-values($headers/@xml:id)

These are then submitted to a query which uses eXist's ft:query to perform a Lucene-based full-text search (I apply it only to p elements):

 let $query := <query>{for $t in $paramTerms
                      return <wildcard>{normalize-space(lower-case($t))}</wildcard>}</query>
 (: apply full text query :)
 let $luchits := $mydb//TEI//p[ft:query(.,$query)]
 (: then filter those hits with criteria from exemplum parameters, using /ancestor :)
 return 
      for $luchit in $luchits 
      where $luchit/ancestor::TEI[@corresp=($headids) and @subtype="published"] and
            $luchit/ancestor::text/front/div[@type="allegory" and  $paramAlleg = ("", @n)] and
            $luchit/ancestor::text//item[@type="keyword" and $paramKeyword = ("", @corresp)]
      return $luchit
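
For reference, this is a small standalone sketch of the XML query element that gets passed to ft:query. It only builds the element and does not call ft:query, so it runs without the database. The second search term is a made-up value, included to show the effect of lower-case and normalize-space.

```xquery
xquery version "3.1";

(: Sample search terms; the second one is hypothetical and only shows normalisation :)
let $paramTerms := ("sta*", "  Fanc*  ")
return
    <query>{
        for $t in $paramTerms
        return <wildcard>{normalize-space(lower-case($t))}</wildcard>
    }</query>
```

Each submitted term becomes its own wildcard element inside a single query element, lower-cased and with surrounding whitespace trimmed.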

The Lucene query in eXist is super fast, and for that reason I apply ft:query first and then apply where statements to the results. Doing this proved much faster than applying Lucene at the very end.

Depending on the criteria, the query can take anywhere from 1 to 12 seconds to run. I'd like to see whether I can shave down the upper end of that range by improving basic query technique.

Many thanks in advance.
