I need to filter documents by date (last week, last month, etc.) with MarkLogic 8. The database contains 1.3 million XML documents.

The documents look like this:

<work datum_gegenereerd="2015-06-10" gegenereerd="2015-06-10T14:28:48" label="gmb-2015-12000">
 ...

I've created an element attribute range index on work/@datum_gegenereerd (scalar type date).

The following query works but is slow (3 seconds):

xquery version "1.0-ml";
for $a in //work
where xs:date($a/@datum_gegenereerd) > current-date() - 5 * xs:dayTimeDuration('P1D')
return
<hit>{base-uri($a)}</hit>

After a lot of experimenting, it turns out that I can get the run time down to 0.02 seconds by removing the xs:date cast from the where clause:

xquery version "1.0-ml";
for $a in //work
where $a/@datum_gegenereerd > current-date() - 5 * xs:dayTimeDuration('P1D')
return
<hit>{base-uri($a)}</hit>

Can anyone explain this behaviour?


Update:
When I delete the attribute range index, the second variant slows down to 3+ seconds as well, and recreating the index makes it fast again. This makes me wonder how to read David's statement below that there is no way to use a custom index from plain XQuery. (BTW: the query returns 1267 XML documents, out of a possible 450000 documents with root element work, in a total database of 1.35 million documents.)
Update 2:
I got the 0.02-second figure wrong, but the query is still very fast in the Query Console. Of the three versions, the cts:search one seems a tiny bit faster.

M_breeb

1 Answer

You may have created an index, but you are not using it. You need to use an element-attribute-range-query to find all of the fragments that have dates in the range in question.

Something like:

cts:search(doc(), cts:element-attribute-range-query(xs:QName("work"), xs:QName("datum_gegenereerd"), ">", current-date() - 5 * xs:dayTimeDuration('P1D')))

BUT: if you really just want the URIs, then the element-attribute-range-query would be used with cts:uris (something like this, but check the docs):

cts:uris('', (), cts:element-attribute-range-query(xs:QName("work"), xs:QName("datum_gegenereerd"), ">", current-date() - 5 * xs:dayTimeDuration('P1D')))

The second one does everything in memory and just pulls the URIs from the URI lexicon that point to document fragments where the date query matches.
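For the "last week, last month" windows mentioned in the question, the same lexicon lookup can be wrapped in a small helper. This is only a sketch: local:uris-since is a hypothetical name, the ">=" operator and the exact window lengths are assumptions, and it relies on the attribute range index on work/@datum_gegenereerd (scalar type date) and on the URI lexicon being enabled.

```xquery
xquery version "1.0-ml";

(: Hypothetical helper: URIs of documents whose work/@datum_gegenereerd
   is on or after $since. Needs the date attribute range index and the
   URI lexicon. :)
declare function local:uris-since($since as xs:date)
{
  cts:uris((), (),
    cts:element-attribute-range-query(
      xs:QName("work"), xs:QName("datum_gegenereerd"), ">=", $since))
};

let $today := current-date()
return (
  (: last week: subtract a dayTimeDuration of 7 days :)
  <last-week>{count(local:uris-since($today - xs:dayTimeDuration("P7D")))}</last-week>,
  (: last month: subtract a yearMonthDuration of 1 month :)
  <last-month>{count(local:uris-since($today - xs:yearMonthDuration("P1M")))}</last-month>
)
```

Passing an xs:date value keeps the comparison in the same type as the index, so no cast is needed on the attribute side.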

  • David, thanks for the quick response. My second example (the one with xs:date removed) must be using an index because it is the same speed as your two solutions. Is this a different index than the range element attribute I've added to the database? – M_breeb Jan 21 '16 at 14:02
  • Every attribute and every element are indexed in the 'universal index'. Always. But for your examples, (something less than something else - a range), you need a range index for most efficiently asking those questions. I am quite surprised that the solution I provided (with a range index and a cts:element-attribute-range-query) only gives results at .02 seconds. How many documents are returned? Also, you would have had more overhead on the returning of the URI as well, so I would have expected even more speed gain with cts:uris – David Ennis -CleverLlamas.com Jan 21 '16 at 14:09
  • Thanks again. And I never considered the cts:uris function. Very useful to have available. – M_breeb Jan 21 '16 at 14:14
  • This could be useful to have around as well: http://docs.marklogic.com/guide/performance/order_by#id_59622 – grtjn Jan 21 '16 at 14:30
  • Yes, Geert brings up a good point for future reference as well. The order by clause is also greatly affected by range indexes. But in general, getting to know the performance guide at a high level as a start will make a big difference. – David Ennis -CleverLlamas.com Jan 21 '16 at 14:35
  • @grtjn - can you look at update number 1 in the original post and explain how one would get almost the same results in either case (with and without the range index) using a FLWOR statement? My only guess is that the actual overhead is in the 1200 base-uri() calls and not in the query itself. Any ideas? – David Ennis -CleverLlamas.com Jan 21 '16 at 16:31
  • Cold versus warm cache could easily pollute the results. To rule that out, you should restart ML between test runs. Pulling up the docs takes most of the time. With a warm cache that time is pretty much nullified. – grtjn Jan 21 '16 at 19:36
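
To compare the variants under the same cache conditions, one rough option is to have each query report its own elapsed time. The sketch below times the cts:uris variant and assumes the same attribute range index and URI lexicon as above; the five-day window is taken from the question. It only measures a single run, so the cold versus warm cache caveat still applies.

```xquery
xquery version "1.0-ml";

(: Count the matching URIs from the lexicon, then report how long
   this request has been running so far. :)
let $since := current-date() - xs:dayTimeDuration("P5D")
let $n := count(
  cts:uris((), (),
    cts:element-attribute-range-query(
      xs:QName("work"), xs:QName("datum_gegenereerd"), ">", $since)))
return
  <timing hits="{$n}" elapsed="{xdmp:elapsed-time()}"/>
```

Swapping the body of the count() for the FLWOR or cts:search variant gives a like-for-like comparison, provided the runs start from the same cache state.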