MarkLogic, XQuery, pagination, lazy evaluation

Question

The MarkLogic documentation describes a fast pagination technique using unfiltered searching, somewhat similar to this:

let $uris := cts:uris((),(),
  cts:collection-query('fish')
  ) [1 to 10]
for $uri in $uris
let $fish := fn:doc($uri)
return <fish>
  { $fish/fish/variety }
  { $fish/fish/colour }
</fish>

In reality, the cts:uris() would have a much more complex search term.

Basically, the [1 to 10] controls the range of "rows" returned, and the following FLWOR is all about selecting the data to return.
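To make the role of the range predicate concrete, here is a minimal sketch of the same query parameterised by page. The $page, $page-size and $start variables are hypothetical names introduced for illustration; they are not part of the documented technique.

```xquery
xquery version "1.0-ml";

(: Hypothetical paging parameters, introduced only for this sketch. :)
let $page := 3
let $page-size := 10
(: First row of the requested page, 1-based. :)
let $start := ($page - 1) * $page-size + 1
let $uris := cts:uris((), (),
  cts:collection-query('fish')
  )[$start to $start + $page-size - 1]
for $uri in $uris
let $fish := fn:doc($uri)
return <fish>
  { $fish/fish/variety }
  { $fish/fish/colour }
</fish>
```

With these values the predicate becomes [21 to 30], so only the documents for that page are fetched by fn:doc().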

What if the results of the first search need to be joined with some other data and/or filtered, and only selected rows returned?

let $uris := cts:uris((),(),
  cts:collection-query('fish')
  )
for $uri in $uris
let $fish := fn:doc($uri)
let $pond := fn:doc($fish/fish/pond-uri/text())
where $fish/fish/variety = ('koi','goldfish')
  and $pond/pond/type/text() = ('lilypond','gardenpond')
return <fishandpond>
  { $fish/fish/variety }
  { $pond/pond/type }
</fishandpond>

Again, I want the first 10 results. Clearly I can't constrain the let $uris := step, as we don't know how many URIs need to be examined to be sure of getting at least 10 results out of the FLWOR that follows.

Refactoring like this:

let $uris := cts:uris((),(),
  cts:collection-query('fish')
  )
let $urisFiltered := for $uri in $uris
let $fish := fn:doc($uri)
let $pond := fn:doc($fish/fish/pond-uri/text())
where $fish/fish/variety = ('koi','goldfish')
  and $pond/pond/type/text() = ('lilypond','gardenpond')
return <fishandpond>
  { $fish/fish/variety }
  { $pond/pond/type }
</fishandpond>

return $urisFiltered[1 to 10]

This does produce 10 results, but MarkLogic appears to compute the full set of URIs and then filter all of them. It does not evaluate lazily and stop once it has 10 results, even when that would mean working through only the first 15 or so elements of $uris.

I say this because if I add xdmp:sleep(1) into the loop, the query delays by an amount related to the total number of fish in the database, not the number required in the final result set.

For my next attempt, I tried using the XCC/J interface and using Request.setCount(10) to indicate that I only care about the first 10 results. Again, I get 10 results, but all indications are that it isn't executing lazily and is actually finding all fish and filtering.

So, my question is:

Is there a known coding pattern that can achieve efficient paginated (or even just first N results) searches, when documents need to be joined and/or filtered, after an initial cts:uris() or cts:search() step?

And as a supplementary question: is there a good summary of when MarkLogic does behave in a lazy fashion, and when it doesn't?

Andy

wst · Answer 1 · 2015-02-27T21:06:00.407

When you call a function (e.g. xdmp:sleep()) in a loop like that, you may be short-circuiting the optimizations the evaluator uses to keep the expression lazy and/or resolve it from indexes alone.

Generally, it tries to be lazy whenever it can, and the optimizer sometimes improves version to version. Avoiding function calls (even fn:text()) in predicates or large loops, and not using * are good rules. But you are best off using xdmp:plan to see exactly what's going on under the hood.
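As a minimal sketch of that last suggestion, xdmp:plan can be wrapped around a search expression to show how it would be resolved. The query used here (collection 'fish', element variety) is taken from the question's data and is only an assumption about what you would want to inspect.

```xquery
xquery version "1.0-ml";

(: Returns an XML description of the plan for the search
   instead of running the search itself. :)
xdmp:plan(
  cts:search(//fish,
    cts:and-query((
      cts:collection-query('fish'),
      cts:element-value-query(xs:QName('variety'), ('koi', 'goldfish'))
    ))
  )
)
```

The output lists the index lookups chosen for each part of the query, which helps show whether a constraint is being answered from indexes or left to filtering.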

If it can be avoided, it's typically better not to join using URIs, or not join at all. But here's another way to approach this that stops evaluation early and should have fewer stack-related issues:

(for $fish in cts:search(//fish,
  cts:and-query((
    cts:collection-query('fish'),
    cts:element-value-query(xs:QName('variety'), ('koi', 'goldfish'))
    ))
  )
let $pond-type := fn:doc($fish/pond-uri)/pond/type
where $pond-type = ('lilypond','gardenpond')
return 
  element fishandpond {
    $fish/variety,
    $pond-type
  })[1 to 10]

However, ideally for MarkLogic, you would denormalize this part of your data model and avoid the join altogether. Sometimes joins are necessary, but sometimes they're just a mindset from relational modeling. Just be aware that by frequently calling doc() you run the risk of heavy IO.

If you must join, and performance is important, you can use a query pattern called a "scatter query" that relies on range indexes. There is a great explanation and example of this in the Inside MarkLogic Server whitepaper.
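For illustration only, here is one index-driven way to express the join. It is not necessarily the pattern the whitepaper describes, and it uses value queries plus the URI lexicon rather than range indexes. It assumes that only pond documents contain a type element and that pond-uri holds the exact URI of the pond document.

```xquery
xquery version "1.0-ml";

(: Step 1: resolve the matching ponds to their URIs from indexes only.
   Assumption: 'type' appears only in pond documents. :)
let $pond-uris := cts:uris((), (),
  cts:element-value-query(xs:QName('type'), ('lilypond', 'gardenpond')))
(: Step 2: constrain the fish search by those URIs, so the join is
   resolved before any document is read. :)
return
  cts:search(//fish,
    cts:and-query((
      cts:collection-query('fish'),
      cts:element-value-query(xs:QName('variety'), ('koi', 'goldfish')),
      cts:element-value-query(xs:QName('pond-uri'), $pond-uris)
    )),
    'unfiltered'
  )[1 to 10]
```

This only pays off if the list of matching pond URIs stays reasonably small, since every URI becomes a term in the second query.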

Yes, the variety test could be moved. The avoiding function calls comment is more interesting, ie: was it the removal of /text() that makes a difference? This solution still includes a function call, fn:doc() so why isn't laziness defeated here too? I'll look for "scatter query" paper. — Andy Key, Feb 27 '15 at 21:28
@AndyKey Your question hit on a few complex topics, so please understand it's hard to explain fully in the scope of an SO answer. For data modeling best practices definitely read the whitepaper and relevant articles at docs.marklogic.com. The function call suggestion is for large loops or XPath like /path/to/somewhere[myfunc(.)=true()] - the function will be evaluated for every path in the database. If this expression were "searchable" (again, see ML docs), the optimizer will rewrite it into a query that uses indexes and (generally) avoids filtering and IO. xdmp:query-trace() is a great tool. — wst, Feb 28 '15 at 15:32

score 0 · Answer 2 · answered Feb 27 '15 at 20:40

If you only need to get 10 results, you can iterate over the URIs by moving the body of the FLWOR expression into a function that calls itself recursively until it has accumulated 10 results.

If you were getting a huge number of results, you would want to take a different approach to avoid blowing out the stack.
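A minimal sketch of the recursive approach, assuming the document shapes from the question. local:first-n and its parameter names are hypothetical, and the recursion depth grows with the number of URIs examined, which is where the stack concern above comes from.

```xquery
xquery version "1.0-ml";

(: Hypothetical helper: examines one URI per call and stops as soon as
   $wanted results have been accumulated or the URIs run out. :)
declare function local:first-n(
  $uris as xs:string*,
  $wanted as xs:integer,
  $found as element(fishandpond)*
) as element(fishandpond)*
{
  if (fn:count($found) ge $wanted or fn:empty($uris)) then $found
  else
    let $fish := fn:doc($uris[1])
    let $pond := fn:doc($fish/fish/pond-uri)
    (: Empty sequence when the fish or its pond does not qualify. :)
    let $match :=
      if ($fish/fish/variety = ('koi', 'goldfish')
          and $pond/pond/type = ('lilypond', 'gardenpond'))
      then <fishandpond>{ $fish/fish/variety, $pond/pond/type }</fishandpond>
      else ()
    return local:first-n(fn:subsequence($uris, 2), $wanted, ($found, $match))
};

local:first-n(cts:uris((), (), cts:collection-query('fish')), 10, ())
```

The point of the sketch is that fn:doc() is called only for the URIs examined before the tenth match, rather than for every fish in the collection.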
