
This question has been asked before but I'm not satisfied and I need further elaboration.

These are the basic premises:

  1. cursorMark is stateless. It doesn't store anything on the server
  2. cursorMark is a computed value that tells whether to skip a document or not

I'm not sure if I understood it properly but that's how I read the given explanations.

My questions are:

  1. If cursorMark is meant to know which documents to skip, then how is this not a search? It basically is a search: running through a sequence of documents and asking the question "is this what I'm looking for, or do I need to skip it?"

  2. Still related to the first question, how is this "sequence of documents" calculated? Isn't that stored in memory? Or does cursorMark create a temporary file stored on disk?

  3. How does it calculate the next cursorMark if it doesn't remember about the entire universe of results?

All I can see is that there's no escape.

Either you store some state about your search results, or you perform a search for each page request.

Related references:

cursorMark is stateless and how it solves deep paging

How much time does the cursorMark is available on solr server

http://yonik.com/solr/paging-and-deep-paging/

supertonsky

1 Answer


cursorMark doesn't affect the search - the search is still performed as it always is. cursorMark isn't an index, and it isn't relevant to how the actual search is performed; it's a strategy for paginating efficiently through large result sets. This also means that your second question becomes moot, as cursorMark doesn't change anything about how the actual search is performed.

The reason why cursorMark solves deep pagination becomes apparent when you consider the case for a cluster of Solr servers, such as when running in SolrCloud mode.

Let's say you have four servers, A, B, C, and D, and want to retrieve 10 documents starting from row number 400 (we'll assume that one server == one shard of a larger collection to make this easier).

In the regular case, you'll have to start by retrieving documents in sorted order from each node (each node sorts its result set according to your query, just as for any regular query - the result set has to be sorted locally anyway), and then merging:

  • 410 documents from server A
  • 410 documents from server B
  • 410 documents from server C
  • 410 documents from server D

You now have to go through 1640 documents to find out what your actual result set will be, as it could just be that the 10 documents you're looking for all live on server C. Or maybe 350 on server B and the rest on server D. It's impossible to say without actually retrieving 410 documents from each server. The result sets are then merged and sorted until 400 documents have been skipped and 10 documents have been found.

Now say you want 10 documents starting from row 1 000 000 - you'll have to retrieve 1 000 010 documents from each server, and merge and sort through a result set of 4 000 040 documents. You can see this becoming more and more expensive as the number of servers and documents increases, just to move the starting point forward by 10 documents.
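To put numbers on it, here is a small illustrative sketch (the function names are made up; the formulas just restate the counts above):

```python
# Documents that must be retrieved and merged across n_shards servers
# to serve one page of `rows` documents starting at offset `start`.

def regular_paging_cost(n_shards, start, rows):
    # Every shard must return start + rows documents, all of which
    # the coordinator has to merge and sort through.
    return n_shards * (start + rows)

def cursor_mark_cost(n_shards, rows):
    # With a cursorMark, each shard returns only `rows` documents
    # that sort after the mark, no matter how deep we are.
    return n_shards * rows

print(regular_paging_cost(4, 400, 10))        # 1640
print(regular_paging_cost(4, 1_000_000, 10))  # 4000040
print(cursor_mark_cost(4, 10))                # 40
```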

Instead, let's assume that you know the global sort position of the last document returned - its lexical sort value. The first query, without a cursorMark, will be the same as for regular pagination: get the first 10 documents from each server. Since we're starting at the beginning of the result set, and not at position 400 as in the first example, we only need 10 from each server.

We process these 40 documents (a very manageable size), sort them, and keep the first 10 documents, then include the global sort key (the cursorMark) of the last document. The client includes this "global sort key" in the next request, which allows us to say "OK, we're not interested in any entries that would sort in front of this document, as we've already shown those". The next query would then do:

  • 10 documents from server A, that would sort after cursorMark
  • 10 documents from server B, that would sort after cursorMark
  • 10 documents from server C, that would sort after cursorMark
  • 10 documents from server D, that would sort after cursorMark

Now we're still just returning 10 documents from each server, even if our cursorMark is a million rows deep into the pagination. Where we previously had to retrieve and process 4 000 040 documents (each node returns its results sorted, but the coordinator still has to walk through the merged sets, skip the first million entries, and then pick the next ten), we now only have to retrieve 40 documents and sort them locally to get the actual 10 documents to return.
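In an actual Solr request the flow looks roughly like this (the collection and field names are made-up examples; what is real is that the first request passes cursorMark=*, the sort must include the uniqueKey field as a tie-breaker, and every response carries a nextCursorMark value to send back unchanged):

```
# First page - cursorMark=* means "start from the beginning"
/solr/example/select?q=*:*&rows=10&sort=price+asc,id+asc&cursorMark=*

# The response contains a "nextCursorMark" value (an opaque string).
# Send it back as-is to fetch the following page:
/solr/example/select?q=*:*&rows=10&sort=price+asc,id+asc&cursorMark=<nextCursorMark from previous response>
```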

To further explain how a cursorMark could work, let's assume that the sort field is a unique column with an integer value, since that makes it easier to show what the cursorMark represents internally. This is also why the uniqueKey has to be present in the sort: if it weren't, documents could randomly go missing whenever the cursorMark landed on a document that shares its sort value with other documents:

A    B    C    D
1    2    3    4
8    7    6    5
9    10   11   12
13   14   15   16
17   20   21   22
18   23   25   27
19   24   26   28

We make a request for 4 values (rows=4), starting from cursorMark 7. Each server can then look at its result set (which is sorted internally, as all result sets are), and retrieve 4 values starting from the value that comes after 7 in the sorted order (< marks the first value after the cursorMark, + marks the other documents included in the result set returned from that node):

A    B    C    D
1    2    3    4
8  < 7    6    5
9  + 10 < 11 < 12 <
13 + 14 + 15 + 16 +
17 + 20 + 21 + 22 +
18   23 + 25 + 27 +
19   24   26   28

We then iterate through each result set returned, until we've picked four documents from the top:

8 (from A)
9 (from A)
10 (from B)
11 (from C)

And we include the cursorMark of the last document: 11. The next request is then made with 11 as the cursorMark, meaning that each server can then return 4 documents starting from the entry after 11 instead:

A    B    C    D
1    2    3    4
8    7    6    5
9    10   11   12 <
13 < 14 < 15 < 16 +
17 + 20 + 21 + 22 +
18 + 23 + 25 + 27 +
19 + 24 + 26 + 28

And then we perform the merge again, picking the first 4 entries in sorted order, and include the next cursorMark.
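The two requests above can be simulated with a short sketch (hypothetical Python, not Solr internals - each shard simply filters its own sorted list, and the coordinator merges the small per-shard results):

```python
import heapq
from itertools import islice

# The four servers from the example, each with its documents pre-sorted.
SHARDS = {
    "A": [1, 8, 9, 13, 17, 18, 19],
    "B": [2, 7, 10, 14, 20, 23, 24],
    "C": [3, 6, 11, 15, 21, 25, 26],
    "D": [4, 5, 12, 16, 22, 27, 28],
}

def cursor_page(shards, cursor_mark, rows):
    """Return the next `rows` values after `cursor_mark` and the new mark."""
    # Each shard contributes at most `rows` values sorting after the mark...
    per_shard = [[v for v in docs if v > cursor_mark][:rows]
                 for docs in shards.values()]
    # ...and the coordinator merges those small sorted lists, keeping
    # only the first `rows` entries of the merged order.
    page = list(islice(heapq.merge(*per_shard), rows))
    return page, page[-1]

page, mark = cursor_page(SHARDS, 7, 4)
print(page, mark)   # [8, 9, 10, 11] 11
page, mark = cursor_page(SHARDS, mark, 4)
print(page, mark)   # [12, 13, 14, 15] 15
```

Note that each call only ever materialises 16 values (4 per shard), no matter how deep the mark is into the data - which is the whole point.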

.. and that answers the third question: Solr doesn't need to know the global state - just where the next result it needs to return sits in the sort order of the gigantic dataset.

MatsLindh
  • About the result set, is it temporarily created for every new unique search? How long does it live? Does it span across page requests? Does Solr cache the result set to serve other page requests? – supertonsky Sep 23 '16 at 04:21
  • A result set needs to be created for a search, yes. There are [several caches in Solr that speed up the creation of a result set](https://wiki.apache.org/solr/SolrCaching), and one of them is the QueryResultCache - which, as the name says, keeps the N top results from a query. A cache entry lives until it expires, which can occur because of new documents in the index (which can match the query) or eviction because the result isn't being used (caches are by default LRU, so the most used entries are kept), and yes, they live across requests (otherwise they wouldn't be very useful). – MatsLindh Sep 23 '16 at 09:22
  • In addition, Lucene has a set of caches on a lower level as well, but these aren't (at least not usually) tunable from Solr. – MatsLindh Sep 23 '16 at 09:22
  • So actually, in effect, cursorMarks do create some state on the server. It may be stateless by itself, but the result set it is based on is not. So to sum it up, Solr has to use memory (for the result set) in order to solve deep pagination. Is this correct? – supertonsky Sep 23 '16 at 11:54
  • No, cursorMarks do not create any state on the server. They simply tell Solr where to start in a result set at a specific node, without having to retrieve the complete result set from each node. Solr (and any system) has to compute its result set in some way, and will use memory to do so. And if the result is not cached, it will have to be recomputed. cursorMarks do _not_ affect this. cursorMarks are only relevant for _pagination_, not for how a result set is obtained. cursorMarks do not require any state to be kept on the server regarding pagination or the result set itself. – MatsLindh Sep 23 '16 at 15:11
  • To be able to find out which documents belong in a result set and their score, Solr will have to use memory and computation resources. But that does not really depend on pagination, and does not affect how and why cursorMarks are useful. cursorMarks are an optimization of a problem that occurs with deep pagination and a collection spread across many nodes. – MatsLindh Sep 23 '16 at 15:15
  • That's how I understood it. Of course the result set doesn't depend on pagination; it's the other way around - pagination depends on the result set. I really just wanted it clarified explicitly whether solving deep pagination requires Solr to cache (use memory for) the result set or not. None of the articles I've read clarified this matter, as if Solr solved the problem without any cost because of cursorMarks. – supertonsky Sep 26 '16 at 03:31
  • "solving deep pagination" does not require additional memory compared to what retrieving a regular, single result set would require. Whether the result set is (completely) recomputed or cached is a separate issue from cursorMarks and deep pagination. – MatsLindh Sep 26 '16 at 09:12