cursorMark doesn't affect the search itself - the search is still performed exactly as it always is. cursorMark isn't an index, and it has no bearing on how matching or scoring is done; it's a strategy for paginating efficiently through large result sets. That also makes your second question moot, since nothing about the actual search changes.
Why cursorMark solves the deep-pagination problem becomes apparent when you consider a cluster of Solr servers, such as when running in SolrCloud mode.
Let's say you have four servers, A, B, C, and D, and want to retrieve 10 documents starting from row number 400 (we'll assume that one server == one shard of a larger collection to make this easier).
In the regular case, you start by retrieving documents from every node - in sorted order, since each node sorts its result set according to your query, which is no different from any regular query, as the results have to be sorted locally anyway - and then merging:
- 410 documents from server A
- 410 documents from server B
- 410 documents from server C
- 410 documents from server D
You now have to go through 1640 documents to find out what your actual result set will be, as it could be that the 10 documents you're looking for all live on server C. Or maybe 350 of them are on server B and the rest on server D. It's impossible to say without actually retrieving 410 documents from each server. The result sets are then merged and sorted until 400 documents have been skipped and 10 documents have been found.
Now say you want 10 documents starting from row 1 000 000 - you'll have to retrieve 1 000 010 documents from each server, then merge and sort your way through a result set of 4 000 040 documents. You can see this getting more and more expensive as the number of servers and documents grows, just to move the starting point forward by 10 documents.
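To make that cost concrete, here's what regular deep paging looks like from the client side - a rough Python sketch against a hypothetical local collection named mycollection, assuming the uniqueKey field is called id (both are placeholders). The request itself is tiny, but every shard behind it has to collect start + rows sorted entries before the coordinator can merge:

```python
import requests

# Hypothetical local Solr collection; adjust host/collection/field to your setup.
SOLR = "http://localhost:8983/solr/mycollection/select"

start, rows = 1_000_000, 10
resp = requests.get(SOLR, params={
    "q": "*:*",
    "sort": "id asc",
    "start": start,   # each shard must internally collect start + rows = 1 000 010 entries
    "rows": rows,
    "wt": "json",
}).json()

docs = resp["response"]["docs"]
# With four shards, the coordinator effectively has to consider
# 4 * (start + rows) = 4 000 040 entries just to hand back these 10 documents.
```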
Instead, let's assume that you know the global sort position you stopped at - meaning the lexical sort value of the last document returned. The first query, without a cursorMark, is the same as for regular pagination - get the first 10 documents from each server (since we're starting at the beginning of the result set, and not from position 400 as in the first example, we only need 10 from each server).
We process these 40 documents (a very manageable number), sort them, pick the first 10, and include the global sort key (the cursorMark) of the last document in the response. The client then includes this "global sort key" in its next request, which allows us to say "OK, we're not interested in any entries that would sort before this document, as we've already shown those". The next query would then retrieve:
- 10 documents from server A that would sort after cursorMark
- 10 documents from server B that would sort after cursorMark
- 10 documents from server C that would sort after cursorMark
- 10 documents from server D that would sort after cursorMark
Now we're still just returning 10 documents from each server, even if our cursorMark is a million rows deep into the pagination. Where we previously had to retrieve and handle 4 000 040 documents (each shard returns its set already sorted, but the coordinator still has to work through the merged sets, skip the first million entries and only then pick the next ten), we now only have to retrieve 40 documents and sort them locally to get the actual 10 documents to return.
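If it helps to see the moving parts, here's a minimal client-side sketch in Python (the collection name mycollection, the uniqueKey field id, and the per-document processing are placeholder assumptions): start with cursorMark=*, keep the uniqueKey in the sort, and feed the nextCursorMark from each response back into the following request until it stops changing:

```python
import requests

# Hypothetical local Solr collection; adjust host/collection/field to your setup.
SOLR = "http://localhost:8983/solr/mycollection/select"

cursor = "*"  # "*" means "start from the beginning of the result set"
while True:
    resp = requests.get(SOLR, params={
        "q": "*:*",
        "sort": "id asc",        # the uniqueKey must be part of the sort
        "rows": 10,
        "cursorMark": cursor,    # "skip everything that sorts before this point"
        "wt": "json",
    }).json()

    for doc in resp["response"]["docs"]:
        print(doc["id"])         # placeholder for whatever processing you need

    next_cursor = resp["nextCursorMark"]
    if next_cursor == cursor:    # the cursor stopped moving: we've seen everything
        break
    cursor = next_cursor
```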
To further explain how a cursorMark could work, let's assume that this approach only works on a unique field with an integer value (that makes it easier to show what the cursorMark represents internally, and why the uniqueKey has to be present in the sort: if the uniqueKey weren't included, we could randomly end up with missing documents whenever the cursorMark landed on a document that shares its sort value with other documents):

     A       B       C       D
     1       2       3       4
     8       7       6       5
     9      10      11      12
    13      14      15      16
    17      20      21      22
    18      23      25      27
    19      24      26      28

We make a request for 4 values (rows=4), starting from cursorMark 7. Each server can then look at its result set (which is sorted internally, as all result sets are) and retrieve 4 values starting from the value that comes after 7 in the sorted order (< marks the first value after the cursorMark, i.e. the first document the node includes, and + marks the other documents included in the result set returned from the node):

     A       B       C       D
     1       2       3       4
     8 <     7       6       5
     9 +    10 <    11 <    12 <
    13 +    14 +    15 +    16 +
    17 +    20 +    21 +    22 +
    18      23 +    25 +    27 +
    19      24      26      28

We then iterate through each result set returned, until we've picked four documents from the top:
- 8 (from A)
- 9 (from A)
- 10 (from B)
- 11 (from C)
And we include the cursorMark of the last document: 11. The next request is then made with 11 as the cursorMark, meaning that each server can then return 4 documents starting from the entry after 11 instead:

     A       B       C       D
     1       2       3       4
     8       7       6       5
     9      10      11      12 <
    13 <    14 <    15 <    16 +
    17 +    20 +    21 +    22 +
    18 +    23 +    25 +    27 +
    19 +    24 +    26      28

And then we perform the merge again, picking the first 4 entries in sorted order (12, 13, 14 and 15), and include the next cursorMark (15).
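To tie the worked example together, here's a small toy simulation in Python - just the idea sketched above, not how Solr is implemented internally - where each "shard" hands back the next rows values that sort after the cursor and the coordinator merges them; it reproduces both pages from the tables:

```python
# Toy simulation of the integer example above (not Solr's actual implementation):
# each "shard" holds a sorted list of values and, given a cursor, returns the next
# `rows` values that sort after it; the coordinator merges and keeps the global top.

SHARDS = {
    "A": [1, 8, 9, 13, 17, 18, 19],
    "B": [2, 7, 10, 14, 20, 23, 24],
    "C": [3, 6, 11, 15, 21, 25, 26],
    "D": [4, 5, 12, 16, 22, 27, 28],
}

def shard_fetch(values, cursor, rows):
    """Return up to `rows` values that sort strictly after `cursor`."""
    return [v for v in values if v > cursor][:rows]

def page(cursor, rows=4):
    # Each shard only ever returns `rows` documents, no matter how deep we are.
    candidates = []
    for name, values in SHARDS.items():
        candidates += [(v, name) for v in shard_fetch(values, cursor, rows)]
    top = sorted(candidates)[:rows]   # merge and keep the global top `rows`
    next_cursor = top[-1][0]          # the cursorMark for the next request
    return top, next_cursor

top, cursor = page(cursor=7)
print(top, cursor)   # [(8, 'A'), (9, 'A'), (10, 'B'), (11, 'C')] 11
top, cursor = page(cursor=cursor)
print(top, cursor)   # [(12, 'D'), (13, 'A'), (14, 'B'), (15, 'C')] 15
```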
.. and that answers the third question: it doesn't need to keep track of any global state, just where in the gigantic dataset the next set of results to return starts.