Difference(s) between Solr's Cursor and ElasticSearch's Scroll

Question

While looking for pagination with Solr and ElasticSearch, it turned out, both have the same "problem" (deep pagination, especially with shards). Though both search engines provide a solution/workaround for that:

Solr: cursor https://cwiki.apache.org/confluence/display/solr/Pagination+of+Results
ElasticSearch: scroll http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-request-scroll.html#scroll-search-context

Now I read those pages and searched the internet, but I'm still a bit clueless at some points:

cursor / scroll timeouts (garbage collection):
1. Solr documentations doesn't seem to provide a way for setting a timeout (or some special query to invalidate a cursor token). That's basically just a question about possible memory leaks, etc.
2. ElasticSearch provides a timeout setting via scroll=1m.
backwards pagination:
1. Solr will provide a cursor token for each request, so it is possible to access any previous page.
2. ElasticSearch seems to use always the same scroll token. So I cannot go backwards without doing a new search?
Alter search query:
1. ElasticSearch explicitly requires to use a special URL for scroll queries ( http://localhost:9200/_search/scroll?scroll=1m?scroll_id=...). So there's no possibility to alter the search query.
2. Solr appends the cursor token to the normal query. Does this mean, that I can use some cursor token and change the query (filters, ordering, page size, etc.)?
Index changes while using scroll / cursor:
1. Solr documentation says, that if the sort value of document 1 changed so that it is after the cursor position, the document is returned to the client twice. That's clear to me. But now there are two more questions, which don't get covered:
  1. What happens if I use the cursor token for page 2 (where document 1 was before the sort value change)? Will I see the old items (including document 1) or will I see a new generated page with freshly calculated documents?
  2. Basically the same question as before: Solr documentation says: the sort value of document 17 changed so that it is before the cursor position, the document has been "skipped" and will not be returned to the client as the cursor continues to progress. If I use an old cursor token, will I be able to retrieve document 17? Or is it gone forever when using the current cursor token sequence?
2. ElasticSearch documentation says nothing about what happens if the index changes while using scroll. I could imagine that it behaves the same as Solr, because both use Lucene for that functionality. But I'm completely unsure, because there's no information about that scenario.
How can this be faster than simple size=10&from=10 / rows=5&start=0?
More kinda technical question, just because I'd like to understand what happens under the hood.
- I just wondered how (especially) Solr can do this cursor thing more efficient than normal pagination using start and rows. Reason: (as said above) If a document changes, it will get reindex and can be placed after/before the current cursor. That sounds to me, like it has to reorder all documents. And that's basically the same as the default pagination!?

EDIT:

ElasticSearch documentation says "A scrolled search takes a snapshot in time — it doesn’t see any changes that are made to the index after the initial search request has been made. It does this by keeping the old datafiles around, so that it can preserve its “view” on what the index looked like at the time it started." So there's still the question: How does Solr handle this?

Would be cool, if someone could give me some explanation how things work.

Thanks in advance! :)

This "question" seems to be the best comparison of these features so far. It will be nice to get more answers/comments on this. — alisa, Mar 05 '16 at 22:50
"How does Solr handle this?" : SOLR does not take a snapshot at all, so new documents that would be returned AFTER the current cursor will be included in the results (even if the document is a duplicate of an earlier one already returned), new documents that would be returned BEFORE the position will not be returned at all. See https://cwiki.apache.org/confluence/display/solr/Pagination+of+Results — Hugo Zaragoza, Jul 14 '16 at 18:40

score 4 · Accepted Answer · answered Apr 08 '17 at 14:21

Solr's cursor and start both function like open-ended range queries, with cursor operating like a less-than range query on score and start operating like a greater-than range query on rank. cursor is faster (especially for deep pagination) because, for a page size of 10, it only needs to hold in memory and sort at most the top 10 results, whereas start=N must hold in memory and sort the top N + 10 results, where N increases by 10 for each subsequent page. Both are sensitive to index modifications during pagination because each query runs against the current state of the index.

Elasticsearch's scroll functions like a single-use forward-only linear scan through a snapshot of the results of a fixed query which is guaranteed to return each document exactly once. It is not affected by index modifications because Elasticsearch remembers all the documents associated with the index at the time the "scroll context" was created by preserving the containing immutable segment files while the scroll context is alive. To avoid accumulating a stockpile of old segment files referred to by scroll contexts that will never be used again (perhaps because the client crashed), scroll contexts expire after a specified duration of time. My guess is that Elasticsearch supports neither jumping to arbitrary pages nor altering the query in order to optimize for scrolling efficiency.

You can partially emulate the behavior of Solr's cursor in Elasticsearch using an open-ended range query in which the upper/lower bound is set to the last value of the previous batch of results.

Following Chris's answer about Elasticsearch: there is a `search_after` option you can add to searches: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-body.html#request-body-search-search-after . It works exactly in that direction (plus: not locking the index's segments). — thePanz, Oct 28 '19 at 10:04

Difference(s) between Solr's Cursor and ElasticSearch's Scroll

EDIT:

1 Answers1