I'm using Elasticsearch 7.6.0 and have paginated one of my queries. It seems to work well: I can vary the number of results per page and the selected page using the search `size` and `from` parameters.

    from elasticsearch import Elasticsearch

    es = Elasticsearch()

    query = 'sample query'
    items_per_page = 12
    page = 0

    es_query = {
        'query': {
            'bool': {
                'must': [{
                    'multi_match': {
                        'query': query,
                        'fuzziness': 'AUTO',
                        'operator': 'and',
                        'fields': ['title^2', 'description']
                    }
                }]
            }
        },
        'min_score': 5.0
    }

    # size = results per page, from_ = offset into the full result set
    res = es.search(index='my-index', body=es_query, size=items_per_page, from_=items_per_page * page)
    hits = sorted(res['hits']['hits'], key=lambda x: x['_score'], reverse=True)

    print(res['hits']['total']['value']) # This changes depending on the page provided

I've noticed that the number of results returned depends on the page provided, which makes no sense to me! The number also oscillates, which confuses me further: page 0, 233 items; page 1, 157 items; page 2, 157 items; page 3, 233 items...

Why does `res['hits']['total']['value']` depend on the `size` and `from` parameters?

David Ferris

3 Answers


The search is distributed: it is sent to every node holding shards of the searched indices, and the per-shard results are then merged and returned. Sometimes not all shards can be searched. This happens when:

  • the cluster is very busy
  • a specific shard is not available due to an ongoing recovery process
  • the search has been optimized and the shard has been omitted

In the response, there is a `_shards` section like this:

    {
        "took": 1,
        "timed_out": false,
        "_shards": {
            "total": 1,
            "successful": 1,
            "skipped": 0,
            "failed": 0
        },
        "hits": {...}
    }

Check whether `failed` (or `skipped`) is anything other than 0. If so, check the logs and the cluster and index status.
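
For example, a minimal sketch that flags partial results on every request, reusing the `es` client and `es_query` from the question:

    res = es.search(index='my-index', body=es_query, size=12, from_=0)

    # A failed or skipped shard means the reported totals may be incomplete
    shards = res['_shards']
    if shards['failed'] or shards['skipped']:
        print('partial results: %d of %d shards failed, %d skipped'
              % (shards['failed'], shards['total'], shards['skipped']))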

ibexit

https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-body.html#request-body-search-track-total-hits

Generally the total hit count can’t be computed accurately without visiting all matches, which is costly for queries that match lots of documents. The track_total_hits parameter allows you to control how the total number of hits should be tracked. Given that it is often enough to have a lower bound of the number of hits, such as "there are at least 10000 hits", the default is set to 10,000. This means that requests will count the total hits accurately up to 10,000. It’s a good trade-off to speed up searches if you don’t need the accurate number of hits after a certain threshold.

When set to true, the search response will always track the number of hits that match the query accurately (e.g. total.relation will always be equal to "eq" when track_total_hits is set to true). Otherwise, the "total.relation" returned in the "total" object in the search response determines how the "total.value" should be interpreted. A value of "gte" means that "total.value" is a lower bound of the total hits that match the query, and a value of "eq" indicates that "total.value" is the accurate count.
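
To act on this, set `track_total_hits` in the request body. A minimal sketch, reusing the `es` client and `es_query` from the question:

    # Request an exact total instead of the default 10,000 lower bound
    es_query['track_total_hits'] = True

    res = es.search(index='my-index', body=es_query, size=12, from_=0)
    total = res['hits']['total']
    print(total['value'], total['relation'])  # relation should now be 'eq'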

Alkis Kalogeris
    I appreciate the suggestion, but unfortunately this did not fix the problem. I am still seeing variation in `res['hits']['total']['value']` – David Ferris May 05 '20 at 01:52

`len(res['hits']['hits'])` will always return the same number as specified in `items_per_page` (i.e. 12 in your case), except for the last page, where it might return a number smaller than or equal to 12.

However, `res['hits']['total']['value']` is the total number of documents matching your query, not the number of results returned on the current page. If that number increases between calls, it means that new matching documents were indexed between the previous query and the current one.
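
A minimal sketch of the distinction, again reusing the `es` client and `es_query` from the question:

    res = es.search(index='my-index', body=es_query, size=12, from_=0)

    print(len(res['hits']['hits']))       # at most 12: the hits on this page
    print(res['hits']['total']['value'])  # matches across all pages, capped at 10,000 by default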

Val