75

Example query:

GET hostname:port/myIndex/_search
{
    "size": 10000,
    "query": {
        "term": { "field": "myField" }
    }
}

I have been using the size option knowing that:

index.max_result_window = 100000

But if my query matches, say, 650,000 documents or even more, how can I retrieve all of the results in one GET?

I have been reading about the scroll API, from/size, and pagination, but none of them ever delivers more than 10K.

This is the example from the Elasticsearch forum that I have been using:

GET /_search?scroll=1m

Can anybody provide an example where you can retrieve all the documents for a GET search query?

– Franco

13 Answers

71

Scroll is the way to go if you want to retrieve a high number of documents, high in the sense that it's way over the 10000 default limit, which can be raised.

The first request needs to specify the query you want to make and the scroll parameter with a duration before the search context times out (1 minute in the example below):

POST /index/type/_search?scroll=1m
{
    "size": 1000,
    "query": {
        "match" : {
            "title" : "elasticsearch"
        }
    }
}

In the response to that first call, you get a _scroll_id that you need to use to make the second call:

POST /_search/scroll 
{
    "scroll" : "1m", 
    "scroll_id" : "DXF1ZXJ5QW5kRmV0Y2gBAAAAAAAAAD4WYm9laVYtZndUQlNsdDcwakFMNjU1QQ==" 
}

In each subsequent response, you'll get a new _scroll_id that you need to use for the next call until you've retrieved the amount of documents you need.

So in pseudo code it looks somewhat like this:

# first request
response = request('POST /index/type/_search?scroll=1m', query_body)
docs = [ ...response.hits ]
scroll_id = response._scroll_id

# subsequent requests: keep going until a page comes back empty
while (response.hits.length > 0) {
   response = request('POST /_search/scroll', { scroll: '1m', scroll_id: scroll_id })
   docs.push(...response.hits)
   scroll_id = response._scroll_id
}
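
If it helps, here is a minimal runnable version of that loop in Python with requests (host, index, and query are placeholders, and no authentication is assumed):

import requests

ES = "http://localhost:9200"  # placeholder host

# first request: run the query and open the scroll context
resp = requests.post(
    f"{ES}/index/_search",
    params={"scroll": "1m"},
    json={"size": 1000, "query": {"match": {"title": "elasticsearch"}}},
).json()

docs = resp["hits"]["hits"]
scroll_id = resp["_scroll_id"]

# subsequent requests: keep scrolling until a page comes back empty
while True:
    resp = requests.post(
        f"{ES}/_search/scroll",
        json={"scroll": "1m", "scroll_id": scroll_id},
    ).json()
    hits = resp["hits"]["hits"]
    if not hits:
        break
    docs.extend(hits)
    scroll_id = resp["_scroll_id"]

# free the search context once done
requests.delete(f"{ES}/_search/scroll", json={"scroll_id": scroll_id})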

UPDATE:

Please refer to the following answer which is more accurate regarding the best solution for deep pagination: Elastic Search - Scroll behavior

– Val
  • Thanks Val. I'm not sure I can get this working with curl within PHP, unless I can parametrise the scroll id and know in advance how many docs I will have to retrieve. You see, I'm not using Sense or Kibana. I have to use Google Charts to do advanced aggregations, and I have to query Elastic to get two large sets of data, regex them, and store the results in arrays. The Elastic API can be very exotic. Do you think there is a simpler way to retrieve all data? Can the index max value be increased? Or is there any simpler way to use scroll ids? – Franco Jan 15 '17 at 08:31
  • 2
    You can definitely [increase the `index.max_result_window` value](https://www.elastic.co/guide/en/elasticsearch/reference/5.1/index-modules.html#dynamic-index-settings) but you'll run the risk of bringing down your cluster if you want to get your 650000 documents in one shot. – Val Jan 17 '17 at 04:14
  • Another possibility is to query ES from within a Google Script so it's easier to integrate the results with Google Charts – Val Jan 17 '17 at 04:17
  • Otherwise you can stay with curl and use [existing solutions](https://gist.github.com/cb372/4567f624894706c70e65) to scroll over your index. – Val Jan 17 '17 at 04:18
  • Hey @Val; I will test this asap and give you feedback. I apologize for the delay. I promise I will do this in the next 3-4 days max. – Franco Jan 18 '17 at 21:33
43

Note that from + size cannot be more than the index.max_result_window index setting, which defaults to 10,000.

https://www.elastic.co/guide/en/elasticsearch/reference/6.8/search-request-from-size.html

So you have two approaches here:

1. Add "track_total_hits": true to your query. This doesn't return more documents, but it makes the response report the real total hit count instead of capping it at 10,000 (see the sketch after the scroll example below).

GET index/_search
{
    "size":1,
    "track_total_hits": true
}

2. Use the scroll API to actually retrieve all the matching documents; note that with scroll you can't do from/size pagination in the ordinary way.

https://www.elastic.co/guide/en/elasticsearch/reference/6.8/search-request-scroll.html

for example:

POST /twitter/_search?scroll=1m
{
    "size": 100,
    "query": {
        "match" : {
            "title" : "elasticsearch"
        }
    }
}
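
For approach 1, a quick sketch of reading the real total in Python with requests (host and index are placeholders; in 7.x the reported total is an object with a value field):

import requests

resp = requests.get(
    "http://localhost:9200/index/_search",
    json={"size": 1, "track_total_hits": True},
).json()

# with track_total_hits enabled, this count is exact instead of capped at 10,000
print(resp["hits"]["total"]["value"])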
– Eran Peled
  • 2
    While this code may resolve the OP's issue, it is best to include an explanation as to how your code addresses the OP's issue. In this way, future visitors can learn from your post, and apply it to their own code. SO is not a coding service, but a resource for knowledge. Also, high quality, complete answers are more likely to be upvoted. These features, along with the requirement that all posts are self-contained, are some of the strengths of SO as a platform, that differentiates it from forums. You can edit to add additional info &/or to supplement your explanations with source documentation. – ysf Jun 21 '20 at 22:49
  • 1
    track_total_hits was the ticket for me. I don't want to get a large result window, but I did want to know how many hits there were. – Greg B Sep 16 '20 at 17:34
  • 1
    "track_total_hits" worked for me. Thanks! – Meysam Dec 29 '20 at 06:32
  • I don't get how "track_total_hits" can work or whats the point there? In the documentation linked to this approach it states that "Note that from + size can not be more than the index.max_result_window index setting which defaults to 10,000". This means, that the first approach cant be used to display more than 10k entries? – Asdf11 Feb 04 '22 at 08:55
16

Node.js scroll example using the elasticsearch client:

const elasticsearch = require('elasticsearch');
const elasticSearchClient = new elasticsearch.Client({ host: 'esURL' });

async function getAllData(query) {
  const result = await elasticSearchClient.search({
    index: '*',
    scroll: '10m',
    size: 10000,
    body: query,
  });

  const retriever = async ({
    data,
    total,
    scrollId,
  }) => {
    // stop once every hit has been collected
    // (in Elasticsearch 7+, hits.total is an object, so pass total.value)
    if (data.length >= total) {
      return data;
    }

    const result = await elasticSearchClient.scroll({
      scroll: '10m',
      scroll_id: scrollId,
    });

    data = [...data, ...result.hits.hits];

    return retriever({
      total,
      scrollId: result._scroll_id,
      data,
    });
  };

  return retriever({
    total: result.hits.total, // in Elasticsearch 7+: result.hits.total.value
    scrollId: result._scroll_id,
    data: result.hits.hits,
  });
}
– zooblin
  • There's an updated example [here](https://www.elastic.co/guide/en/elasticsearch/client/javascript-api/current/scroll_examples.html) using generator function – Chukwuma Nwaugha Jan 13 '21 at 11:49
7

Another option is the search_after parameter. Combined with a sort, you can save the sort values of your last returned element and then ask for results coming after that last element:

GET twitter/_search
{
    "size": 10,
    "query": {
        "match" : {
            "title" : "elasticsearch"
        }
    },
    "search_after": [1463538857, "654323"],
    "sort": [
        {"date": "asc"},
        {"_id": "desc"}
    ]
}

The values in "search_after" ([1463538857, "654323"] above) are the sort values (date and _id) of the last hit from the previous page. This worked for me, but getting more than 10,000 documents is still really not easy.
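
To illustrate where those values come from, here is a sketch of the paging loop in Python with requests (host, index, and query are placeholders). Each hit carries its own sort values, and the last hit's values feed the next request:

import requests

ES = "http://localhost:9200"  # placeholder host

body = {
    "size": 1000,
    "query": {"match": {"title": "elasticsearch"}},
    "sort": [{"date": "asc"}, {"_id": "desc"}],
}

docs = []
while True:
    page = requests.post(f"{ES}/twitter/_search", json=body).json()["hits"]["hits"]
    if not page:
        break
    docs.extend(page)
    # "sort" on each hit holds its sort values; pass the last one along
    body["search_after"] = page[-1]["sort"]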

– Dan
  • 2
    what is 1463538857 and "654323" – Kartikeya Mishra Feb 25 '20 at 15:47
  • 1
    same here: ' "search_after": [1463538857, "654323"]' how to get those array values? Any java example would really help a lot. thanks – user404 Jun 22 '20 at 08:46
  • 3
    This is the only correct answer that lets you scroll without limitations. You can increase your scroll window, but besides the fact that ES recommends against it (and that there's cost to the scrolling), it's always limited by the size of your scroll window. search_after does not come with that limitation, though one disadvantage to it is that you don't have positional data (so you can't 'calculate' what page you're on) – Eelco Mar 02 '21 at 19:27
  • Object []lastSortValues; for (SearchHit documentFields : response.getHits()) { lastSortValues = documentFields.getSortValues(); } – Maayan Hope Jan 26 '22 at 13:24
3

You can use scroll to retrieve more than 10,000 records. Below is an example Python function that implements scrolling.

self._elkUrl = "http://Hostname:9200/logstash-*/_search?scroll=1m"
self._scrollUrl="http://Hostname:9200/_search/scroll"
"""
Function to get the data from ELK through scrolling mechanism
"""
import logging
import pandas as pd
import requests
import sys


def GetDataFromELK(self):
    # implementing scroll and retrieving data from elk to get more than 100000 records at one search
    # ref :https://www.elastic.co/guide/en/elasticsearch/reference/6.8/search-request-scroll.html
    try:
        dataFrame = pd.DataFrame()
        if self._elkUrl is None:
            raise ValueError("_elkUrl is missing")
        if self._username is None:
            raise ValueError("_userNmae for elk is missing")
        if self._password is None:
            raise ValueError("_password for elk is missing")
        response = requests.post(self._elkUrl, json=self.body,
                                 auth=(self._username, self._password))
        response = response.json()
        if response is None:
            raise ValueError("response is missing")
        sid = response['_scroll_id']
        hits = response['hits']
        total = hits["total"]
        if total is None:
            raise ValueError("total hits from ELK is none")
        total_val = int(total['value'])
        url = self._scrollUrl
        if url is None:
            raise ValueError("scroll url is missing")
        # start scrolling 
        while (total_val > 0):
            # keep search context alive for 2m
            scroll = '2m'
            scroll_query = {"scroll": scroll, "scroll_id": sid}
            response1 = requests.post(url, json=scroll_query,
                                      auth=(self._username, self._password))
            response1 = response1.json()
            # The result from the above request includes a scroll_id, which should be passed to the scroll API in order to retrieve the next batch of results
            sid = response1['_scroll_id']
            hits = response1['hits']
            data = response1['hits']['hits']
            if len(data) > 0:
                cleanDataFrame = self.DataClean(data)
                # DataFrame.append was removed in pandas 2.0; use concat instead
                dataFrame = pd.concat([dataFrame, cleanDataFrame])
            total_val = len(response1['hits']['hits'])
        print('Total records received from ELK =', len(dataFrame))
        return dataFrame
    except Exception as e:
        logging.error('Error while getting the data from elk', exc_info=e)
        sys.exit()

– Ranjita Shetty
3

Raise the 10,000 limit by increasing index.max_result_window (set here for all indices; beware that very large result windows increase memory pressure on every query):

PUT _settings
{
  "index.max_result_window": 500000
}
– bereket gebredingle
1

I can suggest a better way to do this. I guess you're trying to get more than 10,000 records. Try the way below and you will get millions of records as well.

  1. Define your client.

    from elasticsearch import Elasticsearch
    client = Elasticsearch(['http://localhost:9200'])

  2. Create a search object (if you get an error with the Search function, this import is what's missing):

    from elasticsearch_dsl import Search
    s = Search(using=client)

  3. Check the total number of hits.

    results = s.execute()
    results.hits.total

  4. Write down your query.

    s = s.query(..write your query here...)

  5. Dump the data into a data frame with scan. Scan will dump all the data into your data frame even if it's in billions, so be careful.

    results_df = pd.DataFrame((d.to_dict() for d in s.scan()))

  6. Have a look at your data frame.

    results_df
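
Putting those steps together, a minimal end-to-end sketch (index and field names are placeholders):

import pandas as pd
from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search

client = Elasticsearch(['http://localhost:9200'])

# scan() drives the scroll API under the hood, so it isn't capped at 10,000 hits
s = Search(using=client, index='my-index').query('match', title='elasticsearch')
results_df = pd.DataFrame(d.to_dict() for d in s.scan())
print(len(results_df))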
    
– ak3191
  • 1
    @mairan It's working fine for me. Don't try to get all the data; I guess that is why it's crashing. You must be getting lots of data. First, check how many hits you are getting. Go through my Medium blog for a better understanding: https://medium.com/@abhimanyusingh_16119/getting-started-with-elasticsearch-in-python-1cf840549f90. Please accept the answer if it works. – ak3191 Sep 17 '18 at 12:52
1

When there are more than 10,000 results, one way to get the rest is to split your query into multiple, more refined queries with stricter filters, such that each one returns fewer than 10,000 results, and then combine the query results into your complete target result set (see the sketch below).

This 10,000-result limitation also applies to web services that are backed by an Elasticsearch index, and there's just no way around it; such a web service would have to be reimplemented without using Elasticsearch.
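
As a sketch of the splitting idea in Python with requests (host, index, and the numeric field used for slicing are all hypothetical), choose filter bounds so that each slice stays under 10,000 hits:

import requests

ES = "http://localhost:9200"  # placeholder host

all_hits = []
# hypothetical slices on a numeric field; pick bounds so each returns < 10,000 hits
for start, end in [(0, 1000), (1000, 2000), (2000, 3000)]:
    body = {
        "size": 10000,
        "query": {
            "bool": {
                "filter": [{"range": {"my_numeric_field": {"gte": start, "lt": end}}}]
            }
        },
    }
    resp = requests.post(f"{ES}/my-index/_search", json=body).json()
    all_hits.extend(resp["hits"]["hits"])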

– Tenusha Guruge
  • Hi @Tenusha. I am trying to get 100,000 (1 lakh) records from Elasticsearch through RestClient using a search query. I am sending one query now. Can you tell how I can split queries into multiple and get records so that it increases the performance as well? – Sachin HR Jun 27 '20 at 08:00
  • @SachinHR There is a way to retrieve more than 10000 records. But be aware of the consequences (ie memory). Refer to this (https://discuss.elastic.co/t/pulling-more-than-10000-records-from-elasticsearch-query/181000) Also refer https://www.quora.com/How-do-I-retrieve-more-than-10000-records-in-elastic-search – Tenusha Guruge Jun 29 '20 at 04:41
1

The scroll API has its own limitations. Recently, Elastic introduced a new feature: point in time (PIT).

Point in time

Basically, it takes a snapshot of the index at that time, and then you can use search_after to retrieve results beyond 10,000.
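
A minimal sketch of PIT plus search_after in Python with requests (assumes Elasticsearch 7.10+; host and index are placeholders):

import requests

ES = "http://localhost:9200"  # placeholder host

# open a point in time on the index
pit_id = requests.post(f"{ES}/my-index/_pit", params={"keep_alive": "1m"}).json()["id"]

docs = []
body = {
    "size": 10000,
    "query": {"match_all": {}},
    "pit": {"id": pit_id, "keep_alive": "1m"},
    "sort": [{"_shard_doc": "asc"}],  # cheap tiebreaker sort for PIT searches
}
while True:
    hits = requests.post(f"{ES}/_search", json=body).json()["hits"]["hits"]
    if not hits:
        break
    docs.extend(hits)
    body["search_after"] = hits[-1]["sort"]  # resume after the last hit

# close the point in time when finished
requests.delete(f"{ES}/_pit", json={"id": pit_id})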

– Sajjan Kumar
0

Look at the search_after documentation.

Example query as a hash in Ruby (after holds the sort values of the last hit from the previous page; omit search_after on the first request):

query = {
  size: query_size,
  query: {
    multi_match: {
      query: "black",
      fields: [ "description", "title", "information", "params" ]
    }
  },
  search_after: [after],
  sort: [ {id: "asc"} ]
}

  • is there any java implementation as example for search-after ? Also, in `search_after` field, what should be the value here? how to get that? – user404 Jun 22 '20 at 08:42
0

Here you go:

GET /_search
{
    "size": 10000,
    "query": {
        "match_all": { "boost": "1.0" }
    }
}

But we should mostly avoid this approach for retrieving a huge number of docs at once, as it increases data usage and overhead, and it still can't return more than the index.max_result_window limit.

0

For Node.js, starting in Elasticsearch v7.7.0, there is now a scroll helper!

Documentation here: https://www.elastic.co/guide/en/elasticsearch/client/javascript-api/7.x/client-helpers.html#_scroll_documents_helper

Otherwise, the main docs for the Scroll API have a good example to work off of: https://www.elastic.co/guide/en/elasticsearch/client/javascript-api/current/scroll_examples.html

0
  1. On "Dev Tools" in Elasticsearch set a new max_result_window per index:
PUT indexname/_settings
{
      "index.max_result_window": 30000 # example of 30000 documents
}
  1. For the search command: Use with from and size:
res = elastic_client.search(index=index_bu, request_timeout=10, 
body={
  "from": 0, # get from number of document  
  "size": 15000, # how much documents
  "query": {"match_all": {}}
})
  1. The next request will be "from": 15000, "size": 15000
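
A sketch of the resulting paging loop in Python (client setup and index name are placeholders; the 30,000 ceiling matches the setting above):

from elasticsearch import Elasticsearch

elastic_client = Elasticsearch(["http://localhost:9200"])

all_hits = []
page_size = 15000
for offset in range(0, 30000, page_size):  # from + size stays within index.max_result_window
    res = elastic_client.search(index="indexname", request_timeout=10, body={
        "from": offset,
        "size": page_size,
        "query": {"match_all": {}},
    })
    all_hits.extend(res["hits"]["hits"])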
– Yakir GIladi Edry