Efficient way to retrieve all _ids in ElasticSearch

Question

What is the fastest way to get all _ids of a certain index from ElasticSearch? Is it possible with a simple query? One of my indices has around 20,000 documents.

I found [this](https://github.com/elastic/elasticsearch/issues/17159) very helpful. — shellbye, Apr 14 '17 at 02:36

score 86 · Accepted Answer · edited Jul 04 '23 at 11:05

Edit: Please also read the answer from Aleck Landgraf

You just want the elasticsearch-internal _id field? Or an id field from within your documents?

For the former, try

curl -H 'Content-Type: application/json' 'http://localhost:9200/index/type/_search?pretty=true' -d '
{ 
    "query" : { 
        "match_all" : {} 
    },
    "stored_fields": []
}
'

If you are using Elastic dev tools, use this instead:

GET <your-index-name>/_search
{ 
    "query" : { 
        "match_all" : {} 
    },
    "stored_fields": []
}

Note 2017 Update: The post originally included "fields": [] but since then the name has changed and stored_fields is the new value.

The result will contain only the "metadata" of your documents

{
  "took" : 7,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 4,
    "max_score" : 1.0,
    "hits" : [ {
      "_index" : "index",
      "_type" : "type",
      "_id" : "36",
      "_score" : 1.0
    }, {
      "_index" : "index",
      "_type" : "type",
      "_id" : "38",
      "_score" : 1.0
    }, {
      "_index" : "index",
      "_type" : "type",
      "_id" : "39",
      "_score" : 1.0
    }, {
      "_index" : "index",
      "_type" : "type",
      "_id" : "34",
      "_score" : 1.0
    } ]
  }
}
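Pulling the ids out of a response shaped like the one above takes a short list comprehension. A minimal sketch, assuming the response has already been parsed into a Python dict; the `extract_ids` name is illustrative and not part of any client library:

```python
def extract_ids(response):
    """Return the _id of every hit in a parsed _search response."""
    # .get() guards against a response that has no "hits" section at all.
    hits = response.get("hits", {}).get("hits", [])
    return [hit["_id"] for hit in hits]


# Trimmed-down copy of the response shown above.
sample = {
    "hits": {
        "total": 4,
        "hits": [
            {"_index": "index", "_type": "type", "_id": "36", "_score": 1.0},
            {"_index": "index", "_type": "type", "_id": "38", "_score": 1.0},
            {"_index": "index", "_type": "type", "_id": "39", "_score": 1.0},
            {"_index": "index", "_type": "type", "_id": "34", "_score": 1.0},
        ],
    }
}

print(extract_ids(sample))
```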

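For the latter, if you want to include a field from your document, add it to the fields array. On 5.x and later, where fields was removed, list it in stored_fields (if the field is stored) or in _source instead.

curl -H 'Content-Type: application/json' 'http://localhost:9200/index/type/_search?pretty=true' -d '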
{ 
    "query" : { 
        "match_all" : {} 
    },
    "fields": ["document_field_to_be_returned"]
}
'

Doing a straight query is not the most efficient way to do this. When you do a query, it has to sort all the results before returning it. Scroll and Scan mentioned in response below will be much more efficient, because it does not sort the result set before returning it. — aamiri, Oct 20 '16 at 15:58
Doesn't work anymore in 5.x, field `fields` was removed, instead, add `"_source": false` param. — Dzmitry Lazerka, Mar 24 '17 at 07:24
"field" is not supported in this query anymore by elasticsearch. use "stored_field" instead — Freak, Nov 16 '17 at 14:02
This will not return ids. For me trying on Elasticsearch 8.7, it returns 10000 results. — jdhao, Jul 04 '23 at 10:29

Aleck Landgraf · Answer 2 · 2017-06-08T18:08:55.940

Better to use scroll and scan to get the result list so elasticsearch doesn't have to rank and sort the results.

With the elasticsearch-dsl python lib this can be accomplished by:

from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search

es = Elasticsearch()
s = Search(using=es, index=ES_INDEX, doc_type=DOC_TYPE)

s = s.source([])  # only get ids; on elasticsearch-dsl before 5.0 this was `s.fields([])`
ids = [h.meta.id for h in s.scan()]

Console log:

GET http://localhost:9200/my_index/my_doc/_search?search_type=scan&scroll=5m [status:200 request:0.003s]
GET http://localhost:9200/_search/scroll?scroll=5m [status:200 request:0.005s]
GET http://localhost:9200/_search/scroll?scroll=5m [status:200 request:0.005s]
GET http://localhost:9200/_search/scroll?scroll=5m [status:200 request:0.003s]
GET http://localhost:9200/_search/scroll?scroll=5m [status:200 request:0.005s]
...

Note: scroll pulls batches of results from a query and keeps the cursor open for a given amount of time (1 minute, 2 minutes, which you can update); scan disables sorting. The scan helper function returns a python generator which can be safely iterated through.
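For readers who want to see what the scan helper does under the hood, here is a rough sketch of the scroll loop. It assumes a client object exposing `search`, `scroll` and `clear_scroll` with keyword arguments in the style of the elasticsearch Python client (7.x-style `body=`); the `iter_ids` name and its defaults are made up for illustration, and this is not the helper's actual implementation.

```python
def iter_ids(client, index, scroll="1m", size=1000):
    """Yield the _id of every document in `index`, one scroll page at a time."""
    # Sorting on _doc asks for index order, so no scoring or ranking is done.
    body = {"query": {"match_all": {}}, "sort": ["_doc"], "_source": False}
    page = client.search(index=index, body=body, scroll=scroll, size=size)
    scroll_id = page.get("_scroll_id")
    try:
        # An empty page means the cursor is exhausted.
        while page["hits"]["hits"]:
            for hit in page["hits"]["hits"]:
                yield hit["_id"]
            page = client.scroll(scroll_id=scroll_id, scroll=scroll)
            scroll_id = page.get("_scroll_id", scroll_id)
    finally:
        # Release the server-side cursor instead of waiting for it to time out.
        if scroll_id is not None:
            client.clear_scroll(scroll_id=scroll_id)
```

With a real client this would be consumed as `ids = list(iter_ids(es, "my_index"))`. Because it is a generator, nothing is requested until iteration starts.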

Method `fields` has been removed in version `5.0.0` (see: https://elasticsearch-dsl.readthedocs.io/en/latest/Changelog.html?highlight=fields(#id2). You should now use `s = s.source([]) `. — illagrenan, Dec 21 '16 at 12:11
search_type=scan deprecated since 2.1. ([https://www.elastic.co/guide/en/elasticsearch/reference/2.1/breaking_21_search_changes.html](https://www.elastic.co/guide/en/elasticsearch/reference/2.1/breaking_21_search_changes.html)) — aleha_84, Oct 10 '17 at 08:46

score 25 · Answer 3 · answered Nov 14 '16 at 04:25

For elasticsearch 5.x, you can use the "_source" field.

GET /_search
{
    "_source": false,
    "query" : {
        "term" : { "user" : "kimchy" }
    }
}

"fields" has been deprecated. (Error: "The field [fields] is no longer supported, please use [stored_fields] to retrieve stored fields or _source filtering if the field is not stored")
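Since the right spelling depends on the server version, the choice can be wrapped in a small function. This is only a sketch based on the behaviour described in this thread (`fields` before 5.0, `_source` filtering from 5.x on); `id_only_body` and its `major_version` argument are hypothetical names.

```python
def id_only_body(major_version, query=None):
    """Build a _search body that returns hit metadata but no document content.

    major_version is the Elasticsearch major version as an int, e.g. 2 or 7.
    """
    body = {"query": query if query is not None else {"match_all": {}}}
    if major_version < 5:
        body["fields"] = []        # pre-5.0 spelling
    else:
        body["_source"] = False    # 5.x and later
    return body


print(id_only_body(2))
print(id_only_body(5, {"term": {"user": "kimchy"}}))
```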

answered by Nav

Bonus points for adding the error text. Elasticsearch error messages mostly don't seem to be very googlable :( – AmericanUmlaut Nov 08 '17 at 01:52

score 16 · Answer 4 · edited Apr 26 '22 at 19:36

Elaborating on answers by Robert Lujo and Aleck Landgraf, if you want the IDs in a list from the returned generator, here is what I use:

from elasticsearch import Elasticsearch
from elasticsearch import helpers


es = Elasticsearch(hosts=[YOUR_ES_HOST])
hits = helpers.scan(
    es,
    query={"query":{"match_all": {}}},
    scroll='1m',
    index=INDEX_NAME
)
    
ids = [hit['_id'] for hit in hits]

Brian Low · Answer 5 · 2014-08-18T17:07:26.347

Another option

curl 'http://localhost:9200/index/type/_search?pretty=true&fields='

will return _index, _type, _id and _score.
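On newer versions the `fields` URL parameter is rejected, as the comments below report. A comparable URL can be put together from the `_source` and `filter_path` request parameters, which trim each hit down to its `_id`. A small sketch; the `ids_only_url` helper and its defaults are illustrative only:

```python
from urllib.parse import urlencode


def ids_only_url(base, index, size=10):
    """Build a _search URL whose response carries only the _id of each hit."""
    params = {
        "_source": "false",              # do not return the document body
        "filter_path": "hits.hits._id",  # strip everything except the ids
        "size": size,
    }
    return "%s/%s/_search?%s" % (base.rstrip("/"), index, urlencode(params))


print(ids_only_url("http://localhost:9200", "index"))
```

This is still a plain search, so it is limited to one page of results; for a whole index the scroll-based answers remain the better fit.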

answered Aug 18 '14 at 06:43

-1 Better to use scan and scroll when accessing more than just a few documents. This is a "quick way" to do it, but won't perform well and also might fail on large indices – PhaedrusTheGreek Apr 19 '16 at 20:04
On 6.2: "request ... contains unrecognized parameter: [fields]" – Serp C Feb 15 '18 at 20:44
Is there any way to get only _id field? – user29671 Aug 28 '18 at 09:59
`stored_fields` instead of `fields` for newer versions – Abhishek Kumar Jan 09 '19 at 12:20

Alex Moore-Niemi · Answer 6 · 2020-02-25T15:51:04.443

I know this post has a lot of answers, but I want to combine several to document what I've found to be fastest (in Python anyway). I'm dealing with hundreds of millions of documents, rather than thousands.

The helpers module can be used with sliced scroll, which allows multi-threaded execution. In my case, I also have a high-cardinality field (acquired_at) to slice on. You'll see I set max_workers to 14, but you may want to vary this depending on your machine.

Additionally, I store the doc ids in compressed format. If you're curious, you can check how many bytes your doc ids will be and estimate the final dump size.

import gzip
from concurrent import futures

from elasticsearch import helpers

# note: es, index, and cluster_name are assumed to be set already

max_workers = 14
scroll_slice_ids = list(range(max_workers))

def get_doc_ids(scroll_slice_id):
    count = 0
    with gzip.open('/tmp/doc_ids_%i.txt.gz' % scroll_slice_id, 'wt') as results_file:
        # "max" is the total number of slices; slice ids run from 0 to max - 1
        query = {"sort": ["_doc"], "slice": {"field": "acquired_at", "id": scroll_slice_id, "max": len(scroll_slice_ids)}, "_source": False}
        scan = helpers.scan(es, index=index, query=query, scroll='10m', size=10000, request_timeout=600)
        for doc in scan:
            count += 1
            results_file.write(doc['_id'] + '\n')
            results_file.flush()

    return count

if __name__ == '__main__':
    print('attempting to dump doc ids from %s in %i slices' % (cluster_name, len(scroll_slice_ids)))
    with futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        doc_counts = executor.map(get_doc_ids, scroll_slice_ids)
        # consuming the results also surfaces any exception raised in a worker
        print('dumped %i doc ids' % sum(doc_counts))
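To make the slice arithmetic explicit: every slice gets the same "max" (the total number of slices) and its own "id" between 0 and max - 1, and the slices together cover the whole index. A small sketch that only builds the query bodies; `slice_queries` is a hypothetical helper and the field name is simply the one used in this answer:

```python
def slice_queries(num_slices, field=None):
    """Return one scroll query body per slice of a sliced scroll."""
    if num_slices < 2:
        raise ValueError("sliced scroll needs at least 2 slices")
    queries = []
    for slice_id in range(num_slices):
        slice_spec = {"id": slice_id, "max": num_slices}
        if field is not None:
            slice_spec["field"] = field  # optional field to partition on
        queries.append({"sort": ["_doc"], "slice": slice_spec, "_source": False})
    return queries


for body in slice_queries(3, field="acquired_at"):
    print(body["slice"])
```

Each body can then be handed to its own worker. If fewer bodies are run than "max" says, the missing slices' documents are never fetched.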

If you want to follow along with how many ids are in the files, you can use unpigz -c /tmp/doc_ids_4.txt.gz | wc -l.
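The same count can be taken from Python without extra tools. A minimal sketch that assumes the files follow the `/tmp/doc_ids_<n>.txt.gz` naming used above, with one id per line; `count_ids` is an illustrative name:

```python
import glob
import gzip


def count_ids(pattern="/tmp/doc_ids_*.txt.gz"):
    """Return {path: number of ids} for every gzip dump matching pattern."""
    counts = {}
    for path in sorted(glob.glob(pattern)):
        # "rt" opens the compressed file in text mode, one id per line.
        with gzip.open(path, "rt") as handle:
            counts[path] = sum(1 for _ in handle)
    return counts
```

`sum(count_ids().values())` then gives the running total across all slices, which can be compared with the index's document count once the dump has finished.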

Is it possible to use multiprocessing approach but skip the files and query ES directly? — ruslaniv, Nov 30 '22 at 08:37
Of course, you just remove the lines related to saving the output of the queries into the file (anything with `results_file` var). — Alex Moore-Niemi, Dec 01 '22 at 03:35
For some reason it returns as many document id's as many workers I set. So if I set 8 workers it returns only 8 ids — ruslaniv, Dec 01 '22 at 06:45

score 3 · Answer 7 · edited Jun 10 '20 at 19:25

For Python users: the Python Elasticsearch client provides a convenient abstraction for the scroll API:

from elasticsearch import Elasticsearch, helpers
client = Elasticsearch()

query = {
    "query": {
        "match_all": {}
    }
}

scan = helpers.scan(client, index=index, query=query, scroll='1m', size=100)

for doc in scan:
    print(doc['_id'])  # do something with each hit

answered Nov 26 '19 at 15:47 by sdcbr; edited by sumit kumar

score 2 · Answer 8 · answered May 28 '15 at 07:24

You can also do it in Python, which gives you a proper list:

import elasticsearch
es = elasticsearch.Elasticsearch()

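res = es.search(
    index=your_index,
    # a size above 10000 needs a raised index.max_result_window on newer versions;
    # on 5.x and later, replace "fields": ["_id"] with "_source": False
    body={"query": {"match_all": {}}, "size": 30000, "fields": ["_id"]})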

ids = [d['_id'] for d in res['hits']['hits']]

answered by Alix Martin

question was "Efficient way to retrieve all _ids in ElasticSearch". You set it to 30000 ... What if you have 4000000000000000 records!!!??? – pregmatch Dec 22 '21 at 11:57

score 2 · Answer 9 · answered Jan 16 '16 at 22:39

Inspired by @Aleck-Landgraf's answer, what worked for me was using the scan function from the standard elasticsearch Python API directly:

from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan
es = Elasticsearch()
for dobj in scan(es,
                 query={"query": {"match_all": {}}, "fields": []},
                 index="your-index-name", doc_type="your-doc-type"):
    print(dobj["_id"], end=" ")

score 0 · Answer 10 · answered Dec 08 '20 at 01:55

This is working!

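def select_ids(self, **kwargs):
    """
    Return the _id of every document matching the query built from kwargs.

    :param kwargs: params from modules; must include 'index'
    :return: list of document ids, or None if no index was given
    """
    index = kwargs.get('index')
    if not index:
        return None

    query = self._build_query(**kwargs)

    # stored_fields=[] drops the document body; filter_path trims the
    # response down to the _id of each hit
    results = self._db_client.search(body=query, index=index, stored_fields=[], filter_path="hits.hits._id")
    # with filter_path, an empty result set may come back without a "hits" key
    hits = results.get('hits', {}).get('hits', [])
    return [hit['_id'] for hit in hits]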

score -4 · Answer 11 · edited Mar 01 '19 at 10:03

Url -> http://localhost:9200/<index>/<type>/_search
http method -> GET
Query -> {"query": {"match_all": {}}, "size": 30000, "fields": ["_id"]}

answered Oct 04 '16 at 08:47 by Ankireddy Polu; edited by Anth12

inefficient, especially if the query was able to fetch documents more than 10000 – Suomynona Nov 22 '19 at 03:30
