3

How to retrieve all the document ids (the internal document '_id') from an Elasticsearch index? if I have 20 million documents in that index, what is the best way to do that?

Wei Shen
  • Are you working with a particular language or client library to communicate with elastic? – jheth Aug 26 '14 at 01:09
  • http://stackoverflow.com/questions/17497075/efficient-way-to-retrieve-all-ids-in-elasticsearch – coderz Nov 11 '15 at 22:39

3 Answers

3

For that many documents, you probably want to use the scan and scroll API.

Many client libraries provide ready-made helpers for it. For example, with elasticsearch-py you can do:

import elasticsearch
import elasticsearch.helpers

es = elasticsearch.Elasticsearch(eshost)
# Request only the _id metadata so each hit stays small
scroll = elasticsearch.helpers.scan(es, query={"fields": ["_id"]}, index=idxname, scroll='10s')
for res in scroll:
    print(res['_id'])
Anton
  • scan is deprecated in ES 2.1.0. So, we might need to use scroll API only. https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-search-type.html#scan – sangheestyle Mar 17 '16 at 19:05
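Following up on the deprecation note in the comment above, here is a minimal sketch that drives the scroll API directly with elasticsearch-py instead of the scan search type. eshost and idxname are the same placeholders as in the answer, and the "_source": False trick assumes a 2.x-era request body; adjust for your version.

import elasticsearch

es = elasticsearch.Elasticsearch(eshost)

# Open a scroll context; suppress _source so each hit carries only metadata
resp = es.search(index=idxname, body={"query": {"match_all": {}}, "_source": False},
                 scroll='2m', size=1000)

while resp['hits']['hits']:
    for hit in resp['hits']['hits']:
        print(hit['_id'])
    # Fetch the next batch using the scroll id returned by the previous call
    resp = es.scroll(scroll_id=resp['_scroll_id'], scroll='2m')

# Release the scroll context once finished
es.clear_scroll(scroll_id=resp['_scroll_id'])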
3

I would just export the entire index and read it off the file system. My experience with size/from and scan/scroll has been a disaster when querying result sets in the millions. It just takes too long.

If you can use a tool like knapsack, you can export the index to the file system and iterate through the directories. Each document is stored under its own directory named after its _id. There is no need to actually open any files; just iterate through the directories.

link to knapsack: https://github.com/jprante/elasticsearch-knapsack
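For illustration only, a minimal sketch of collecting the ids from such an export, assuming knapsack has unpacked the index under export_dir with one directory per document named after its _id (a hypothetical layout; check what your knapsack version actually writes):

import os

export_dir = '/path/to/knapsack/export'  # hypothetical location of the unpacked index

# Each document is assumed to live in its own directory named after its _id,
# so collecting the ids is just listing the directory names.
doc_ids = [name for name in os.listdir(export_dir)
           if os.path.isdir(os.path.join(export_dir, name))]

for doc_id in doc_ids:
    print(doc_id)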

edit: hopefully you are not doing this often... or this may not be a viable solution

coffeeaddict
0

First, issue a request to get the total count of documents in the index.

curl -X GET 'http://localhost:9200/documents/document/_count?pretty=true'

{
  "count" : 1408,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  }
}

Then you'll want to loop through the set using a combination of the size and from parameters until you reach the total count. Passing an empty fields parameter will return only the hit metadata (including the _id) that you're interested in.

Find a page size that you can consume without running out of memory, and increment from on each iteration.

curl -X GET 'http://localhost:9200/documents/document/_search?fields=&size=1000&from=5000'

Example item response:

{
  "_index" : "documents",
  "_type" : "document",
  "_id" : "1341",
  "_score" : 1.0
},
...
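As a rough sketch, the same loop with elasticsearch-py might look like the following. It assumes the documents index from the example above and the 1.x-era fields/from parameters, which later versions handle differently.

import elasticsearch

es = elasticsearch.Elasticsearch('http://localhost:9200')
page_size = 1000  # tune to what you can hold in memory

# Total number of documents, as in the _count request above
total = es.count(index='documents', doc_type='document')['count']

ids = []
for offset in range(0, total, page_size):
    # Empty fields list: only hit metadata (_index, _type, _id, _score) comes back
    resp = es.search(index='documents', doc_type='document',
                     body={'fields': []}, size=page_size, from_=offset)
    ids.extend(hit['_id'] for hit in resp['hits']['hits'])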
jheth
  • Deep pagination using size and from is very heavy. When you get to the "?size=1000&from=19999000" you will realize. – Anton Aug 26 '14 at 09:25
  • Thanks Anton I haven't tried this on such a large data set. What do you recommend instead? – jheth Aug 26 '14 at 21:09
  • I recommend the scan and scroll API as mentioned in my answer. – Anton Aug 27 '14 at 10:06