3

How to retrieve all the document ids (the internal document '_id') from an Elasticsearch index? if I have 20 million documents in that index, what is the best way to do that?

Wei Shen
  • Are you working with a particular language or client library to communicate with elastic? – jheth Aug 26 '14 at 01:09
  • http://stackoverflow.com/questions/17497075/efficient-way-to-retrieve-all-ids-in-elasticsearch – coderz Nov 11 '15 at 22:39

3 Answers

3

For that many documents, you probably want to use the scan and scroll API.

Many client libraries provide ready-made helpers for it. For example, with elasticsearch-py you can do:

import elasticsearch
import elasticsearch.helpers

es = elasticsearch.Elasticsearch(eshost)
# Request only the _id metadata so each hit stays small
scroll = elasticsearch.helpers.scan(es, query={"fields": ["_id"]}, index=idxname, scroll='10s')
for res in scroll:
    print(res['_id'])
Anton
  • scan is deprecated in ES 2.1.0. So, we might need to use scroll API only. https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-search-type.html#scan – sangheestyle Mar 17 '16 at 19:05
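Following up on the deprecation note in the comment above, here is a minimal sketch that drives the scroll API directly with elasticsearch-py instead of the scan search type. eshost and idxname are the same placeholders as in the answer, and the "_source": False trick assumes a 2.x-era request body; adjust for your version.

import elasticsearch

es = elasticsearch.Elasticsearch(eshost)

# Open a scroll context; suppress _source so each hit carries only metadata
resp = es.search(index=idxname, body={"query": {"match_all": {}}, "_source": False},
                 scroll='2m', size=1000)

while resp['hits']['hits']:
    for hit in resp['hits']['hits']:
        print(hit['_id'])
    # Fetch the next batch using the scroll id returned by the previous call
    resp = es.scroll(scroll_id=resp['_scroll_id'], scroll='2m')

# Release the scroll context once finished
es.clear_scroll(scroll_id=resp['_scroll_id'])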
3

I would just export the entire index and read it off the file system. My experience with size/from and scan/scroll has been a disaster when querying result sets in the millions. It just takes too long.

If you can use a tool like knapsack, you can export the index to the file system and iterate through the directories. Each document is stored under its own directory named after its _id. There is no need to actually open any files; just iterate through the directories.

link to knapsack: https://github.com/jprante/elasticsearch-knapsack
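For illustration only, a minimal sketch of collecting the ids from such an export, assuming knapsack has unpacked the index under export_dir with one directory per document named after its _id (a hypothetical layout; check what your knapsack version actually writes):

import os

export_dir = '/path/to/knapsack/export'  # hypothetical location of the unpacked index

# Each document is assumed to live in its own directory named after its _id,
# so collecting the ids is just listing the directory names.
doc_ids = [name for name in os.listdir(export_dir)
           if os.path.isdir(os.path.join(export_dir, name))]

for doc_id in doc_ids:
    print(doc_id)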

edit: hopefully you are not doing this often... or this may not be a viable solution

coffeeaddict
0

First, issue a request to get the total count of documents in the index.

curl -X GET 'http://localhost:9200/documents/document/_count?pretty=true'

{
  "count" : 1408,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  }
}

Then you'll want to loop through the set using a combination of the size and from parameters until you reach the total count. Passing an empty fields parameter will return only the hit metadata (including the _id) that you're interested in.

Find a page size that you can consume without running out of memory, and increment from on each iteration.

curl -X GET 'http://localhost:9200/documents/document/_search?fields=&size=1000&from=5000'

Example item response:

{
  "_index" : "documents",
  "_type" : "document",
  "_id" : "1341",
  "_score" : 1.0
},
...
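As a rough sketch, the same loop with elasticsearch-py might look like the following. It assumes the documents index from the example above and the 1.x-era fields/from parameters, which later versions handle differently.

import elasticsearch

es = elasticsearch.Elasticsearch('http://localhost:9200')
page_size = 1000  # tune to what you can hold in memory

# Total number of documents, as in the _count request above
total = es.count(index='documents', doc_type='document')['count']

ids = []
for offset in range(0, total, page_size):
    # Empty fields list: only hit metadata (_index, _type, _id, _score) comes back
    resp = es.search(index='documents', doc_type='document',
                     body={'fields': []}, size=page_size, from_=offset)
    ids.extend(hit['_id'] for hit in resp['hits']['hits'])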
jheth
  • Deep pagination using size and from is very heavy. When you get to the "?size=1000&from=19999000" you will realize. – Anton Aug 26 '14 at 09:25
  • Thanks Anton I haven't tried this on such a large data set. What do you recommend instead? – jheth Aug 26 '14 at 21:09
  • I recommend the scan and scroll API as mentioned in my answer. – Anton Aug 27 '14 at 10:06