I am trying to reindex an index of 200M documents from cluster A to cluster B. I used the Reindex API with a remote source and everything worked fine. In the meantime, while the reindex was running, some documents were added to cluster A, so I want to add them to cluster B as well.

I launched the reindex request again, but the process is taking a long time, as if it were reindexing everything again.

My question is: is the cluster reindexing all the documents from scratch, even the ones that didn't change?

My Elasticsearch version is 5.6.

[Figures: indexing rate and document deletion rate charts]

ale_tri

1 Answer

Elasticsearch does not track whether a document has changed since the previous reindex, so running the same request again copies every document from the source index again. If your data has a field such as insert_time, you can add a query to the reindex request so that only the documents added to A after your first run are copied to B. That lets you keep the result of your earlier reindex and finish much faster. A reindex with a query would look something like this:

POST _reindex
{
  "source": {
    "index": "A",
    "query": {
      "range": {
        "insert_time": {
          "gt": "time you want"
        }
      }
    }
  },
  "dest": {
    "index": "B"
  }
}
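
Since the question is about reindexing from a remote cluster, here is a minimal Python sketch that builds the same kind of request body with a "remote" block added to the source. The helper name build_reindex_body, the host http://cluster-a.example:9200, and the cutoff date are all hypothetical placeholders, and the sketch assumes your documents really do carry an insert_time field. It only prints the JSON; it does not call the cluster. The remote host would also need to be listed in reindex.remote.whitelist on cluster B.

```python
import json


def build_reindex_body(source_index, dest_index, time_field, newer_than,
                       remote_host=None):
    """Build a _reindex body that copies only documents whose time_field
    is strictly greater than newer_than."""
    source = {
        "index": source_index,
        "query": {"range": {time_field: {"gt": newer_than}}},
    }
    if remote_host is not None:
        # Reindex-from-remote: the source cluster address goes under "remote".
        source["remote"] = {"host": remote_host}
    return {"source": source, "dest": {"index": dest_index}}


# Hypothetical values, replace them with your own host and cutoff time.
body = build_reindex_body(
    "A", "B", "insert_time", "2020-10-19T00:00:00Z",
    remote_host="http://cluster-a.example:9200",
)
print("POST _reindex")
print(json.dumps(body, indent=2))
```

The printed body can be pasted into the console on cluster B. Using "gt" with the time of your first run means documents already copied are skipped by the query rather than read and rewritten.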
Saeed Nasehi
  • @Alessio Trivisonno: You can use it by modifying the query to point to the added data – Bouraoui KACEM Oct 20 '20 at 12:37
  • Thank you for your answers. As I don't have a timestamp value, I couldn't use this solution. For the future, it would be a good idea to add a default _timestamp as suggested here: https://stackoverflow.com/questions/17136138/how-to-make-elasticsearch-add-the-timestamp-field-to-every-document-in-all-indic – ale_tri Oct 20 '20 at 13:27
  • You're welcome, my friend! – Saeed Nasehi Oct 20 '20 at 15:41
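
Following up on the comment about adding a default timestamp: one possible way to do this on 5.x is an ingest pipeline with a set processor that writes the ingest time into each document. The sketch below only builds and prints the pipeline definition. The helper name build_timestamp_pipeline, the pipeline id add-insert-time, and the field name insert_time are hypothetical choices for illustration, not something taken from the linked question.

```python
import json


def build_timestamp_pipeline(field="insert_time"):
    """Build an ingest pipeline definition that stamps each document
    with the time it passed through the ingest node."""
    return {
        "description": "Stamp each document with its ingest time",
        "processors": [
            # {{_ingest.timestamp}} is filled in by the ingest node.
            {"set": {"field": field, "value": "{{_ingest.timestamp}}"}}
        ],
    }


pipeline = build_timestamp_pipeline()
# Hypothetical pipeline id; pick any name you like.
print("PUT _ingest/pipeline/add-insert-time")
print(json.dumps(pipeline, indent=2))
```

Documents indexed with ?pipeline=add-insert-time would then carry an insert_time value that a later reindex can filter on with a range query.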