I am trying to solve a performance issue. I am reindexing on the fly from a source index in AWS managed Elasticsearch 6.2 to a destination index. The source index is currently hundreds of GB in size and is likely to be larger in production, so the reindex will take some time to complete. Business requirements call for minimizing that time as much as possible. From what I have read, these are some of the things I can do to speed up a reindex:

1) Use a judicious number of slices relative to the number of shards, for parallelism (e.g. with 10 shards, ideally run no more than 10 slices; anything beyond that is waste and potential overhead).

2) Do not have replica shards on the destination index if you don't need them; replicas add work when writing data to the cluster.

3) Use the correct EC2 instance types in the cluster to accomplish this task

4) Only copy what information you need from the source index to reindex.

Point #4 above is where I need guidance. I am using the Jest API (v5.3.3) in Java 8. Is there a way to run a _reindex request that copies only one or two fields from the _source, so that the data I write to the destination index is only a fraction of the size of the source?

Nkosi
BPS

1 Answer

It looks like this is indeed possible, at least in Kibana: I successfully ran a reindex just by adding _source inside the source parameter. In case that sounds confusing, here is the query that worked for me:

POST _reindex?slices=10&wait_for_completion=false
{ "conflicts": "proceed",
  "source":{
    "index": "my_source_idx",
    "_source" : "fieldICareAbout",
    "query": { "bool": {
      "filter": { "bool" : { "must" : [
        { "nested": { "path": "medications", "query": { "bool": { "must":[
           { "terms" : { "mds.rowKey": ["USC_4886F"]} },
           { "range" : { "mds.dates" : { "lte": "2018-01-01", "gte": "2010-08-01"} } },
           { "range" : { "mds.datesCount" : { "gte": 2} } },
           { "script" : { "script" : { "id": "min-occurrence-gap-days-criteria-nested", 
              "params" : {"min_occurs" : 1, "dateField": "mds.dates", "rowKey": ["USC_4886F"], "fromDate": "2010-08-01", "toDate": "2018-01-01", "gapDays": 0}}}}
        ]}}}}
      ]}}
    }}
  },
  "dest": {
    "index": "my_dest_index"
  }
}
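
If more than one field is needed, the Elasticsearch reindex documentation shows that _source inside the source block can also be given as a list of field names. Below is a minimal sketch of the body for POST _reindex, with the query left out for brevity. The name anotherField is a hypothetical second field used only for illustration, and this variant has not been run against the cluster described in the question:

```json
{
  "conflicts": "proceed",
  "source": {
    "index": "my_source_idx",
    "_source": ["fieldICareAbout", "anotherField"]
  },
  "dest": {
    "index": "my_dest_index"
  }
}
```

Since the reindex API takes a plain JSON body, the same document should be usable from any client that can send raw JSON, though how to pass it through Jest 5.3.3 is not something verified here.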
BPS