
I am using the AWS Elasticsearch service (version 6.3). I want to change the mapping while reindexing data from current_index to new_index. I am not trying to upgrade from an older Elasticsearch cluster to a new one; both current_index and new_index are on the same Elasticsearch 6.3 cluster.
I am trying to perform a reindex-in-place operation by following the Elastic documentation.
My index contains about 250k searchable documents. When I POST a _reindex request using curl:

curl -X POST "aws_elasticsearch_endpoint/_reindex" -H 'Content-Type: application/json' -d'
{
  "source": {
    "index": "current_index"
  },
  "dest": {
    "index": "new_index"
  }
}
'

Elasticsearch starts the reindex process (I verify this with GET /_cat/indices?v), but curl ends with the error curl: (56) Unexpected EOF. The reindex operation itself works fine: after about 2 hours, the docs.count of new_index matches that of current_index and the status turns green.


If I POST _reindex from Java, I get this error:

java.net.SocketException: Unexpected end of file from server

The Reindex API returns successfully, as the documentation describes, only when the number of documents in my index is small (I tried with about 1k searchable documents).

Ganesh kudva

2 Answers


This happens because the response takes a long time to come back and curl times out. On small data sets, the response arrives before the timeout, which is why you get a response in that case.

When curl times out, the reindex is still in progress, though, and you can still see how the reindex is doing using this command:

GET _tasks?actions=*reindex&detailed=true

You can also add ...?wait_for_completion=false to your curl command. ES will then create a background task for the reindex operation. The curl command returns right away with a task ID that you can use to check the state of the reindex periodically through the Task API:

GET .tasks/task/<taskId>

Also note that in this case, once the task is done, you need to remove the task document from the .tasks index yourself; ES will not do it for you.
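
The steps above could be scripted roughly as follows. This is only a sketch under a few assumptions: ES_ENDPOINT is a placeholder for your domain endpoint, the helper names (extract_task_id, is_completed, start_reindex, wait_for_task) are made up for illustration, the endpoint accepts unsigned requests as in the question, and the JSON is picked apart with sed and grep rather than a real JSON parser. It polls GET _tasks/<taskId>, which reports a "completed" flag, so every request stays short.

```shell
#!/bin/sh
# Placeholder endpoint (assumption): replace with your own domain URL.
ES_ENDPOINT="${ES_ENDPOINT:-aws_elasticsearch_endpoint}"

# Pull the task id out of a response such as {"task":"nodeId:12345"}.
extract_task_id() {
  printf '%s' "$1" | sed -n 's/.*"task"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Succeed when a Task API response reports "completed": true.
is_completed() {
  printf '%s' "$1" | grep -q '"completed"[[:space:]]*:[[:space:]]*true'
}

# Start the reindex as a background task and print its task id.
# $1 = source index, $2 = destination index
start_reindex() {
  response=$(curl -s -X POST "$ES_ENDPOINT/_reindex?wait_for_completion=false" \
    -H 'Content-Type: application/json' \
    -d "{\"source\":{\"index\":\"$1\"},\"dest\":{\"index\":\"$2\"}}")
  extract_task_id "$response"
}

# Poll the Task API until the task is done.
# $1 = task id, $2 = seconds between polls (defaults to 30)
# Note: there is no overall timeout or error handling in this sketch.
wait_for_task() {
  until is_completed "$(curl -s "$ES_ENDPOINT/_tasks/$1")"; do
    sleep "${2:-30}"
  done
}
```

A run would then look like task_id=$(start_reindex current_index new_index) followed by wait_for_task "$task_id", and finally curl -s -X DELETE "$ES_ENDPOINT/.tasks/task/$task_id" to clean up the stored task document, assuming your domain allows access to the .tasks index.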

Val
  • I suspect this is not due to a curl timeout (i.e. client side), since that can be configured with the -m option if the client wants blocking behavior. This is most probably due to a limitation of the AWS Elasticsearch ELB. – keety Mar 02 '19 at 06:17
  • Yet the reindex operation is still ongoing in the background, and the OP can check its status using the Task API. – Val Mar 02 '19 at 06:18
  • What I mean is that the above answer seems to suggest that increasing the client timeout is an option. It is not, since the AWS Elasticsearch ELB timeout is not configurable. – keety Mar 02 '19 at 06:23
  • Nope, I never suggested increasing the timeout. I just said that the reindex operation can take time and the client will eventually time out if it lasts too long. There's usually no gain in waiting for the response to come back, since you can inspect the status of the background task using the Task API. – Val Mar 02 '19 at 06:25
  • The answer states "curl times out". At least I get the impression that you are alluding to a client-side timeout, which is not the case here. – keety Mar 02 '19 at 06:31
  • We don't care whether curl was instructed to time out after a given time or not. The point is that **from the client perspective**, a timeout is what happens, whether initiated by curl or forced by AWS. We don't really care about that timeout; all we want to know is that the background operation is still going on. You're reading too much between the lines. – Val Mar 02 '19 at 06:36
  • I don't think the OP is asking **how to check if reindexing is occurring in the background?**. It is more about why he is unable to do a blocking reindex operation. If it were a client-side timeout, it could be configured manually. Since it is server side, it cannot. Anyway, I have added my own answer to cover this, so this debate seems pointless. – keety Mar 02 '19 at 06:47
  • I'm not the one who started it, yet we both agree that it is pointless, indeed, since your answer points to mine ;-) – Val Mar 02 '19 at 07:54

The AWS Elasticsearch ELB (Elastic Load Balancer) has a timeout of 60 seconds. This is not configurable at the moment and has been a long-standing feature request.
You can find more details in this AWS forum thread.

As a result, any operation that takes more than 60 seconds, in this case a reindex, ends in a gateway timeout.
It is therefore not possible to block on a long-running reindex by increasing the client timeout.

For the Reindex API, the workaround is the one suggested by @Val above: use the wait_for_completion=false flag and follow the steps in the Reindex API documentation: https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-reindex.html#_url_parameters_3

keety