Atomic alias swap fails with index_not_found_exception on a totally unrelated index

Question

I want to replace an index with zero downtime, as described in the ES documentation.

I am doing so by:

1. creating a new index my_index_v2 with the new data
2. refreshing the new index
3. swapping them in an atomic operation, by performing the following request:

POST /_aliases

{
    "actions": [
        { "remove": { "index": "*", "alias": "my_index" }},
        { "add":    { "index": "my_index_v2", "alias": "my_index" }}
    ]
}
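For reference, the same request body can be built programmatically. This is only a minimal sketch: the helper name `build_swap_body` and its `remove_from` parameter are hypothetical, and it produces the body for `POST /_aliases` without sending anything.

```python
import json


def build_swap_body(alias: str, new_index: str, remove_from: str = "*") -> dict:
    """Build the body for POST /_aliases that repoints `alias` at `new_index`.

    `remove_from` defaults to the wildcard used in the question.
    """
    return {
        "actions": [
            {"remove": {"index": remove_from, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


body = build_swap_body("my_index", "my_index_v2")
payload = json.dumps(body)  # string to send as the request body
```

Both actions travel in a single request, which is what makes the swap atomic from the client's point of view.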

This works as expected, except that it randomly fails with a 404 response. The error message is:

{
   "error": {
      "root_cause": ... (same)
      "type": "index_not_found_exception",
      "reason": "no such index",
      "resource.type": "index_or_alias",
      "resource.id": "my_unrelated_index_v13",
      "index": "my_unrelated_index_v13"
   },
   "status": 404
}

Afterwards, and only if the swap worked, we delete the now unused indices that were associated with this and only this alias.
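The selection of indices to delete could be sketched as below. This assumes a snapshot in the shape of a `GET /_alias` response, taken before the swap; the function name `indices_safe_to_delete` and the sample entries `shared_v1`, `other` and `my_unrelated_index` are made up for illustration.

```python
def indices_safe_to_delete(alias_map: dict, alias: str, keep: str) -> list:
    """Return indices whose only alias is `alias`, excluding the new index `keep`.

    `alias_map` is assumed to look like a GET /_alias response:
    {index_name: {"aliases": {alias_name: {...}}}}
    """
    return sorted(
        index
        for index, info in alias_map.items()
        if index != keep and set(info.get("aliases", {})) == {alias}
    )


# Hypothetical snapshot taken before the swap
snapshot = {
    "my_index_v1": {"aliases": {"my_index": {}}},
    "my_index_v2": {"aliases": {}},
    "my_unrelated_index_v13": {"aliases": {"my_unrelated_index": {}}},
    "shared_v1": {"aliases": {"my_index": {}, "other": {}}},
}
old = indices_safe_to_delete(snapshot, "my_index", keep="my_index_v2")
```

Here only `my_index_v1` is selected: indices carrying any other alias, or none at all, are left alone.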

The whole operation runs periodically, every few minutes. Similar operations might run at the same time in the cluster, on other aliases/indices. The error happens randomly, once every several hours.

Is there a reason why these operations would interfere with each other? What is going on?

EDIT: clarified the DELETE step at the end.

@Lupanoide that would probably work but it's not what I want. I need the alias to point to one index only. What do you think this would prove anyway? — istepaniuk, Mar 05 '18 at 09:01
are you sure that there is no template, where your "my_unrelated_index_v13" defines aliases in the background? — Eirini Graonidou, Mar 05 '18 at 10:44
@Eirini there are no templates at all. "my_unrelated_index_v13" is an index that could have been deleted concurrently in a similar operation. Our workers are single threaded PHP CLI commands, so this interaction is definitely happening within the cluster. — istepaniuk, Mar 05 '18 at 11:42
if you do execute delete index commands and then later you try to remove an index from an alias (and that index has already been deleted), then you get a 404 as expected. I am not sure where is the problem with that. — Eirini Graonidou, Mar 05 '18 at 12:47
@EiriniGraonidou The response in the question is for the POST request in the question. There is nothing in the request mentioning the offending index, it is unexpected to get an error about something you are not even asking about, and shouldn't matter for the requested atomic operation. — istepaniuk, Mar 05 '18 at 13:12
@istepaniuk you said `"my_unrelated_index_v13" is an index that could have been deleted concurrently in a similar operation`. Are you sure this is the concurrent operation that _could have led_ to the index_not_found situation? For your reference, the deletion of an index is visible in the logs of the elected master node. — Andrei Stefan, Mar 06 '18 at 07:59
@AndreiStefan the unrelated index is any index that's concurrently going through this same process (`create new` -> `remove+add` -> `delete old`); these are the only operations we do. The swap never fails when run alone, only when concurrent. I am not certain whether this `remove+add` fails with the concurrent `remove+add` or with the `delete old`. I suspect the latter, as it is of course a delete. AFAIK deleting other indices, even ALL other ones, should not affect this operation. — istepaniuk, Mar 06 '18 at 12:28
I could not manage to reproduce the error, but I do believe you, because I saw other posts about Elasticsearch not returning the correct bad request error message. See https://github.com/elastic/elasticsearch/pull/23153; it seems to be fixed in version 5.3. Could you give it a shot? — Eirini Graonidou, Mar 08 '18 at 19:47
@EiriniGraonidou I can't reproduce on my laptop either. The error happens on a production cluster that has considerable load. I will try to isolate this on 2.4 first. It looks indeed like an ES bug. Our current workaround is simply retrying the operation. Nasty. — istepaniuk, Mar 10 '18 at 18:19

score 0 · Accepted Answer · answered Feb 17 '23 at 09:11

This is difficult to reproduce in a local environment because it seems to happen only in highly concurrent scenarios. However, as pointed out by @Eirini Graonidou in the comments, this really looks like an ES bug, solved in PR 23153 (https://github.com/elastic/elasticsearch/pull/23153).

From the pull request (emphasis mine):

This either leads to puzzling responses when a bad request is sent to Elasticsearch (if an index named "bad-request" does not exist then it produces an index not found exception and otherwise responds with the index settings for the index named "bad-request").

This does not explain the "bad request" situation, but definitely explains why the error message does not make sense.

More importantly: upgrading Elasticsearch solves this issue.
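Where an upgrade is not immediately possible, the retry workaround mentioned in the comments can be sketched as follows. Everything named here is hypothetical: `IndexNotFound` stands in for whatever your client raises on the 404, and `send_swap` is any callable that performs the `POST /_aliases` request.

```python
import time


class IndexNotFound(Exception):
    """Hypothetical stand-in for a 404 index_not_found_exception response."""


def swap_with_retry(send_swap, attempts=3, delay=0.5, sleep=time.sleep):
    """Call `send_swap` until it succeeds or `attempts` is exhausted.

    Only IndexNotFound is retried; any other error propagates immediately.
    """
    for attempt in range(1, attempts + 1):
        try:
            return send_swap()
        except IndexNotFound:
            if attempt == attempts:
                raise  # give up after the last attempt
            sleep(delay)


# Usage with a fake sender that fails once, as the real call occasionally does
calls = []


def flaky_swap():
    calls.append(1)
    if len(calls) == 1:
        raise IndexNotFound("my_unrelated_index_v13")
    return {"acknowledged": True}


result = swap_with_retry(flaky_swap, sleep=lambda seconds: None)
```

Retrying is reasonable here only because the swap is a single atomic request: a failed attempt leaves the alias where it was.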

I am getting the same issue with 8.6.2 version. If you rotate a lot of indexes at the same time, alias API returns 404 for unrelated indexes... — KiraLT, Aug 01 '23 at 19:00
