I noticed intermittent errors in some applications hitting Vault, of the form {"errors":["local node not active but active cluster node not found"]}.

I managed to reproduce it locally by running

▶ curl --request POST --data '{"jwt": "<my_jwt_token>", "role": "<my_role>"}' https://my_vault_instance:8813/v1/auth/kubernetes/login

repeatedly.

Out of roughly 4000 requests, four returned the error above. Although this seems rare, in a production environment with multiple entities making requests to Vault it can be an issue.
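
A minimal sketch of that repeated-request loop, assuming POSIX sh and curl. `VAULT_ADDR`, `JWT`, and `ROLE` are placeholder variables, and `send_logins` and `count_ha_errors` are hypothetical helper names, not part of Vault:

```shell
# Error string copied from the response shown in the question.
HA_ERR='local node not active but active cluster node not found'

# Read response bodies on stdin and print how many lines contain the error.
# grep -c exits non-zero when the count is 0, hence the "|| true".
count_ha_errors() {
  grep -c -F "$HA_ERR" || true
}

# Send $1 login requests and print each response body on its own line.
# VAULT_ADDR, JWT and ROLE are placeholders to be set by the caller.
send_logins() {
  n="$1"
  i=0
  while [ "$i" -lt "$n" ]; do
    curl --silent --request POST \
      --data "{\"jwt\": \"${JWT}\", \"role\": \"${ROLE}\"}" \
      "${VAULT_ADDR}/v1/auth/kubernetes/login"
    echo
    i=$((i + 1))
  done
}

# Example (not run here): send_logins 4000 | count_ha_errors
```

Counting matches this way gives an error rate that can be compared before and after any configuration change.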

I have a single-instance Vault deployed on GCE using the official module. I also have HA enabled (based on CPU usage). At no point does the machine's CPU usage exceed 50%.

What can be causing this?

pkaramol
  • Sounds like nodes can't reach quorum/consensus. I assume you tested networking from each node to each other's cluster ip:port. Please update your question with the Vault version (Enterprise or Open Source), Storage mechanism and HA mechanism if different from storage. Maybe your leader crashes OOM often enough that a leader election is in progress while requests are coming in, increasing the likelihood of this error. – ixe013 Mar 24 '23 at 20:24
  • I don't see any CPU usage above ~50%, so how can OOM be the cause? I am using open source Vault with the GCS backend – pkaramol Mar 24 '23 at 21:04
  • Out Of Memory does not need high CPU usage to occur. Regardless, my point is that the error you are getting might be a symptom of your pods restarting all the time, whatever the reason. The likelihood is greater with Vault open source because even read operations need an elected leader. Maybe check whether an election was in progress when these errors occur? – ixe013 Mar 26 '23 at 12:14
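
One way to act on the suggestion in the last comment is to log the node's HA state while the login requests are running. The sketch below assumes POSIX sh and curl, and polls Vault's `/v1/sys/health` endpoint, mapping its default HTTP status codes to a state. `VAULT_ADDR` is a placeholder, and `describe_health_code` and `probe_health` are hypothetical helper names:

```shell
# Map the default status codes of GET /v1/sys/health to a readable state.
describe_health_code() {
  case "$1" in
    200) echo "active" ;;
    429) echo "standby" ;;
    472) echo "dr-secondary" ;;
    473) echo "performance-standby" ;;
    501) echo "not-initialized" ;;
    503) echo "sealed" ;;
    *)   echo "unexpected" ;;
  esac
}

# Print a UTC timestamp, the raw status code, and the mapped state.
# curl reports 000 when no HTTP response was received.
probe_health() {
  code=$(curl --silent --output /dev/null --write-out '%{http_code}' \
    "${VAULT_ADDR}/v1/sys/health")
  echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) ${code} $(describe_health_code "$code")"
}

# Example (not run here): while true; do probe_health; sleep 1; done
```

If the timestamps of non-active states line up with the failed logins, that would support the idea that the node was not the active leader at those moments.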

0 Answers