
I’m writing to ask for help in improving the custom plugin upgrade process for our Kubernetes StatefulSet running Vault.

Our current setup is as follows:

  • We have developed our own plugins for Vault.
  • We have 3 replicas of the Vault pod in the StatefulSet with the “RollingUpdate” strategy.
  • When a pod starts, its init container checks whether it ships a new plugin version and, if so, upgrades the plugin by registering the new checksum (a minimal sketch of this check follows the list).
  • The main pod container just runs the Vault server.
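
For reference, here is a minimal sketch of what the init container's check looks like, assuming the Vault Go client (github.com/hashicorp/vault/api); the plugin name and binary path are placeholders rather than our real values:

```go
// Sketch of the init container's version check, assuming the Vault Go
// client (github.com/hashicorp/vault/api). Plugin name and path are
// placeholders for illustration.
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"log"
	"os"

	"github.com/hashicorp/vault/api"
)

// sha256Hex returns the hex-encoded SHA-256 digest of the file at path.
func sha256Hex(path string) (string, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(data)
	return hex.EncodeToString(sum[:]), nil
}

func main() {
	// DefaultConfig reads VAULT_ADDR; the token comes from VAULT_TOKEN.
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	const pluginName = "my-plugin"                // placeholder
	const binaryPath = "/vault/plugins/my-plugin" // placeholder

	localSum, err := sha256Hex(binaryPath)
	if err != nil {
		log.Fatal(err)
	}

	// Compare the shipped binary's checksum with the registered one.
	registered, err := client.Sys().GetPlugin(&api.GetPluginInput{
		Name: pluginName,
		Type: api.PluginTypeSecrets,
	})
	if err == nil && registered.SHA256 == localSum {
		log.Println("plugin already up to date")
		return
	}

	// Re-register the plugin with the checksum of the new binary.
	if err := client.Sys().RegisterPlugin(&api.RegisterPluginInput{
		Name:    pluginName,
		Type:    api.PluginTypeSecrets,
		Command: pluginName,
		SHA256:  localSum,
	}); err != nil {
		log.Fatal(err)
	}
	log.Println("registered new plugin checksum")
}
```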

One of the possible upgrade scenarios is as follows:

  1. The StatefulSet is updated with a new Vault image.
  2. Vault-2, which was the leader, restarts; Vault-1 is elected the new leader.
  3. Vault-2’s init container finds that its plugin version differs from the currently registered one.
  4. Vault-2 registers the new plugin version.
  5. Vault-2 starts running the main container with the new Vault version and enters standby mode.
  6. Vault-1 restarts. Vault-0 becomes the active pod and the leader.
  7. Vault-0 cannot start running the plugin because it has the old binary that doesn’t match the new registered checksum.
  8. Vault-1 starts running the new Vault version and enters standby mode.
  9. Vault-0 restarts. Vault-2 is selected to be the leader pod.
  10. Vault-2 starts running the new plugin version.

In this scenario, there is downtime from step 4 (or even step 2) until step 10, because the leader pod cannot serve requests to the plugin (the checksum does not match). In the worst case this lasts up to 2 minutes. Sometimes Vault-2 is immediately elected leader again, in which case there is almost no downtime.

I’m wondering how we can improve the worst-case scenario to decrease the downtime. Thank you in advance.

PS: I found that a request to a leader pod running an old plugin version sometimes succeeds, while at other times the same request fails with the error “failed to run existence check (checksums did not match)”. What determines whether the request succeeds or fails?

Tantre
  • Hi Tantre, and welcome to Stack Overflow. Your question is written as an enhancement request, which should be submitted in HashiCorp's GitHub repository (or maybe at discuss.hashicorp.com). I will answer it as if you had asked "How do I make sure my custom HashiCorp Vault plugin is up to date on every node when upgrading?" – ixe013 Aug 30 '23 at 01:29
  • @ixe013 I asked in the discussions: https://discuss.hashicorp.com/t/custom-plugin-upgrade-in-kubernetes-statefulset/57367/2 Unfortunately, I didn't get any solution. – Tantre Aug 30 '23 at 06:42

1 Answer


You have everything you need to make this work with the current version of Vault. But you must play by the rules. Here they are, with links to a Makefile that I wrote to implement them:

  1. Every build of your plugin must have its version in the filename.
  2. Deploy your Vault image with both versions of the plugin to every node. It could be a container or a Linux service deployed with Ansible; it does not matter. Make it a single unit with everything in it.
  3. Wait (and retry if required) until the deployment is complete.
  4. Register and activate the new plugin (a sketch follows this list).
  5. Let Vault reach consensus on the new version to use.
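
To make rules 1 and 4 concrete, here is a rough sketch using the Vault Go client (github.com/hashicorp/vault/api) rather than my Makefile; the plugin name, version, and checksum below are placeholders:

```go
// Sketch of rules 1 and 4: register a plugin whose command is a
// versioned filename, then reload it cluster-wide. All names and
// values are placeholders.
package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/vault/api"
)

func main() {
	// DefaultConfig reads VAULT_ADDR; the token comes from VAULT_TOKEN.
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	const (
		pluginName = "my-plugin"              // placeholder
		version    = "1.2.3"                  // placeholder
		checksum   = "<sha256-of-the-binary>" // computed at build time
	)

	// Rule 1: the registered command points at a versioned filename
	// (my-plugin-1.2.3), so old and new binaries coexist on every node.
	command := fmt.Sprintf("%s-%s", pluginName, version)

	// Rule 4a: register the new version in the plugin catalog. The
	// catalog write goes through Vault's replicated storage.
	if err := client.Sys().RegisterPlugin(&api.RegisterPluginInput{
		Name:    pluginName,
		Type:    api.PluginTypeSecrets,
		Command: command,
		SHA256:  checksum,
	}); err != nil {
		log.Fatal(err)
	}

	// Rule 4b: reload the plugin so mounted backends pick up the new
	// binary; scope "global" asks every node in the cluster to reload.
	reloadID, err := client.Sys().ReloadPlugin(&api.ReloadPluginInput{
		Plugin: pluginName,
		Scope:  "global",
	})
	if err != nil {
		log.Fatal(err)
	}
	log.Println("reload requested, id:", reloadID)
}
```

The point of the versioned command is that the catalog entry always names a binary that is guaranteed to be present on every node, regardless of which pod is the leader when the entry is read.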

After you reach step 3, you are in a stable state where leader changes can happen without you having to worry about which plugin version is registered. Registering the new plugin is atomic and replicated through consensus, so either every node will get it, or none of them will.
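
If you want to confirm that the catalog update is visible before moving on, one option is to read the entry back; a minimal sketch with the same placeholder name as above:

```go
// Sketch: read the plugin catalog entry back to confirm the
// registration is visible. "my-plugin" is a placeholder name.
package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/vault/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}
	info, err := client.Sys().GetPlugin(&api.GetPluginInput{
		Name: "my-plugin",
		Type: api.PluginTypeSecrets,
	})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("command=%s sha256=%s\n", info.Command, info.SHA256)
}
```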

ixe013
  • What do you mean by "the current version of Vault"? We run Vault 1.11.6. Does it support multiple plugin versions? – Tantre Aug 30 '23 at 07:01
  • No, it does not. What I mean is that the steps above work with every version of Vault. I would even argue that *this is a software release/distribution problem*, not specific to Vault. You can't have your configuration rely on something that you can't guarantee is going to be present. – ixe013 Aug 30 '23 at 13:39