Hello, I have somewhat of a unique problem. It's rather lengthy to explain, but I think solving it would expand the use cases of Kubernetes. I believe I know how to solve it, but I'm not sure whether Kubernetes StatefulSets support the solution. Let me lay out the problem domain, the problem itself, and then some candidate solutions, and maybe someone can help fill the gaps.

The Domain Space:

  • I have a set of accounts (external to Kubernetes): {Account_A, Account_B, Account_C, etc.}
  • Accounts can become active or inactive at any time (important: in NO PARTICULAR ORDER).
  • When an account is activated, a pod is deployed to serve it, with a persistent volume holding all of that account's workspace/data. The account is then addressed via its pod's unique identifier and IP.
  • When an account is deactivated, the pod is removed but the data persists, so that the next time the account is activated, the new pod binds to the same persistent volume claim and therefore has access to its previous data.
  • When an account is reactivated, a pod is redeployed that uses the previous persistent volume claim to resume working on the data from previous sessions.

Looking at the available Kubernetes tools/objects, a StatefulSet with a headless service is obviously the ideal way of approaching this. It supports unique pods, which are assigned unique IPs, and it supports persistent volumes. It also supports dynamically provisioning persistent volumes through volumeClaimTemplates.

The Problem:

As mentioned above, accounts can be activated in any order, but StatefulSet pods are ordinal: pod_1 must exist for pod_2 to exist, pod_2 must exist for pod_3 to exist, and so on. We can't have pod_1 and pod_3 active while pod_2 is inactive. This means that if I enable Account_A and then Account_C, a pod named pod_1 is created, followed by a pod named pod_2.

Now you might say that this isn't a problem: we just keep a map from each account to its pod number, for example Account_A -> pod_1 and Account_C -> pod_2.

Why is this a problem? Because when you specify a volumeClaimTemplate in a StatefulSet, the persistent volume claims it creates are named after the pod. That means only a pod with the same name can ever access the same data; the data (volumes) is bound to the pod's name rather than to the account. This creates a disconnect between accounts and their persistent volumes: any pod named pod_2 will always get the data pod_2 has always had, regardless of which account is currently "mapped" to pod_2.
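
To make this concrete, here is a minimal sketch of the kind of StatefulSet described above (the names `account-set` and `workspace`, the placeholder image, and the `nfs-client` storage class are illustrative assumptions, not anything prescribed). The comment at the bottom shows how the generated claim names couple storage to pod ordinals:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: account-set
spec:
  serviceName: account-set        # headless service giving each pod a stable DNS name
  replicas: 2
  podManagementPolicy: Parallel   # drops start/stop ordering, but names stay ordinal
  selector:
    matchLabels:
      app: account
  template:
    metadata:
      labels:
        app: account
    spec:
      containers:
        - name: workspace
          image: my-account-image         # placeholder image
          volumeMounts:
            - name: workspace
              mountPath: /data
  volumeClaimTemplates:
    - metadata:
        name: workspace
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: nfs-client      # assumed dynamic provisioner
        resources:
          requests:
            storage: 1Gi
# Generated claims are named <template>-<pod>: workspace-account-set-0,
# workspace-account-set-1, ... -- bound to the ordinal, never to an account.
```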

Let me further illustrate this with an example:

1. Account_A=disabled, Account_B=disabled, Account_C=disabled (start state: all accounts disabled)
2. Account_A=enabled, Account_B=enabled, Account_C=enabled (all accounts enabled)
    pod_1 is created (with volume_1) and mapped to Account_A
    pod_2 is created (with volume_2) and mapped to Account_B
    pod_3 is created (with volume_3) and mapped to Account_C
3. Account_A=disabled, Account_B=disabled, Account_C=disabled (all accounts disabled)
    pod_1 is deleted, volume_1 persists
    pod_2 is deleted, volume_2 persists
    pod_3 is deleted, volume_3 persists
4. Account_A=enabled, Account_B=disabled, Account_C=enabled (re-enable A and C but leave B disabled)
    pod_1 is created (with volume_1) and mapped to Account_A (THIS IS FINE)
    pod_2 is created (with **volume_2**) and mapped to Account_C (THIS IS **NOT** FINE)

Can you see the issue? Account_C is now using the data store that belongs to Account_B (volume_2 was created and used by Account_B, not Account_C), because volumes/claims are bound to pod names, and pods must be ordinal, i.e. pod_1 then pod_2.

Potential Solutions:

  1. Support custom, non-ordinal names for pods in a StatefulSet. (Simplest and most effective.)

    This solves everything and keeps the benefits and tooling of StatefulSets. If I could name pods whatever I want at launch, then when an account is enabled I would just launch a pod with that account's name, and the volume that is created would be bound to any pod with that same name. I've looked and can't seem to find a way to do this.

    (P.S. I know StatefulSets are supposed to be ordinal for ordering guarantees, but you can turn the ordering off with "podManagementPolicy: Parallel". The naming stays ordinal either way.)

  2. Some way to do this with labels and selectors instead?

    I'm rather new to Kubernetes and don't fully understand all the moving parts yet. Maybe there's some way to use labels in my volumeClaimTemplate so that claims attach to volumes with a certain label, i.e. Account_C mapped to pod_2 can request volume_3 because volume_3 has the label account=Account_C. I'm currently looking into this. If it helps, my persistent volumes are provisioned dynamically using this tool: https://github.com/kubernetes-incubator/external-storage/tree/master/nfs-client Maybe I can modify it so that it adds certain labels to the persistent volumes it creates.
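
    For reference, a claim can be bound to a pre-provisioned PersistentVolume by label, as sketched below (all names are illustrative assumptions). Two caveats: a claim that carries a selector is not dynamically provisioned, so the nfs-client provisioner would need to be modified to create labeled PVs up front, and a single volumeClaimTemplate cannot vary its selector per pod, so the claims would have to be created outside the template:

    ```yaml
    # Pre-provisioned volume labeled with its account (illustrative names).
    apiVersion: v1
    kind: PersistentVolume
    metadata:
      name: volume-account-c
      labels:
        account: account-c
    spec:
      capacity:
        storage: 1Gi
      accessModes: ["ReadWriteOnce"]
      nfs:
        server: nfs.example.com    # assumed NFS server
        path: /exports/account-c
    ---
    # Claim that binds only to a PV carrying the matching label.
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: claim-account-c
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: ""         # empty class disables dynamic provisioning here
      selector:
        matchLabels:
          account: account-c
      resources:
        requests:
          storage: 1Gi
    ```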

  3. Ditch StatefulSets and Deployments and just add pods to the cluster manually

    This is not a great solution, since according to the docs pods shouldn't really exist without a StatefulSet, Deployment, or similar controller as a parent, and it also gives up built-in functionality around persistent volumes, dynamic volume provisioning, etc. For me the dealbreaker is losing volumeClaimTemplates, which create or bind to an existing claim when a pod is deployed. If I could recreate that behavior somehow, this solution would work.
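
    For what it's worth, the create-or-bind behavior of volumeClaimTemplates can be approximated by hand: a PVC named after the account is dynamically provisioned the first time it is created and simply rebinds on later activations. A sketch, with assumed names and storage class:

    ```yaml
    # Created once per account; dynamic provisioning runs on first creation,
    # and the claim (with its data) survives pod deletion.
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: workspace-account-c
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: nfs-client   # assumed dynamic provisioner
      resources:
        requests:
          storage: 1Gi
    ---
    # Created on activation, deleted on deactivation; the claim persists.
    apiVersion: v1
    kind: Pod
    metadata:
      name: account-c
    spec:
      containers:
        - name: workspace
          image: my-account-image    # placeholder image
          volumeMounts:
            - name: workspace
              mountPath: /data
      volumes:
        - name: workspace
          persistentVolumeClaim:
            claimName: workspace-account-c
    ```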

  4. Create a custom Kubernetes object to do this for me

    This is not ideal, since it would be a lot of work and I wouldn't even know where to begin. I would also be recreating almost exactly what a StatefulSet does, just without the ordinal mapping. I would have to figure out how to write operators, ReplicaSets, etc. It seems like overkill for a rather simple problem.

  5. Have the persistent storage be mounted from within the pod's container

    This is a last resort, since it completely removes the need for Kubernetes. It also means I have to send connection information to the container within the pod, which opens up an entire can of worms with security and authentication.

I will update with anything else I find or think of. Thanks to all who help.

user2896438
  • Can't you just write all of a pod's persistent data under {directory}/{pod_name}? When a pod is deactivated, it exports a {directory}/{pod_name} archive to a storage location, and when the account mapped to that pod is reactivated, it downloads the archived data and restores it. – Alexandre Daubricourt Feb 21 '20 at 23:16
  • What about using separate StatefulSets for each account? – Patrick W Feb 21 '20 at 23:56
  • @AlexandreDaubricourt Yes, but I feel that would not be the right approach. What happens if a pod fails due to an error? Proper shutdown doesn't occur and data is lost. In potential solution #5 I mention having the pod handle its own storage, but I want to leverage Kubernetes and its built-in functionality as much as possible. NFS is rather lacking in security, and I have to use NFS, so I would prefer that Kubernetes handle the data mounting so that users can't arbitrarily access other data from within their pod, so to speak. – user2896438 Feb 22 '20 at 00:23
  • @PatrickW Yeah, I thought about that, but isn't that a lot of overhead per pod? I mean, each pod having its own full-blown replication controller, replica set, etc. The goal of this project is to be able to scale as much as possible. Do you know how much extra overhead this would cause? Maybe it's negligible and I'm overthinking? – user2896438 Feb 22 '20 at 00:26
  • It does add additional overhead, but then again so does adding CRDs. Since there is no out-of-the-box solution, any fix will have additional overhead. – Patrick W Feb 22 '20 at 00:28
  • Yeah, it just seems like such an obvious feature, applicable to so many use cases, that I feel there should be an out-of-the-box solution. I was kinda hoping someone would read this and point out something I was obviously missing (I learned all of Kubernonsense in about a day, so my knowledge is still lacking). I'll go feature-request it, I guess. Thanks for the replies. – user2896438 Feb 22 '20 at 00:39

2 Answers


It seems to me that you're convinced StatefulSets are a step in the right direction, but that's not entirely true.

StatefulSets have ordinality for two reasons:

  • Creating ordered PersistentVolumeClaims
  • Being able to create FQDN endpoints for individual pods (using a headless service)

In your case, neither seems to apply. You just need stable storage per account. While you consider #4 of your potential solutions the least ideal, it is the most "Kubernetes native" way to do this.

Solution

You need to write a component that manages a StatefulSet, or even a Deployment, per account. I say Deployment because you don't need stable network identifiers for each pod; a ClusterIP Service per account will be adequate for communication.
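
Concretely, the per-account resources such a component would manage might look something like this (a sketch with assumed names, reusing a per-account claim like the one the question describes):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: account-c
spec:
  replicas: 1
  selector:
    matchLabels:
      account: account-c
  template:
    metadata:
      labels:
        account: account-c
    spec:
      containers:
        - name: workspace
          image: my-account-image            # placeholder image
          volumeMounts:
            - name: workspace
              mountPath: /data
      volumes:
        - name: workspace
          persistentVolumeClaim:
            claimName: workspace-account-c   # per-account claim, outlives the pod
---
# Stable virtual IP and DNS name for the account, independent of pod restarts.
apiVersion: v1
kind: Service
metadata:
  name: account-c
spec:
  type: ClusterIP
  selector:
    account: account-c
  ports:
    - port: 80
      targetPort: 8080                       # assumed container port
```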

In the Kubernetes world, these components are called controllers (when they don't use custom objects) and operators (when they use custom objects to manage applications).

You can start by looking into operator-sdk and controller-runtime. Operator SDK is a framework that aggregates commonly used functionality on top of controller-runtime. It also makes developers' lives easier by incorporating kubebuilder, which generates the CRD and Kubernetes API code for custom objects. All you need to define are the structs for your custom object and a controller.

Take a look at Operator SDK, you'll find that creating and managing custom objects is not that hard.

Custom object based flow for your problem

This is how I imagine the flow of your operator, from what I understood of your write-up.

  • One Account object maps to one account. Each object has unique metadata that maps it to its account, and it should also have an active: boolean field in its spec (see the sample manifest after this list).
  • Watch for custom Account objects.
  • Whenever you need to create a new account, use the Kubernetes API to create a new Account object (this triggers an Add event in the controller), and then your controller should:

    • Create/Update a PersistentVolumeClaim for the account
    • Create/Update the Deployment, with the volume from the created PVC specified in the Pod template
    • Catch: Add events are also received for old objects when the controller restarts, so the action taken should be "create or update".
  • Set the active field in your custom object to false to deactivate the account (a Modify event in the controller), and then your controller should:

    • Delete the Deployment without touching the volume at all.
  • Set the active field to true to reactivate the account (a Modify event again), and then your controller should:
    • Recreate the Deployment with the same volume specified in the Pod template.
  • Delete the Account object to clean up the underlying resources.
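
A custom object for this flow can stay very small; something like the following, where the API group and version are made up for illustration:

```yaml
apiVersion: accounts.example.com/v1alpha1   # hypothetical API group/version
kind: Account
metadata:
  name: account-c
spec:
  active: true   # the controller reconciles on changes; flip to false to deactivate
```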

While all of this might not make perfect sense right away, I would still suggest you go through operator-sdk's docs and examples. IMO, that would be a leap in the right direction.

Cheers!

ashu
  • Thanks for helping point me in the right direction. This made a lot of sense and was explained well. Also providing tools to properly develop custom objects is very useful, as otherwise I would have probably gotten lost searching for the right tools/starting point. I will definitely implement a custom controller/operator to automate the proper workflow, as well as custom objects to work with. – user2896438 Feb 22 '20 at 06:34
  • One final question: what do you think of using Jobs instead of Deployments as the pod parent? These accounts will typically only run for a few hours per day, and we could model setting an account to active as job dispatch, and disabling (or logging out of) an account as job completion. This would maintain the replication requirement using native mechanisms (job restarts). They could all run off a headless service to expose and provide unique access. I would still create a custom controller as you suggested to do the proper resource provisioning (volumes), teardown, etc. – user2896438 Feb 22 '20 at 06:34
  • Also, the logout/disable event is generated from within the container/pod anyway, so simply exiting the container process successfully could signal job completion, and then my controller could watch for job completion and set the account status to disabled. – user2896438 Feb 22 '20 at 06:34
  • Finally, the question: do you think this is a stretch/abuse of what a Job is supposed to be? In my mind it eliminates the overhead of an entire Deployment per account, and a NodePort Service per Deployment as well. A quick, simple answer of yes or no would make my day; no need to go in depth. – user2896438 Feb 22 '20 at 06:38
  • @user2896438 You can definitely choose to use Jobs instead of Deployments, but you'd have to ensure that the main process of the Pod doesn't exit before you want it to, to prevent the Pod from being removed. As for the overhead part, a Job and a Deployment would both be deployed as an object, so the overhead is similar. It's totally up to you. And about the NodePort bit, you can use `ClusterIP` with [Ingress](https://kubernetes.io/docs/concepts/services-networking/ingress/). That way it'd be easier to manage, IMO. Again, I am not sure of the use case, so `NodePort` might as well be more suited. – ashu Feb 22 '20 at 07:07
  • To anyone following this discussion wanting to get into operators, these were the most helpful resources in getting an operator up and running: https://www.youtube.com/watch?v=8_DaCcRMp5I https://learn.openshift.com/operatorframework/go-operator-podset/ – user2896438 Feb 24 '20 at 09:26

After a few days of deep diving into operators, @Ashu's answer is the best solution to the problem, and it opens Kubernetes up to solving almost any scenario one might want implemented.

Below are the most helpful resources for learning operators as of early 2020:

  • https://www.youtube.com/watch?v=8_DaCcRMp5I
  • https://learn.openshift.com/operatorframework/go-operator-podset
  • https://itnext.io/a-practical-kubernetes-operator-using-ansible-an-example-d3a9d3674d5b

I strongly recommend going through these resources fully (and coding alongside) before attempting to create your own operator. Also, if you're "newer" to Golang, definitely go through the Ansible approach, EVEN IF you want to make your own Golang operator. Ansible's approach is more intuitive, and the concepts 'clicked' very quickly when playing around with it.

As for Golang vs Ansible:

Golang: slightly more control, but much more complexity, tedium, and nuance.

Ansible: very intuitive; approaches operators in a high-level, Kubernetes-native way; modular/reusable.

Also, the #kubernetes-operators Slack channel is invaluable.

user2896438