Terraform: state management for multi-tenancy

Question

We are in the process of evaluating Terraform to (partially) replace our Ansible provisioning process for a multi-tenancy SaaS. We appreciate the convenience, performance and reliability of Terraform: we can handle infrastructure changes (adding/removing) smoothly while keeping track of infra state (that's very cool).

Our application is a multi-tenancy SaaS for which we provision separate instances for each customer. In Ansible we have our own dynamic inventory (much like the EC2 dynamic inventory). We have gone through lots of Terraform books, tutorials and best practices, and many suggest that the states of multiple environments should be managed separately and remotely in Terraform, but all of them assume static environments (like Dev/Staging/Prod).

Is there any best practice or real example of managing a dynamic inventory of states for multi-tenancy apps? We would like to track the state of each customer's set of instances and propagate changes to them easily.

One approach might be to create a directory for each customer and place *.tf files inside it, which call our module hosted somewhere global. State files could be put in S3; this way we can propagate changes to each individual customer if needed.

score 12 · Answer 1 · edited May 23 '17 at 11:54

Terraform works on a folder level, pulling in all .tf files (and by default a terraform.tfvars file).

So we do something similar to Anton's answer but do away with some complexity around templating things with sed. So as a basic example your structure might look like this:

$ tree -a --dirsfirst
.
├── components
│   ├── application.tf
│   ├── common.tf
│   ├── global_component1.tf
│   └── global_component2.tf
├── modules
│   ├── module1
│   ├── module2
│   └── module3
├── production
│   ├── customer1
│   │   ├── application.tf -> ../../components/application.tf
│   │   ├── common.tf -> ../../components/common.tf
│   │   └── terraform.tfvars
│   ├── customer2
│   │   ├── application.tf -> ../../components/application.tf
│   │   ├── common.tf -> ../../components/common.tf
│   │   └── terraform.tfvars
│   └── global
│       ├── common.tf -> ../../components/common.tf
│       ├── global_component1.tf -> ../../components/global_component1.tf
│       ├── global_component2.tf -> ../../components/global_component2.tf
│       └── terraform.tfvars
├── staging
│   ├── customer1
│   │   ├── application.tf -> ../../components/application.tf
│   │   ├── common.tf -> ../../components/common.tf
│   │   └── terraform.tfvars
│   ├── customer2
│   │   ├── application.tf -> ../../components/application.tf
│   │   ├── common.tf -> ../../components/common.tf
│   │   └── terraform.tfvars
│   └── global
│       ├── common.tf -> ../../components/common.tf
│       ├── global_component1.tf -> ../../components/global_component1.tf
│       └── terraform.tfvars
├── apply.sh
├── destroy.sh
├── plan.sh
└── remote.sh

Here you run your plan/apply/destroy from the root level. The wrapper shell scripts handle things like cd'ing into the directory and running terraform get -update=true, and they also run terraform init for the folder so that each folder gets a unique state file key in S3, allowing you to track state for each folder independently.
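As a rough illustration of what such a wrapper might look like, here is a minimal sketch. It is not the actual plan.sh from this setup: the function names, the key scheme (environment/location/terraform.tfstate) and the TERRAFORM override variable are all assumptions made for the example, and it presumes that common.tf declares an S3 backend whose key is left to be supplied at init time.

```shell
#!/bin/sh
# Hypothetical sketch of a plan.sh-style wrapper; names and key scheme are assumptions.

# Terraform binary; overridable so the sequence can be dry-run with TERRAFORM=echo.
TERRAFORM="${TERRAFORM:-terraform}"

# Derive a unique S3 state key from the folder path,
# e.g. production customer1 gives production/customer1/terraform.tfstate.
state_key() {
  printf '%s/%s/terraform.tfstate\n' "$1" "$2"
}

# plan_location ENV LOCATION
# cd into the location folder and run Terraform there. The body runs in a
# subshell, so the caller's working directory is left untouched.
plan_location() (
  set -e
  env_name="$1"
  location="$2"
  cd "$env_name/$location"
  # Assumes the folder's backend block leaves "key" to be passed in here.
  "$TERRAFORM" init -backend-config="key=$(state_key "$env_name" "$location")"
  "$TERRAFORM" get -update=true
  # terraform.tfvars in the folder is loaded automatically.
  "$TERRAFORM" plan
)
```

From the repository root, `plan_location production customer1` would then act only on that folder and its state; setting `TERRAFORM=echo` first prints the commands instead of running them.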

The above solution has generic modules that wrap resources to provide a common interface to things (for example our EC2 instances are tagged in a specific way depending on some input variables and also given a private Route53 record) and then "implemented components".

These components contain a bunch of modules/resources that Terraform applies together in the same folder. So we might put an ELB, some application servers and a database under application.tf, and then symlinking that into a location gives us a single place to control with Terraform. Where a location needs different resources, those are split off into their own files. In the above example you can see that production/global has a global_component2.tf that isn't present in staging/global. The same mechanism works in the other direction: something that should only be applied in non-production environments, such as some network control to prevent internet access to the environment, would be symlinked only into those folders.
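For a multi-tenant setup where customers come and go, creating a location folder can itself be scripted. The following is only a sketch under the layout shown in the tree; add_location is a hypothetical helper name, not part of the scripts described in this answer.

```shell
#!/bin/sh
# Hypothetical helper: create a location folder and symlink components into it.

# add_location ENV LOCATION COMPONENT...
add_location() (
  set -e
  env_name="$1"
  location="$2"
  shift 2
  mkdir -p "$env_name/$location"
  cd "$env_name/$location"
  for component in "$@"; do
    # Relative link, matching the tree:
    # application.tf -> ../../components/application.tf
    ln -sf "../../components/$component" "$component"
  done
  # Per-location values live in terraform.tfvars; create it empty if missing.
  [ -e terraform.tfvars ] || : > terraform.tfvars
)
```

Run from the repository root, `add_location production customer3 application.tf common.tf` would produce a folder shaped like customer1 and customer2, leaving only terraform.tfvars to fill in.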

The real benefit here is that everything is easily viewable in source control for developers directly rather than having a templating step that produces the Terraform code you want.

It also helps to follow DRY: the only real differences between the environments are in the terraform.tfvars files in each location. That makes it easier to test changes before putting them live, as each folder is pretty much the same as the others.
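Because the components are plain symlinks, differences between two locations can be checked directly from the file system. A small sketch, with hypothetical function names:

```shell
#!/bin/sh
# Hypothetical helpers for comparing which components two locations link in.

# Print the names of the symlinked .tf components in a location folder, sorted.
components_in() {
  for f in "$1"/*.tf; do
    [ -L "$f" ] && basename "$f"
  done | sort
}

# only_in A B: components linked into folder A but not into folder B.
only_in() {
  components_in "$1" | while read -r name; do
    [ -L "$2/$name" ] || printf '%s\n' "$name"
  done
}
```

With the tree shown above, `only_in production/global staging/global` would print global_component2.tf, and the same call on production/customer1 and staging/customer1 would print nothing.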

With this approach you would be running terraform inside each folder or from the root? I'm asking because depending on that, the state files might be stored in the root path or in each folder. — Luis Ortega Araneda, Sep 20 '17 at 14:42
You can't run Terraform from a parent folder. Terraform only works with what's in the current directory. As it happens we have some helper scripts that are at the root of the repo that `cd` into the location we want to act on and then run `terraform` CLI commands from there. — ydaetskcoR, Sep 20 '17 at 14:48
Yes you can, I do it all the time... `terraform plan path/to/something` — Luis Ortega Araneda, Sep 20 '17 at 15:30
But thanks. I get it, with a script and doing `cd` into a folder, I get a state file inside each folder, which is what I want. There is a flag that could also place the state-file in the folder, from the root `terraform plan path/to/something -state=path/to/something`. — Luis Ortega Araneda, Sep 20 '17 at 15:32
How would we implement this solution if we have to use Gitlab CI/CD with Hashicorp Vault? How will the pipelines switch context? would it not make it complex and a sitting time bomb to mess up things? Also, If we use Azure Storage Accounts, how can we safe guard that the state files are not mixed up and are properly secure with backup/recovery in place? — Buggy B, May 02 '21 at 01:25
@BuggyB That feels like a separate, tighter scoped question. You might find it useful to link back to this question in it though. — ydaetskcoR, May 03 '21 at 17:10
Thank you for this great blueprint! Could you provide an example of resources you would define in _global_? Are those resources that get shared by all customer deployments, e.g. if you wanted to deploy all staging environments into a single VPC? — oschlueter, May 27 '21 at 13:40
Yeah if there's any shared resources then that would be included there. If you don't have any shared resources at all and it's completely shared nothing architecture then you don't need that. — ydaetskcoR, May 27 '21 at 17:23

score 2 · Answer 2 · edited Apr 04 '17 at 12:46

Your suggested approach sounds right to me, but there are a few more things you may want to consider.

Keep the original Terraform templates (_template in the tree below) as a versioned artifact (a git repo, for example) and just pass key-value properties to recreate your infrastructure. This way you will have very little copy-pasted Terraform configuration code lying around in directories.

This is how it looks:

/tf-infra
├── _global
│   └── global
│       ├── README.md
│       ├── main.tf
│       ├── outputs.tf
│       ├── terraform.tfvars
│       └── variables.tf
└── staging
    └── eu-west-1
        ├── saas
        │   ├── _template
        │   │   └── dynamic.tf.tpl
        │   ├── customer1
        │   │   ├── auto-generated.tf
        │   │   └── terraform.tfvars
        │   ├── customer2
        │   │   ├── auto-generated.tf
        │   │   └── terraform.tfvars
...

Two helper scripts are needed:

Template rendering. Use either sed to generate the module's source attribute or a more powerful tool (as is done, for example, in airbnb/streamalert).
Wrapper script. Running terraform plan or terraform apply with -var-file=... is usually enough.
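The sed-based rendering step could look roughly like the sketch below. The placeholder token __MODULE_SOURCE__ and both function names are assumptions made for the example; the real dynamic.tf.tpl may use a different convention.

```shell
#!/bin/sh
# Hypothetical rendering helpers; the __MODULE_SOURCE__ token is an assumption.

# render_template TEMPLATE MODULE_SOURCE
# Writes the rendered template to stdout.
render_template() {
  # Escape characters that are special in a sed replacement (&, \, and the
  # '|' delimiter) so that sources such as git URLs pass through unchanged.
  escaped=$(printf '%s' "$2" | sed 's/[&|\\]/\\&/g')
  sed "s|__MODULE_SOURCE__|$escaped|g" "$1"
}

# render_customer SAAS_DIR CUSTOMER MODULE_SOURCE
# Renders SAAS_DIR/_template/dynamic.tf.tpl into the customer's folder.
render_customer() (
  set -e
  saas_dir="$1"
  customer="$2"
  module_source="$3"
  mkdir -p "$saas_dir/$customer"
  render_template "$saas_dir/_template/dynamic.tf.tpl" "$module_source" \
    > "$saas_dir/$customer/auto-generated.tf"
)
```

Pinning the module source to a tag or commit in the third argument lets each customer folder be moved to a new module version individually.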

Shared Terraform state files, including those for resources which should be global (directory _global above), can be stored on S3 so that other layers can access them.

PS: I am very much open for comments on the proposed solution, because this is an interesting task to work on :)
