
I am trying to set up a VPC architecture for different environments (dev/test/pre-prod/prod) and I am running into a restriction on Elastic IP limits. It would be great to know whether the architecture is heading in the right direction in the first place, so let me explain the details here:

  1. 1 VPC for all environments with 1 Internet Gateway
  2. VPC in one region
  3. 3 Availability Zones with 1 private subnet and 1 utility subnet for each (total of 6 subnets)
  4. 3 NAT Gateways - one for each utility subnet with 3 Elastic IPs assigned to their network interfaces
  5. EC2 Instances (master and node) in each private subnet
  6. Virtual private gateway to connect to corporate network
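The per-AZ NAT gateway portion of this layout can be sketched in Terraform roughly as follows. The variable and resource names are my own illustration, not taken from the actual config, and I'm using current HCL syntax rather than the 0.11-era interpolation style:

```hcl
# Hypothetical names; one utility subnet, EIP, and NAT gateway per AZ.
variable "azs" {
  default = ["us-east-1a", "us-east-1b", "us-east-1c"]
}

resource "aws_subnet" "utility" {
  count             = length(var.azs)
  vpc_id            = aws_vpc.main.id
  availability_zone = var.azs[count.index]
  cidr_block        = cidrsubnet(aws_vpc.main.cidr_block, 8, count.index)
}

# One EIP per AZ -- these three allocations are what eat into the
# default quota of 5 EIPs per region for each environment deployed.
resource "aws_eip" "nat" {
  count = length(var.azs)
  vpc   = true # newer provider versions use `domain = "vpc"` instead
}

resource "aws_nat_gateway" "nat" {
  count         = length(var.azs)
  allocation_id = aws_eip.nat[count.index].id
  subnet_id     = aws_subnet.utility[count.index].id
}
```

With this shape, each environment stamped out from the same script consumes 3 EIPs, so the second environment pushes past the default limit of 5.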

I am using Terraform to automate this whole infrastructure as code (this doesn't matter too much here). When I run the Terraform script for one environment (let's say dev), the whole infrastructure detailed above is created and works fine. But when I then run the script for another environment (say test), I run out of Elastic IPs, because there is a default limit of 5 EIPs per region.

What's the best way to re-architect this so I can create infrastructure for different environments while not hitting these EIP limits?

Thanks much for your help. Please let me know if more details needed.

Regards, Abdul

Basith
  • If you really need EIP's for all of your instances then you can request a limit increase from AWS support. https://docs.aws.amazon.com/general/latest/gr/aws_service_limits.html – Briansbum Jan 30 '18 at 10:21
  • Thanks Briansbum. Currently, the Elastic IPs are assigned to the network interfaces in the public subnet. So my question is: Am I doing the right thing of assigning 3 Elastic IPs for each environment I am creating? Is there a better way around it? – Basith Jan 30 '18 at 10:29
  • If you don't _need_ HA on the NAT gateways you could get away with a single NAT gateway per VPC. This will be fine until the AZ containing your NAT gateway fails, at which point the other AZs have no way to egress to the internet (or whatever route traverses the NAT gateway). NAT gateways are highly available within the AZ itself, so you only need to worry about the AZ failure case, which should be rare enough not to really worry about outside of production. But ultimately you just need to ask AWS to increase your EIP limit for your account+region. – ydaetskcoR Jan 30 '18 at 10:31
  • Thanks @ydaetskcoR. So I can create a NAT Gateway in AZ1 in its public subnet and then allow instances from private subnets of all AZs to talk to the public subnet in AZ1? Is that feasible? This should work for other environments like dev/test/pre-prod/staging but I believe I would still run into the EIP limit issue (3 for prod, 1 for each environment)? I can request for limit increase if that's the way to go but trying to understand if it's the right solution because this is a very common deployment scenario? – Basith Jan 30 '18 at 10:38
  • Yeah basically that. And you just want to have the 0.0.0.0/0 route for all the private subnets go to the single NAT gateway in the VPC – ydaetskcoR Jan 30 '18 at 10:40
  • Thanks @ydaetskcoR – Basith Jan 30 '18 at 10:41
  • @ydaetskcoR. Thinking about it, if that NAT gateway goes down, all the instances will be affected and will become a single point of failure? Hmm any other alternatives? Thanks for your help. – Basith Jan 30 '18 at 10:47
  • It's if the AZ goes down. The NAT gateways are HA inside their AZ (see https://docs.aws.amazon.com/AmazonVPC/latest/UserGuide/vpc-nat-comparison.html) but yeah, if the AZ containing your NAT gateway fails then the other AZs that shouldn't be impacted are now impacted by the NAT gateway being missing. For me this is fine outside of production, but it's a decision you have to make rather than anything anyone outside can tell you. – ydaetskcoR Jan 30 '18 at 10:51
  • Completely unrelated to EIPs: you really should consider one VPC per environment. There are a couple of benefits to this, but the biggest one is that you can't accidentally configure your dev/test servers to connect to your prod database. – kdgregory Jan 30 '18 at 11:38
  • @kdgregory: I was thinking about it. But there are also two things to consider when we go for that solution: 1. Data transfer costs between VPCs from private subnet instances to NAT gateway 2. Billing complexities even though it's consolidated – Basith Jan 30 '18 at 11:42
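The single shared NAT gateway that ydaetskcoR describes in the comments above could be sketched like this. Again the resource names are hypothetical, and this assumes the three private subnets are created with `count` as `aws_subnet.private`:

```hcl
# One EIP and one NAT gateway for the whole VPC, placed in AZ1's
# utility subnet.
resource "aws_eip" "nat" {
  vpc = true
}

resource "aws_nat_gateway" "shared" {
  allocation_id = aws_eip.nat.id
  subnet_id     = aws_subnet.utility[0].id # utility subnet in AZ1
}

# All private subnets share one route table whose default route
# (0.0.0.0/0) points at the single NAT gateway.
resource "aws_route_table" "private" {
  vpc_id = aws_vpc.main.id

  route {
    cidr_block     = "0.0.0.0/0"
    nat_gateway_id = aws_nat_gateway.shared.id
  }
}

resource "aws_route_table_association" "private" {
  count          = 3
  subnet_id      = aws_subnet.private[count.index].id
  route_table_id = aws_route_table.private.id
}
```

This drops each environment from 3 EIPs to 1, at the cost of the AZ-failure exposure discussed above.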

2 Answers


I would suggest that each environment be managed in its own AWS account, rather than mingling all the environments in one account. The additional separation is quite easy once you have automated the infrastructure, and it gives you an extra level of security and isolation between environments: a compromise of one environment would not affect the others.

We keep three environments this way: production, development, and a failsafe environment. The failsafe account contains production backups in a different region.

There are multiple benefits to separating environments by accounts: for example, you do not need to give everyone production access, and you can restrict certain resources to certain environments. Each account also gets its own independent service limits, so the environments no longer compete for the same pool of 5 EIPs.
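One way to target separate accounts from a single Terraform codebase is a provider per account. The account IDs and role name here are placeholders, not real values:

```hcl
# Hypothetical account IDs and role name -- each environment lives in
# its own AWS account, reached via an assumed role.
provider "aws" {
  alias  = "dev"
  region = "us-east-1"

  assume_role {
    role_arn = "arn:aws:iam::111111111111:role/terraform"
  }
}

provider "aws" {
  alias  = "prod"
  region = "us-east-1"

  assume_role {
    role_arn = "arn:aws:iam::222222222222:role/terraform"
  }
}

# Resources then opt in to an account explicitly:
resource "aws_vpc" "dev" {
  provider   = aws.dev
  cidr_block = "10.0.0.0/16"
}
```

Because the EIP limit is per account and region, each environment's 3 EIPs count against its own quota rather than a shared one.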

Rodrigo Murillo
  • Thanks @Rodrigo M. Glad to hear that you have such a system in place. Will definitely take this into consideration. – Basith Jan 30 '18 at 11:53
  • You bet. I wish I had done this sooner in my environments. There are multiple benefits eg you do not need to give everyone production access, and you can specify some resources only be created in certain environments. Plus Code Spaces https://arstechnica.com/information-technology/2014/06/aws-console-breach-leads-to-demise-of-service-with-proven-backup-plan/ – Rodrigo Murillo Jan 30 '18 at 12:02
  • That CodeSpaces tale is quite frightening but a good share here. Cheers.. – Basith Jan 30 '18 at 12:10

As mentioned in the comments, the EIP limit you are hitting is simply the AWS default service limit for EIPs, so you should ask AWS to raise it. Running separate workloads in separate AWS accounts, as Rodrigo M suggests, is another way around service limits and is also a good idea for many other reasons, as listed in his answer.

As also discussed, you might want to consider running only a single NAT gateway in non-production VPCs, as this will reduce your costs (as well as the number of EIPs you need).
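Since the same Terraform script is being run for every environment, this choice can be made conditional. A minimal sketch, assuming a `var.environment` variable and per-AZ utility subnets (names are illustrative):

```hcl
# Hypothetical variable: prod gets one NAT gateway per AZ for HA,
# every other environment gets a single shared one.
variable "environment" {}

locals {
  nat_gateway_count = var.environment == "prod" ? length(var.azs) : 1
}

resource "aws_eip" "nat" {
  count = local.nat_gateway_count
  vpc   = true
}

resource "aws_nat_gateway" "nat" {
  count         = local.nat_gateway_count
  allocation_id = aws_eip.nat[count.index].id
  subnet_id     = aws_subnet.utility[count.index].id
}
```

With this, four environments consume 3 + 1 + 1 + 1 = 6 EIPs instead of 12, which only needs a modest limit increase (or none at all, if split across accounts).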

NAT gateways are highly available inside the availability zone they are placed in, but obviously not across the region. This means that if the single AZ that happens to contain your NAT gateway fails, your other AZs lose outbound connectivity through it, spreading the failure beyond the logically separated AZs. If instead you have a NAT gateway in every AZ, an AZ failure only impacts that single AZ (which is obviously completely down anyway).

For me, that reduced HA is fine in non-production environments and saves $65 a month per non-production VPC. In production, however, I'm happy to eat that small extra cost to limit the damage caused by an AZ failure, along with all the other work I do to avoid single points of failure.

ydaetskcoR
  • Good points. As I suggested in my answer, with proper tooling and environment separation, certain resources can be constrained or omitted in certain environments. Lower levels of HA in a non-production environment is certainly a good way to reduce costs. – Rodrigo Murillo Jan 30 '18 at 15:32
  • Yeah, definitely, splitting environments across AWS accounts is definitely worth doing for a myriad of reasons but raising your (soft) service limits is probably the least of these ;) – ydaetskcoR Jan 30 '18 at 15:43