Background:

I have an existing blue/green setup that uses two load balancers (blue and green) and swaps between them via Route53 config. Each set of applications behind these load balancers has its own distinct blue/green target groups and its own security groups (shared between blue and green).

For various reasons, I'm looking into changing this setup to use a single load balancer, but just swap the target groups.

I'm using the CDK to do this. I'm re-using the existing security groups for the new instances behind this load balancer. When I deploy the CloudFormation, it asks me to confirm that I want to modify the security groups to allow ingress from the new load balancer. Makes sense; yes, I do.

Here is a snippet from the resulting CloudFormation for the AWS::EC2::SecurityGroupIngress addition:

Resources:
  somerandomnamebythecdk:
    Type: AWS::EC2::SecurityGroupIngress
    Properties:
      IpProtocol: tcp
      Description: Load balancer to target
      FromPort: 80
      GroupId: sg-XXXXX
      SourceSecurityGroupId: sg-YYYYYY
      ToPort: 80
    Metadata:
      aws:cdk:path: somerandomnamebythecdkformetadata:80

Here sg-XXXXX was passed as a parameter to the CDK, and sg-YYYYYY is the SG of the ALB, which is also managed outside of this particular CloudFormation. The SG itself is not part of this CloudFormation.

A "deployment" via the CDK then amounts to changing the CloudFormation template to swap which of the blue/green target groups the load balancer uses. I'm using a test listener port (8443), so whichever target group is "secondary" gets wired into that listener. If I'm totally shutting down one side, I map both the prod and test listeners (443/8443) to the same (active) target group. This roughly mimics what happens with an ECS Blue/Green Deployment. In my case, however, I'm doing this with EC2 instances/AMIs.
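The listener wiring described above can be sketched as a small function. This is only an illustration of the mapping rule, not the actual CDK code: the function name and the "blue"/"green" labels are hypothetical, and only the ports 443 and 8443 come from the question.

```python
from typing import Dict, Optional

PROD_PORT = 443   # production listener port, from the question
TEST_PORT = 8443  # test listener port, from the question


def listener_mapping(active: str, standby: Optional[str] = None) -> Dict[int, str]:
    """Return which target group each listener port should forward to.

    `active` and `standby` are hypothetical labels such as "blue" / "green".
    When one side is shut down (no standby), both listeners point at the
    active target group.
    """
    test_target = standby if standby is not None else active
    return {PROD_PORT: active, TEST_PORT: test_target}


# Both sides running: prod traffic goes to blue, the test listener to green.
print(listener_mapping("blue", "green"))
# Blue shut down: both listeners forward to green.
print(listener_mapping("green"))
```

A swap is then just a call with the two arguments exchanged, which is what the template change expresses.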

Issue

When I tear down this CloudFormation, it correctly undoes the AWS::EC2::SecurityGroupIngress addition. However, it also appears to remove the existing AWS::EC2::SecurityGroupIngress rules for the current blue/green load balancers, which makes those applications unreachable/unhealthy in the blue/green ALBs because they can't get a successful health check on port 80.

I can tell from CloudTrail that a RevokeSecurityGroupIngress event is happening. However, it's running for one of the blue/green ALBs instead of, or in addition to, the new single ALB.

Here is a sanitized version of that event:

{
    "eventVersion": "1.05",
    "userIdentity": {
        "type": "AssumedRole",
        "principalId": "ACCOUNT:MYUSER",
        "arn": "arn:aws:sts::999999999999:assumed-role/ROLE/MYUSER",
        "accountId": "999999999999",
        "sessionContext": {
            "sessionIssuer": {
                "type": "Role",
                "principalId": "ACCOUNT",
                "arn": "arn:aws:iam::999999999999:role/ROLE",
                "accountId": "999999999999",
                "userName": "ROLE"
            },
            "webIdFederationData": {},
            "attributes": {
                "mfaAuthenticated": "false",
                "creationDate": "2020-11-18T15:39:08Z"
            }
        },
        "invokedBy": "cloudformation.amazonaws.com"
    },
    "eventTime": "2020-11-18T18:14:02Z",
    "eventSource": "ec2.amazonaws.com",
    "eventName": "RevokeSecurityGroupIngress",
    "awsRegion": "us-east-1",
    "sourceIPAddress": "cloudformation.amazonaws.com",
    "userAgent": "cloudformation.amazonaws.com",
    "requestParameters": {
        "groupId": "sg-XXXXX",
        "ipPermissions": {
            "items": [
                {
                    "ipProtocol": "tcp",
                    "fromPort": 80,
                    "toPort": 80,
                    "groups": {
                        "items": [
                            {
                                "groupId": "sg-ZZZZZ"
                            }
                        ]
                    },
                    "ipRanges": {},
                    "ipv6Ranges": {},
                    "prefixListIds": {}
                }
            ]
        }
    },
    "responseElements": {
        "requestId": "GUID1",
        "_return": true,
        "unknownIpPermissionSet": {}
    },
    "requestID": "GUID1",
    "eventID": "GUID2",
    "eventType": "AwsApiCall",
    "recipientAccountId": "999999999999"
}
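To check events like this in bulk, one option is to pull the source security group IDs out of each RevokeSecurityGroupIngress event and flag any that are not the group the stack added. The sketch below assumes the event has been exported as JSON with the same shape as the sanitized event above; the function names are hypothetical, and the sample is a trimmed copy of that event.

```python
import json
from typing import List


def revoked_source_groups(event: dict) -> List[str]:
    """Collect the source security group IDs named in a revoke event."""
    permissions = event["requestParameters"]["ipPermissions"]["items"]
    return [
        group["groupId"]
        for permission in permissions
        # "groups" may be an empty object when the rule uses CIDR ranges.
        for group in permission.get("groups", {}).get("items", [])
    ]


def unexpected_revocations(event: dict, expected_source: str) -> List[str]:
    """Return revoked source groups other than the one the stack added."""
    return [g for g in revoked_source_groups(event) if g != expected_source]


# Trimmed copy of the sanitized event above.
sample_event = json.loads("""
{
  "eventName": "RevokeSecurityGroupIngress",
  "requestParameters": {
    "groupId": "sg-XXXXX",
    "ipPermissions": {"items": [
      {"ipProtocol": "tcp", "fromPort": 80, "toPort": 80,
       "groups": {"items": [{"groupId": "sg-ZZZZZ"}]},
       "ipRanges": {}}
    ]}
  }
}
""")

# The stack only added ingress from sg-YYYYYY, so sg-ZZZZZ is flagged.
print(unexpected_revocations(sample_event, expected_source="sg-YYYYYY"))
```

For the event shown, the flagged group is sg-ZZZZZ, the SG of one of the existing blue/green ALBs, rather than sg-YYYYYY.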

Question

Is there something I'm doing wrong that causes the teardown to delete ingress rules that aren't associated with this CloudFormation?

Has anyone seen this before?

brendonparker
  • Change of any property, except Description, in [AWS::EC2::SecurityGroupIngress](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-properties-ec2-security-group-ingress.html#cfn-ec2-security-group-ingress-groupid) requires replacement of the resource. Not sure if this is what you are experiencing? – Marcin Nov 19 '20 at 22:24
  • @Marcin thanks. Not quite. In this case it is an entirely new SecurityGroupIngress being added to an existing SecurityGroup. I'd expect the tear down to remove the SecurityGroupIngress. But wouldn't expect it to remove previously existing SecurityGroupIngress that were either manually configured, or added by a different CloudFormation template. – brendonparker Nov 19 '20 at 22:44
  • I think I know what you are referring to. It seems to me that what you are observing happens because your existing `SecurityGroupIngress` is not fully managed by the stack being updated. If it was added manually outside of it, or by some other template, then it will be considered an external change. This is essentially treated as [stack/resource drift](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/using-cfn-stack-drift.html). – Marcin Nov 19 '20 at 23:58
  • If that is indeed the case, then I assume I would need to bring everything under the same cloud formation template, which would become even more unwieldy :( – brendonparker Nov 20 '20 at 01:21
  • I think so, if I understand the issue correctly. You can read more about drift and the issues it causes to verify that it applies to your problem. – Marcin Nov 20 '20 at 01:25
