I have an ECS cluster consisting of two instances in different AZs. One of the many services I run is an SMTP relay. I want to put a Network Load Balancer in front of this service so that other applications can easily be configured to use the relay.

Once I got everything in place, I faced the following issue:

If the container runs on instance 'A', only instance 'B' can reach it through the load balancer, and vice versa; a request from the instance that hosts the container times out. So the Network Load Balancer seems to prevent access to a service that lives on the same instance as the client.

Is there something I'm missing here? Is anyone aware of this, and is there a workaround?

Update: When scaling the service to 2 instances it started to work. I now tend to believe it's related to the Availability Zones.

Laurent Jalbert Simard
  • Check /etc/resolv.conf, and also the ELB security groups. – Abdennour TOUMI Nov 13 '17 at 20:05
  • @AbdennourTOUMI Network Load Balancers don't have security groups (yes, it's a shocker). As for a name resolution trick, it would not help, considering the target port is the one from the load balancer, not the one of the actual service. – Laurent Jalbert Simard Nov 13 '17 at 20:17

2 Answers

7

I experienced a similar issue.

Here is my setup:

  • A VPC spread over 3 AZs.
  • 3 public subnets (one in each AZ)
  • 1 instance in a public subnet in AZ-a
  • 3 private subnets (one in each AZ)
  • 1 NLB spread over the 3 private subnets.
  • A cluster of ECS instances. 1 instance in each private subnet. (instance-a in AZ-a, instance-b in AZ-b, instance-c in AZ-c)
  • A service running on each instance; in total, 3 healthy services spread over the 3 private subnets and registered to the NLB.
  • A Route 53 alias record mapping "myservice.example.com" to the NLB DNS name.

Below are the tests I ran:

Query initiated from an instance in a private subnet

Test1: From instance-a (in AZ-a), query "myservice.example.com".

Result1: The query hits the NLB on one of its private IPs. If that IP is in the same subnet as instance-a, the query times out. If the IP is in a different subnet, the query succeeds.

Test2: Same as Test1 but query from instance-b (in AZ-b).

Result2: The query hits the NLB on one of its private IPs. If that IP is in the same subnet as instance-b, the query times out. If the IP is in a different subnet, the query succeeds.

Similar result with a query initiated from instance-c.

Query initiated from an instance in a public subnet AZ-a

Test3: From the instance in public subnet in AZ-a, query "myservice.example.com".

Result3: The query hits the NLB on one of its private IPs. The query always succeeds, regardless of which private IP was hit.

Query initiated from an extra instance (instance-a2) in private subnet AZ-a

Test4: I launched an additional instance (instance-a2) in the private subnet in AZ-a. Then, from instance-a2, I queried "myservice.example.com". IMPORTANT: This instance does not run any service and therefore can never be selected by the NLB as a target.

Result4: The query succeeds all the time! Even when hitting a target that is in the private subnet A (same subnet as instance-a2).
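
For anyone who wants to repeat these tests, here is a minimal sketch of how the checks could be scripted, assuming plain TCP and Python's standard library only. It resolves the NLB name to its IPs and tries a connection against each one, so you can see which NLB IP times out from which instance. The port (25 in the usage comment) and the 5-second timeout are my assumptions; substitute whatever your listener uses.

```python
import socket


def resolve_ipv4(hostname, port):
    """Return the distinct IPv4 addresses that `hostname` resolves to."""
    infos = socket.getaddrinfo(
        hostname, port, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    return sorted({info[4][0] for info in infos})


def probe(ip, port, timeout=5.0):
    """Attempt one TCP connection; return 'ok', 'timeout', or 'error: ...'."""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return "ok"
    except socket.timeout:
        return "timeout"
    except OSError as exc:
        return f"error: {exc}"


def probe_all(hostname, port, timeout=5.0):
    """Probe every IPv4 address behind `hostname` and map IP -> result."""
    return {ip: probe(ip, port, timeout) for ip in resolve_ipv4(hostname, port)}


# Example usage (hostname from the setup above, port is an assumption):
#   for ip, result in probe_all("myservice.example.com", 25).items():
#       print(ip, result)
```

Run it from each instance in turn (instance-a, instance-b, the public instance, instance-a2) and compare which NLB IPs report "timeout".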

Conclusions:

  • With Test1 and Test2, I reproduced the same issue as Laurent Jalbert Simard when querying from an instance that was hosting the target service.
  • Per Test3, the issue does not seem to be caused by requests coming from the same AZ as the target service.
  • With Test4, it appears that the issue cannot be reproduced if the query comes from an instance other than the one hosting the target service, even if both are in the same subnet.

Therefore, my conclusion so far is that the request times out if the source IP of the request and the IP of the target selected by the NLB are the same.

I couldn't find this issue/limitation documented in the AWS NLB docs, and so far nothing comes up in a Google search. Has anybody out there reached the same conclusion?

4

Solution: If you want to keep containers on the same instance and still use an NLB, set networkMode to "awsvpc" in your task definition and change the target group's target type to "ip" (instead of registering targets by instance ID).
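
As a rough sketch of what that change could look like with boto3: the two helper functions below only build the request parameters, and `apply` sends them. The names `smtp-relay-ip` and `smtp-relay`, the port, the memory value, and the helper functions themselves are placeholders I made up for illustration; they are not taken from the original setup.

```python
def target_group_params(vpc_id, port):
    """Parameters for elbv2.create_target_group with IP-based targets."""
    return {
        "Name": "smtp-relay-ip",  # hypothetical name
        "Protocol": "TCP",
        "Port": port,
        "VpcId": vpc_id,
        "TargetType": "ip",  # register task IPs, not instance IDs
    }


def task_definition_params(image, port):
    """Parameters for ecs.register_task_definition using awsvpc networking."""
    return {
        "family": "smtp-relay",  # hypothetical family name
        "networkMode": "awsvpc",  # each task gets its own network interface
        "requiresCompatibilities": ["EC2"],
        "containerDefinitions": [
            {
                "name": "smtp-relay",  # hypothetical container name
                "image": image,
                "memory": 128,  # placeholder value
                "essential": True,
                "portMappings": [{"containerPort": port, "protocol": "tcp"}],
            }
        ],
    }


def apply(vpc_id, image, port=25):
    """Create the target group and task definition; needs AWS credentials."""
    import boto3  # imported here so the helpers above work without boto3

    elbv2 = boto3.client("elbv2")
    ecs = boto3.client("ecs")
    tg = elbv2.create_target_group(**target_group_params(vpc_id, port))
    td = ecs.register_task_definition(**task_definition_params(image, port))
    return (
        tg["TargetGroups"][0]["TargetGroupArn"],
        td["taskDefinition"]["taskDefinitionArn"],
    )
```

The ECS service created from this task definition would also need a `networkConfiguration` with an `awsvpcConfiguration` (subnets and security groups), since each task gets its own network interface in this mode.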

Explanation: NLBs don't support hairpinning of requests. When you register targets by instance ID, the source IP addresses of clients are preserved. When you connect to the NLB from a backend instance and the request is routed back to that same instance, a loopback is created: the source and destination addresses are the same, which the NLB does not allow, so the connection times out. If an instance is a client of an internal load balancer that it is registered with by instance ID, the connection succeeds only if the request is routed to a different instance.
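
To make the rule concrete, here is a toy model of it in Python. It is only an illustration of the behaviour described above, not how the NLB is implemented; the function name, the `nlb_node_ip` parameter, and all IP addresses are made up.

```python
def connection_succeeds(client_ip, target_ip,
                        preserve_client_ip=True, nlb_node_ip="10.0.1.5"):
    """Toy rule: the flow fails when the target would see its own IP as source.

    With client IP preservation, the target sees the client's address;
    without it, the target sees the (hypothetical) NLB node address instead.
    """
    source_seen_by_target = client_ip if preserve_client_ip else nlb_node_ip
    return source_seen_by_target != target_ip


# Hypothetical addresses: instance-a is 10.0.1.10, instance-b is 10.0.2.10,
# instance-a2 is 10.0.1.20, and an awsvpc task on instance-a is 10.0.1.57.
print(connection_succeeds("10.0.1.10", "10.0.2.10"))  # instance-a -> target on instance-b
print(connection_succeeds("10.0.1.10", "10.0.1.10"))  # instance-a -> itself (hairpin)
print(connection_succeeds("10.0.1.20", "10.0.1.10"))  # instance-a2 -> instance-a
print(connection_succeeds("10.0.1.10", "10.0.1.57"))  # instance-a -> awsvpc task IP
```

Only the second case is a hairpin. In the last case the task has its own IP, so source and destination differ even though the task runs on the client's instance, which is why the "awsvpc" plus "ip" target type combination avoids the timeout.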

Some extra info: https://aws.amazon.com/premiumsupport/knowledge-center/target-connection-fails-load-balancer/

James
  • I was facing a similar issue; https://aws.amazon.com/premiumsupport/knowledge-center/target-connection-fails-load-balancer/ solved my problem. – pbms May 03 '21 at 14:38