I experienced a similar issue.
Here is my setup:
- A VPC spread over 3 AZs.
- 3 public subnets (one in each AZ)
- 1 instance in a public subnet in AZ-a
- 3 private subnets (one in each AZ)
- 1 NLB spread over the 3 private subnets.
- A cluster of ECS instances. 1 instance in each private subnet. (instance-a in AZ-a, instance-b in AZ-b, instance-c in AZ-c)
- A service running on each instance; in total, 3 healthy services spread over the 3 private subnets and registered to the NLB.
- A Route 53 Alias record mapping "myservice.example.com" to the NLB DNS name.
Below are the tests I ran:
Query initiated from an instance in a private subnet
Test1: From instance-a (in AZ-a), query "myservice.example.com".
Result1: The query hits the NLB on one of its private IPs. If the IP is in the same subnet as instance-a, the query times out. If the IP is in a different subnet, the query succeeds.
Test2: Same as Test1 but query from instance-b (in AZ-b).
Result2: The query hits the NLB on one of its private IPs. If the IP is in the same subnet as instance-b, the query times out. If the IP is in a different subnet, the query succeeds.
Similar result with a query initiated from instance-c.
Query initiated from an instance in a public subnet AZ-a
Test3: From the instance in public subnet in AZ-a, query "myservice.example.com".
Result3: The query hits the NLB on one of its private IPs. The query always succeeds, regardless of which private IP was hit.
Query initiated from an extra instance (instance-a2) in private subnet AZ-a
Test4: I launched an additional instance (instance-a2) in the private subnet in AZ-a. Then, from instance-a2, I queried "myservice.example.com". IMPORTANT: This instance does not run any service and therefore can never be selected by the NLB as a target.
Result4: The query succeeds all the time! Even when hitting a target that is in the private subnet A (same subnet as instance-a2).
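For anyone who wants to repeat these tests, here is a rough Python sketch of how they could be scripted. It resolves the hostname to its IPv4 addresses and tries a few TCP connections to each one, so you can see which NLB IPs time out from a given instance. The function names, the 3-second timeout and the number of attempts are my own choices for illustration; the port your service listens on is something you need to fill in.

```python
import socket


def resolve_ipv4(hostname: str, port: int) -> list[str]:
    """Return the distinct IPv4 addresses the hostname resolves to."""
    infos = socket.getaddrinfo(
        hostname, port, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr is (ip, port).
    return sorted({info[4][0] for info in infos})


def probe(ip: str, port: int, timeout: float = 3.0) -> str:
    """Attempt one TCP connection and classify the outcome."""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return "success"
    except socket.timeout:
        # No answer within `timeout` seconds.
        return "timeout"
    except OSError as exc:
        # Any other failure, e.g. connection refused.
        return f"error: {exc}"


def run(hostname: str, port: int, attempts: int = 5) -> dict[str, list[str]]:
    """Probe every resolved IP several times and collect the outcomes per IP."""
    return {
        ip: [probe(ip, port) for _ in range(attempts)]
        for ip in resolve_ipv4(hostname, port)
    }
```

Running something like `run("myservice.example.com", 80)` (port 80 is only a placeholder) from instance-a, instance-a2 and the public instance, then comparing the per-IP outcomes, should show the same pattern as in the results above if the behaviour is reproducible in your environment.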
Conclusions:
- With Test1 and Test2, I reproduced the same issue as Laurent Jalber Simard when querying from an instance that was hosting the target service.
- Per Test3, the issue does not seem to be caused by requests originating in the same AZ as the target service.
- With Test4, it appears that the issue cannot be reproduced if the query comes from an instance other than the one hosting the target service, even if both are in the same subnet.
Therefore, my conclusion so far is that the NLB will time out if the source IP of the request and the destination IP of the target selected by the NLB are the same.
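To check that this hypothesis is consistent with all four tests, here is a tiny model of my setup in Python. It rests on one assumption I have not verified: that cross-zone load balancing is off, so the NLB node in a given AZ only forwards to the target in that same AZ. The instance names match the ones above; `public-instance` and the `predict` function are made up for the illustration.

```python
# Assumed mapping: with cross-zone load balancing off, each NLB node
# forwards only to the single target registered in its own AZ.
TARGET_IN_AZ = {"a": "instance-a", "b": "instance-b", "c": "instance-c"}


def predict(client: str, nlb_node_az: str) -> str:
    """Predict the outcome when `client` connects to the NLB node in `nlb_node_az`."""
    target = TARGET_IN_AZ[nlb_node_az]
    # Hypothesis: the request times out when the client is also the selected target.
    return "timeout" if client == target else "success"


if __name__ == "__main__":
    for client in ("instance-a", "instance-b", "instance-a2", "public-instance"):
        for az in "abc":
            print(f"{client} -> NLB node in AZ-{az}: {predict(client, az)}")
```

Under that assumption the model gives a timeout only when instance-a, instance-b or instance-c reaches the NLB IP in its own AZ, and a success for instance-a2 and the public instance in every case, which is what I observed.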
I couldn't find this issue/limitation documented in AWS NLB docs and so far nothing comes up in a Google search.
Has anybody out there reached the same conclusion?