AWS ECS Iptables allow source and destination to be the same ip address

Question

Currently, with AWS ECS combined with an internal NLB it is impossible to have inter-system communication. Meaning container 1 (on instance 1) -> internal NLB -> container 2 (on instance 1). Because the source IP address does not change and stays the same as the destination address the ECS instance drops this traffic.

I found a thread on the AWS forums here https://forums.aws.amazon.com/message.jspa?messageID=806936#806936 explaining my problem.

I've contacted AWS Support and they stated to have a fix on their roadmap but they cannot tell me when this will be fixed so I am looking into ways to solve it on my own until AWS have fixed it permanently.

It must be fixable by altering the ECS iptables but I have not enough knowledge to completely read their iptables setup and to understand what needs to be changed to fix this.

iptabels-save output:

:DOCKER - [0:0]
:DOCKER-ISOLATION - [0:0]
:DOCKER-USER - [0:0]
-A FORWARD -j DOCKER-USER
-A FORWARD -j DOCKER-ISOLATION
-A FORWARD -o docker0 -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
-A FORWARD -o docker0 -j DOCKER
-A FORWARD -i docker0 ! -o docker0 -j ACCEPT
-A FORWARD -i docker0 -o docker0 -j ACCEPT
-A DOCKER -d 172.17.0.3/32 ! -i docker0 -o docker0 -p tcp -m tcp --dport 5000 -j ACCEPT
-A DOCKER -d 172.17.0.2/32 ! -i docker0 -o docker0 -p tcp -m tcp --dport 5000 -j ACCEPT
-A DOCKER -d 172.17.0.5/32 ! -i docker0 -o docker0 -p tcp -m tcp --dport 8086 -j ACCEPT
-A DOCKER-ISOLATION -j RETURN
-A DOCKER-USER -j RETURN
COMMIT
# Completed on Wed Jan 31 22:19:47 2018
# Generated by iptables-save v1.4.18 on Wed Jan 31 22:19:47 2018
*nat
:PREROUTING ACCEPT [0:0]
:INPUT ACCEPT [0:0]
:OUTPUT ACCEPT [38:2974]
:POSTROUTING ACCEPT [7147:429514]
:DOCKER - [0:0]
-A PREROUTING -m addrtype --dst-type LOCAL -j DOCKER
-A PREROUTING -d 169.254.170.2/32 -p tcp -m tcp --dport 80 -j DNAT --to-destination 127.0.0.1:51679
-A OUTPUT ! -d 127.0.0.0/8 -m addrtype --dst-type LOCAL -j DOCKER
-A OUTPUT -d 169.254.170.2/32 -p tcp -m tcp --dport 80 -j REDIRECT --to-ports 51679
-A POSTROUTING -s 172.17.0.0/16 ! -o docker0 -j MASQUERADE
-A POSTROUTING -s 172.17.0.3/32 -d 172.17.0.3/32 -p tcp -m tcp --dport 5000 -j MASQUERADE
-A POSTROUTING -s 172.17.0.2/32 -d 172.17.0.2/32 -p tcp -m tcp --dport 5000 -j MASQUERADE
-A POSTROUTING -s 172.17.0.5/32 -d 172.17.0.5/32 -p tcp -m tcp --dport 8086 -j MASQUERADE
-A DOCKER -i docker0 -j RETURN
-A DOCKER ! -i docker0 -p tcp -m tcp --dport 32769 -j DNAT --to-destination 172.17.0.3:5000
-A DOCKER ! -i docker0 -p tcp -m tcp --dport 32777 -j DNAT --to-destination 172.17.0.2:5000
-A DOCKER ! -i docker0 -p tcp -m tcp --dport 32792 -j DNAT --to-destination 172.17.0.5:8086
COMMIT
# Completed on Wed Jan 31 22:19:47 2018

ip a:

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
   valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
   valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc mq state UP group default qlen 1000
    link/ether 0a:b4:86:0b:c0:c4 brd ff:ff:ff:ff:ff:ff
    inet 10.12.80.181/26 brd 10.12.80.191 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::8b4:86ff:fe0b:c0c4/64 scope link
       valid_lft forever preferred_lft forever
3: docker0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
    link/ether 02:42:ca:cf:36:ae brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.1/16 scope global docker0
       valid_lft forever preferred_lft forever
    inet6 fe80::42:caff:fecf:36ae/64 scope link
       valid_lft forever preferred_lft forever
7: vethbd1da82@if6: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master docker0 state UP group default
    link/ether 36:6d:d6:bd:d5:d8 brd ff:ff:ff:ff:ff:ff link-netnsid 1
    inet6 fe80::346d:d6ff:febd:d5d8/64 scope link
       valid_lft forever preferred_lft forever
27: vethc65a98f@if26: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master docker0 state UP group default
    link/ether e6:cf:79:d4:aa:7a brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet6 fe80::e4cf:79ff:fed4:aa7a/64 scope link
       valid_lft forever preferred_lft forever
57: veth714e7ab@if56: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master docker0 state UP group default
    link/ether 1e:c2:a5:02:f6:ee brd ff:ff:ff:ff:ff:ff link-netnsid 3
    inet6 fe80::1cc2:a5ff:fe02:f6ee/64 scope link
       valid_lft forever preferred_lft forever

score 2 · Accepted Answer · edited May 23 '18 at 20:16

I have no information about upcoming solutions, but I suspect any workaround will involve preventing an instance from connecting to itself and instead always connect to a different instance... or perhaps use the balancer's source address for hairpinned connections instead of the originating address.

The fundamental problem is this: the balancer works by integrating with the network infrastructure, and doing network address translation, altering the original target address on the way out, and the source address on the way back in, so that the instance in the target group sees the real source address of the client side, but not the other way around... but this is not compatible with asymmetric routing. When the instance ends up talking to itself, the route is quite asymmetric.

Assume the balancer is 172.30.1.100 and the instance is 172.30.2.200.

A TCP connection is initiated from 172.30.2.200 (instance) to 172.30.1.100 (balancer). The ports are not really important, but let's assume the source port is 49152 (ephemeral) and the balancer target port is 80 and the instance target port is 8080.

172.30.2.200:49152 > 172.30.1.100:80 SYN

The NLB is a NAT device, so this is translated:

172.30.2.200:49152 > 172.30.2.200:8080 SYN

This is sent back to the instance.

This already doesn't make sense, because the instance just got an incoming request from itself, from something external, even though it didn't make that request.

Assuming it responds, rather than dropping what is already a nonsense packet, now you have this:

172.30.2.200:8080 > 172.30.2.200:49152 SYN+ACK

If 172.30.2.200:49152 had actually sent a packet to 172.20.2.200:8080 it would respond with an ACK and the connection would be established.

But it didn't.

The next thing that happens should be something like this:

172.30.2.200:49152 > 172.30.2.200:8080 RST

Meanwhile, 172.30.2.200:49152 has heard nothing back from 172.30.1.100:80, so it will retry and then eventually give up: Connection timed out.

When the source and destination machine are different, NLB works because it's not a real (virtual) machine like those provided by ELB/ALB—it is something done by the network itself. That is the only possible explanation because those packets with translated addresses otherwise do make it back to the original machine with the NAT occurring in the reverse direction, and that could only happen if the VPC network were keeping state tables of these connections and translating them.

Note that in VPC, the default gateway isn't real. In fact, the subnets aren't real. The Ethernet network isn't real. (And none of this is a criticism. There's some utterly brilliant engineering in evidence here.) All of it is emulated by the software in the VPC network infrastructure. When two machines on the same subnet talk to each other directly... well, they don't.¹ They are talking over a software-defined network. As such, the network can see these packets and do the translation required by NLB, even when machines are on the same subnet.

But not when a machine is talking to itself, because when that happens, the traffic never appears on the wire—it remains inside the single VM, out of the reach of the VPC network infrastructure.

I don't believe an instance-based workaround is possible.

¹ They don't. A very interesting illustration of this is to monitor traffic on two instances on the same subnet with Wireshark. Open the security groups, then ping one instance from the other. The source machine sends an ARP request and appears to get an ARP response from the target... but there's no evidence of this ARP interaction on the target. That's because it doesn't happen. The network handles the ARP response for the target instance. This is part of the reason why it isn't possible to spoof one instance from another—packets that are forged are not forwarded by the network, because they are clearly not valid, and the network knows it. After that ARP occurs, the ping is normal. The traffic appears to go directly from instance to instance, based on the layer 2 headers, but that is not what actually occurs.

AWS ECS Iptables allow source and destination to be the same ip address

1 Answers1