I have been struggling with a VPN setup between on-prem and GCP for more than a week. I am completely out of ideas at this point, and would love some help from network specialists.
Goal
The end goal is simple: to get a VM instance on GCP to seamlessly talk to a VM on-prem - but with 2 routers in play.
The setup is something like below:
GCP_VM                                                   OP_VM
10.0.0.25                                                10.100.0.200
  |                                                        |
  |                                              (DC Router Gateway)
  |                                                   10.100.0.80
  |                                                        |
  └-- HA_VPN (AS65001) <==========> Router (AS65002) ------┘
Public IP:    xx.xx.xx.xx            yy.yy.yy.yy
Advertise:    10.0.0.0/24 BGP        10.100.0.0/24 BGP
VPN IP Range: 169.254.0.1/30         169.254.0.2 (as Peer)
Private IP:   NA                     10.100.0.50
The complication here is that Router is not directly connected to OP_VM. This is the on-prem setup, which we have no control over. OP_VM gets its IP 10.100.0.200 from some other router, and our Router is placed on the same LAN. We only get a single rack in the data centre, and need to reach OP_VM, which is hosted by another party (in some other rack). Our rack is associated with 10.100.0.50.
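The addressing described above can be sanity-checked with a short sketch. It uses only the addresses from the diagram and assumes the on-prem LAN is exactly the advertised 10.100.0.0/24; the `HOSTS` table and the `on_link` helper are names made up for illustration. It shows that Router, the DC gateway, and OP_VM all sit in the same prefix, while GCP_VM does not.

```python
import ipaddress

# Addresses taken from the diagram in the question.
# Assumption: the on-prem LAN is exactly the advertised /24.
ON_PREM_LAN = ipaddress.ip_network("10.100.0.0/24")
HOSTS = {
    "GCP_VM": ipaddress.ip_address("10.0.0.25"),
    "Router": ipaddress.ip_address("10.100.0.50"),
    "DC_gateway": ipaddress.ip_address("10.100.0.80"),
    "OP_VM": ipaddress.ip_address("10.100.0.200"),
}


def on_link(src: str, dst: str, lan=ON_PREM_LAN) -> bool:
    """Return True when both named hosts fall inside the same LAN prefix."""
    return HOSTS[src] in lan and HOSTS[dst] in lan


if __name__ == "__main__":
    for name, addr in HOSTS.items():
        print(f"{name:10s} inside {ON_PREM_LAN}: {addr in ON_PREM_LAN}")
    print("Router -> OP_VM on-link:", on_link("Router", "OP_VM"))
```

If the assumption holds, Router can normally reach OP_VM directly via ARP on the LAN, without sending that traffic through a gateway.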
With this in place, I want the following to work:
me@GCP_VM:10.0.0.25:~$ ping 10.100.0.200
Current Status
With the above setup, VPN and BGP seem healthy from the logs on both sides.
From GCP_VM, I can ping 10.100.0.50 (Router) successfully.
me@GCP_VM:10.0.0.25:~$ ping 10.100.0.50
PING 10.100.0.50 (10.100.0.50) 56(84) bytes of data.
64 bytes from 10.100.0.50: icmp_seq=1 ttl=254 time=24.9 ms
...
Also, from Router, I have confirmed that I can ping 10.100.0.200 (OP_VM).
# With the Router setup of something like
#
# ip route 10.100.0.0/24 gateway 10.100.0.80
root@Router:10.100.0.50:~$ ping 10.100.0.200
ping 10.100.0.200
received from 10.100.0.200: icmp_seq=0 ttl=63 time=0.583ms
received from 10.100.0.200: icmp_seq=1 ttl=63 time=0.571ms
2 packets transmitted, 2 packets received, 0.0% packet loss
round-trip min/avg/max = 0.571/0.577/0.583 ms
From GCP_VM, though, pings to 10.100.0.200 (OP_VM) go missing.
# With the Router setup of something like
#
# ip route 10.100.0.0/24 gateway 10.100.0.80
me@GCP_VM:10.0.0.25:~$ ping 10.100.0.200
PING 10.100.0.200 (10.100.0.200) 56(84) bytes of data.
^C
--- 10.100.0.200 ping statistics ---
4 packets transmitted, 0 received, 100% packet loss, time 3051ms
I'm probably misunderstanding the gateway setup, but changing the route like below gives me a different result:
# With the Router setup of something like
#
# ip route 10.100.0.0/24 gateway 10.100.0.50
# ~~ <- Router itself
me@GCP_VM:10.0.0.25:~$ ping 10.100.0.200
PING 10.100.0.200 (10.100.0.200) 56(84) bytes of data.
From 169.254.0.2 icmp_seq=7 Destination Host Unreachable
From 169.254.0.2 icmp_seq=6 Destination Host Unreachable
From 169.254.0.2 icmp_seq=5 Destination Host Unreachable
From 169.254.0.2 icmp_seq=4 Destination Host Unreachable
From 169.254.0.2 icmp_seq=3 Destination Host Unreachable
From 169.254.0.2 icmp_seq=2 Destination Host Unreachable
From 169.254.0.2 icmp_seq=1 Destination Host Unreachable
^C
--- 10.100.0.200 ping statistics ---
9 packets transmitted, 0 received, +7 errors, 100% packet loss, time 8141ms
pipe 7
With this gateway setup, Router can no longer ping OP_VM. To me, this at least suggests that the VPN is established and the IP range is advertised correctly. But it does not look right from an actual networking point of view.
Questions
I don't think there is much more to be done on GCP side, and the issue seems to be purely on the on-prem.
Are there any setup issues or concerns that may cause misbehaviour of VPN, BGP, ARP, etc.? What would cause a case like this, where routes seem to be shared but the destinations cannot actually be reached?
Other Notes
- I have confirmed the ARP table on Router includes 10.100.0.200
- I can see the routes propagated in GCP
- I have tested with GCP VPC's Firewall set up to allow 169.254.0.0/30 and 10.100.0.0/24
- I will need access from GKE in the end, but I have confirmed GKE shows exactly the same behaviour as GCP_VM
- Router is from Yamaha
- Tried tcpdump (packetdump in Yamaha routers), but did not see 10.0.0.25 in the log
- tcpdump did show a trace of 10.0.0.25 when I ran nmap -Pn 10.100.0.200 from GCP_VM, but only a single line like this:
2019/12/21 16:35:40: LAN1 OUT:IP TCP 10.100.0.227:50516 > 10.103.24.1:80
Update (24th Dec)
I have run tcpdump for a simple ping between GCP_VM and Router.
From GCP_VM to Router (logs from GCP_VM)
$ ping 10.100.0.50 > /dev/null &
$ sudo tcpdump -i eth0 | grep 10.100
...
18:49:18.696178 IP GCP_VM.(snip) > 10.100.0.50: ICMP echo request, id 32396, seq 0, length 64
18:49:18.700395 IP 10.100.0.50 > GCP_VM.(snip): ICMP echo reply, id 32396, seq 0, length 64
From Router to GCP_VM (logs from GCP_VM)
# ping from Router, with `ping 10.0.0.25`
$ sudo tcpdump -i eth0 | grep 169.254
...
18:40:18.554555 IP 169.254.0.2 > GCP_VM.(snip): ICMP echo request, id 3369, seq 0, length 72
18:40:18.554586 IP GCP_VM.(snip) > 169.254.0.2: ICMP echo reply, id 3369, seq 0, length 72
Although tcpdump shows the reply being sent here, it is never received by Router.
Also, a ping to 169.254.0.2 from GCP_VM gets no reply.
$ ping 169.254.0.2 > /dev/null &
$ sudo tcpdump -i eth0 | grep 169.254
...
18:59:07.113101 IP GCP_VM.(snip) > 169.254.0.2: ICMP echo request, id 32531, seq 0, length 64
18:59:08.137103 IP GCP_VM.(snip) > 169.254.0.2: ICMP echo request, id 32531, seq 1, length 64
...
Update (27th Dec)
Ping from Router to GCP_VM succeeded after setting its source address to 10.100.0.50, as it was using 169.254.0.2 by default.
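One possible reading of why the default source address failed, sketched under the assumption that the only on-prem prefix GCP learns over BGP is the 10.100.0.0/24 shown in the diagram: 169.254.0.2 is a link-local tunnel address outside that prefix, whereas 10.100.0.50 is covered by it. The `classify_source` helper is a hypothetical name used only for this illustration.

```python
import ipaddress

# Assumption: GCP only learns this on-prem prefix over BGP (from the diagram).
ADVERTISED_TO_GCP = [ipaddress.ip_network("10.100.0.0/24")]


def classify_source(source: str) -> dict:
    """Report whether a ping source address is link-local and whether it
    falls inside a prefix advertised to GCP."""
    addr = ipaddress.ip_address(source)
    return {
        "link_local": addr.is_link_local,
        "advertised": any(addr in net for net in ADVERTISED_TO_GCP),
    }


if __name__ == "__main__":
    for source in ("169.254.0.2", "10.100.0.50"):
        print(source, classify_source(source))
```

This does not prove where the replies to 169.254.0.2 were dropped; it only shows that the two candidate source addresses differ in a way that matches the observed behaviour.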
The ping from GCP_VM still doesn't reach OP_VM, and I'm still working through a NAT configuration issue to make sure the translation is applied correctly.
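The post does not say which NAT rule is involved, so the following is only a toy model of one common approach, source NAT on Router, and not the configuration actually used. It assumes that OP_VM has no route back to 10.0.0.0/24, which the question does not confirm. The `Packet`, `snat`, and `reply_to` names are invented for this sketch.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Packet:
    src: str
    dst: str


# Router's LAN address from the diagram.
ROUTER_LAN_IP = "10.100.0.50"


def snat(packet: Packet, new_src: str = ROUTER_LAN_IP) -> Packet:
    """Rewrite the source address, as a source-NAT rule on Router would."""
    return replace(packet, src=new_src)


def reply_to(packet: Packet) -> Packet:
    """Build the reply a host would send: source and destination swapped."""
    return Packet(src=packet.dst, dst=packet.src)


if __name__ == "__main__":
    ping = Packet(src="10.0.0.25", dst="10.100.0.200")
    print("reply without NAT goes to:", reply_to(ping).dst)
    print("reply with source NAT goes to:", reply_to(snat(ping)).dst)
```

In this model, the translated reply is addressed to 10.100.0.50, which OP_VM can reach on its own LAN, whereas the untranslated reply is addressed to 10.0.0.25 and depends on routing that the on-prem side may not have.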
Update (31st Dec)
The connection has finally been set up. I'll summarise the steps taken in a separate answer to declutter the question.