VPN between on prem and GCP: routes shared but ping doesn't go through

Question

I have been struggling with the VPN setup between on-prem and GCP for more than a week. I am completely out of ideas at this point, and would love to get some help from network specialists.

Goal

The end goal is simple: to get a VM instance on GCP to seamlessly talk to a VM on-prem - but with 2 routers in play.
The setup is something like below:

       GCP_VM                                                           OP_VM
    10.0.0.25                                                    10.100.0.200
            |                                                    |
            |                                           (DC Router Gateway)
            |                                               10.100.0.80
            |                                                    |
            └-- HA_VPN (AS65001) <==========> Router (AS65002) --┘

     Public IP: xx.xx.xx.xx                   yy.yy.yy.yy
     Advertise: 10.0.0.0/24 BGP               10.100.0.0/24 BGP
  VPN IP Range: 169.254.0.1/30                169.254.0.2 (as Peer)
    Private IP: NA                            10.100.0.50

The complication is that Router is not directly connected to OP_VM. This is the on-prem setup, which we have no control over. OP_VM gets its IP 10.100.0.200 from some other router, and our Router is placed on the same LAN. We only get a single rack in the data centre, and need to reach OP_VM, which is hosted by another party (in some other rack). Our rack is associated with 10.100.0.50.

With this setup, I want to be able to get the below to work:

me@GCP_VM:10.0.0.25:~$ ping 10.100.0.200

Current Status

With the above setup, VPN and BGP seem healthy from the logs on both sides.

From GCP_VM, I can ping 10.100.0.50 (Router) successfully.

me@GCP_VM:10.0.0.25:~$ ping 10.100.0.50
PING 10.100.0.50 (10.100.0.50) 56(84) bytes of data.
64 bytes from 10.100.0.50: icmp_seq=1 ttl=254 time=24.9 ms
...

Also, from Router, I could confirm I can ping 10.100.0.200 (OP_VM).

# With the Router setup of something like
#
#     ip route 10.100.0.0/24 gateway 10.100.0.80

root@Router:10.100.0.50:~$ ping 10.100.0.200
ping 10.100.0.200
received from 10.100.0.200: icmp_seq=0 ttl=63 time=0.583ms
received from 10.100.0.200: icmp_seq=1 ttl=63 time=0.571ms

2 packets transmitted, 2 packets received, 0.0% packet loss
round-trip min/avg/max = 0.571/0.577/0.583 ms

From GCP_VM, though, ping to 10.100.0.200 (OP_VM) goes missing.

# With the Router setup of something like
#
#     ip route 10.100.0.0/24 gateway 10.100.0.80

me@GCP_VM:10.0.0.25:~$ ping 10.100.0.200
PING 10.100.0.200 (10.100.0.200) 56(84) bytes of data.
^C
--- 10.100.0.200 ping statistics ---
4 packets transmitted, 0 received, 100% packet loss, time 3051ms

I'm probably misunderstanding the gateway setup, but changing the route like below gives me a different result:

# With the Router setup of something like
#
#     ip route 10.100.0.0/24 gateway 10.100.0.50
#                                             ~~ <- Router itself

me@GCP_VM:10.0.0.25:~$ ping 10.100.0.200
PING 10.100.0.200 (10.100.0.200) 56(84) bytes of data.
From 169.254.0.2 icmp_seq=7 Destination Host Unreachable
From 169.254.0.2 icmp_seq=6 Destination Host Unreachable
From 169.254.0.2 icmp_seq=5 Destination Host Unreachable
From 169.254.0.2 icmp_seq=4 Destination Host Unreachable
From 169.254.0.2 icmp_seq=3 Destination Host Unreachable
From 169.254.0.2 icmp_seq=2 Destination Host Unreachable
From 169.254.0.2 icmp_seq=1 Destination Host Unreachable
^C
--- 10.100.0.200 ping statistics ---
9 packets transmitted, 0 received, +7 errors, 100% packet loss, time 8141ms
pipe 7

With this gateway setup, Router can no longer ping OP_VM. To me, this at least suggests that the VPN is established and the routes are advertised correctly, but the configuration does not look right from a networking point of view.

Questions

I don't think there is much more to be done on the GCP side, and the issue seems to be purely on the on-prem side.

Are there any setup issues or concerns that may cause misbehaviour of VPN, BGP, ARP, etc.? What would cause a situation where routes appear to be shared, but the destinations cannot actually be reached?

Other Notes

I have confirmed the ARP table on Router includes 10.100.0.200
I can see the routes propagated in GCP
I have tested with GCP VPC's Firewall setup to allow 169.254.0.0/30 and 10.100.0.0/24
I will need access from GKE in the end, but I have confirmed GKE is getting the same exact behaviour as GCP_VM
Router is from Yamaha
Tried TCPdump (packetdump in Yamaha routers), but did not see 10.0.0.25 in the log
TCPdump did show the trace of 10.0.0.25 when I ran nmap -Pn 10.100.0.200 from GCP_VM, but with single line like this:

2019/12/21 16:35:40: LAN1 OUT:IP TCP 10.100.0.227:50516 > 10.103.24.1:80

Update (24th Dec)

I have done tcpdump for simple ping between GCP_VM and Router.

From GCP_VM to Router (logs from GCP_VM)

$ ping 10.100.0.50 > /dev/null &
$ sudo tcpdump -i eth0 | grep 10.100
...
18:49:18.696178 IP GCP_VM.(snip) > 10.100.0.50: ICMP echo request, id 32396, seq 0, length 64
18:49:18.700395 IP 10.100.0.50 > GCP_VM.(snip): ICMP echo reply, id 32396, seq 0, length 64

From Router to GCP_VM (logs from GCP_VM)

# ping from Router, with `ping 10.0.0.25`
$ sudo tcpdump -i eth0 | grep 169.254
...
18:40:18.554555 IP 169.254.0.2 > GCP_VM.(snip): ICMP echo request, id 3369, seq 0, length 72
18:40:18.554586 IP GCP_VM.(snip) > 169.254.0.2: ICMP echo reply, id 3369, seq 0, length 72

Although tcpdump shows the reply is being sent here, it is never received by Router.
Also, ping to 169.254.0.2 from GCP_VM gets no reply.

$ ping 169.254.0.2 > /dev/null &
$ sudo tcpdump -i eth0 | grep 169.254
...
18:59:07.113101 IP GCP_VM.(snip) > 169.254.0.2: ICMP echo request, id 32531, seq 0, length 64
18:59:08.137103 IP GCP_VM.(snip) > 169.254.0.2: ICMP echo request, id 32531, seq 1, length 64
...

Update (27th Dec)

Ping from the Router was successful after setting its source address to 10.100.0.50, as it was trying to use 169.254.0.2 by default.
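
A minimal sketch of why the source address could matter here, assuming that GCP only returns traffic towards prefixes it learned over BGP (the names `ADVERTISED_BY_ROUTER` and `reply_is_routable` are made up for illustration). A reply addressed to the link-local tunnel address matches no advertised prefix, while a reply addressed to 10.100.0.50 does:

```python
import ipaddress

# Prefix taken from the question: what Router advertises to GCP over BGP.
ADVERTISED_BY_ROUTER = [ipaddress.ip_network("10.100.0.0/24")]


def reply_is_routable(source_ip: str) -> bool:
    """Return True if a reply addressed to source_ip matches an advertised prefix."""
    addr = ipaddress.ip_address(source_ip)
    return any(addr in net for net in ADVERTISED_BY_ROUTER)


# Compare the default ping source (tunnel address) with the LAN address.
for src in ("169.254.0.2", "10.100.0.50"):
    addr = ipaddress.ip_address(src)
    kind = "link-local" if addr.is_link_local else "private LAN"
    print(f"{src} ({kind}) -> reply routable: {reply_is_routable(src)}")
```

This only models the route lookup; it says nothing about firewall rules or about how Cloud VPN treats the tunnel addresses internally.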

The ping still doesn't reach OP_VM, and I'm still working through a NAT configuration issue to make sure the translation happens correctly.

Update (31st Dec)

The connection has finally been set up. I'll summarise the steps taken in a separate answer to declutter the question.

Please add your solution as you mentioned in your last update. — Serhii Rohoza, Jan 13 '20 at 22:15
I have added further clarification and solution I had to put in place in an answer below. — Ryota, Jan 15 '20 at 13:06

Serhii Rohoza · Answer 1 · 2019-12-24T22:20:43.007

2

It looks like a routing problem on-prem. I think OP_VM doesn't have a route to 10.0.0.0/24, so it sends its replies to the default gateway, DC Router Gateway (10.100.0.80), where they are dropped because DC Router Gateway also doesn't have a route to 10.0.0.0/24 (your peering is at Router).

To solve it, you should set a static route on OP_VM to 10.0.0.0/24 via Router and keep DC Router Gateway as the default gateway.

You have to remove the route ip route 10.100.0.0/24 gateway 10.100.0.50 from Router: network 10.100.0.0/24 is directly connected to it.

EDIT

From GCP_VM, I can ping 10.100.0.50 (Router) successfully.

At this point it looks like you have properly configured peering between Router and HA_VPN.

To be on the right path, you should be able to ping GCP_VM and OP_VM from Router, and also Router from OP_VM.

With the Router setup of something like
 ip route 10.100.0.0/24 gateway 10.100.0.80
With the Router setup of something like
 ip route 10.100.0.0/24 gateway 10.100.0.80

You don't need these routes, because Router is directly connected to subnet 10.100.0.0/24 and has the IP address 10.100.0.50.

From GCP_VM, though, ping to 10.100.0.200 (OP_VM) goes missing.

This is expected: as I mentioned above, OP_VM and DC Router Gateway don't have a route to 10.0.0.0/24, so they can't reply. You have to set a static route on OP_VM to 10.0.0.0/24 via Router and keep DC Router Gateway as the default gateway.

EDIT 2

OP_VM sends its replies to DC Router Gateway because it doesn't have a route to 10.0.0.0/24 and tries to reach it via the default gateway. At DC Router Gateway they are dropped, because there is no route there either.

You should add a static route to 10.0.0.0/24 at OP_VM or at DC Router Gateway to solve it.
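
The effect of the suggested static route can be sketched with a small longest-prefix-match lookup. This is only an illustration: the contents of OP_VM's routing table are an assumption (an on-link LAN route plus a default route to DC Router Gateway), and `next_hop` is a hypothetical helper, not a real router or OS API.

```python
import ipaddress


def next_hop(routes, destination):
    """Longest-prefix match: return the gateway of the most specific matching route."""
    dest = ipaddress.ip_address(destination)
    matches = [(net, gw) for net, gw in routes if dest in net]
    if not matches:
        return None
    return max(matches, key=lambda item: item[0].prefixlen)[1]


# Assumed OP_VM table: on-link LAN plus a default route to DC Router Gateway.
op_vm_routes = [
    (ipaddress.ip_network("10.100.0.0/24"), "on-link"),
    (ipaddress.ip_network("0.0.0.0/0"), "10.100.0.80"),
]

# Without a specific route, replies to GCP_VM follow the default route.
before = next_hop(op_vm_routes, "10.0.0.25")

# Suggested fix: a static route for the GCP subnet via Router.
op_vm_routes.append((ipaddress.ip_network("10.0.0.0/24"), "10.100.0.50"))
after = next_hop(op_vm_routes, "10.0.0.25")

print("next hop before static route:", before)
print("next hop after static route: ", after)
```

With the extra /24 entry, the more specific route wins over the default route, so replies for 10.0.0.25 are handed to Router instead of DC Router Gateway.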

edited Dec 24 '19 at 22:20

answered Dec 23 '19 at 17:11

Serhii Rohoza


Thanks for the insight, I'm suspecting that scenario as well. `10.0.0.0/24` is advertised to the `Router` via BGP, and that should be sufficient for the routing. However, I am seeing the ping from `Router` to `10.0.0.25` failing, although the other way around works. There seems to be something missing in the `Router`, or potentially GCP FW setup. – Ryota Dec 24 '19 at 01:58
The firewall at GCP should accept the on-prem network `10.100.0.0/24` that you advertise via BGP (it must be in the routing table at `Router` and, after peering, at `HA_VPN`), and vice versa: the on-prem firewall should accept the cloud network `10.0.0.0/24` (network `10.0.0.0/24` must be in the routing table of `HA_VPN` and, after peering, at `Router`). You should be able to ping everything in the cloud from `Router` and vice versa. What's the default gateway of your `Router`? – Serhii Rohoza Dec 24 '19 at 06:05
FW is set on GCP to allow the traffic to/from `10.100.0.0/24`, and on-prem has the same for `10.0.0.0/24`. I have done the tcpdump on both ends, and it does look like the ICMP going through both routes and passing the FW correctly. I have updated the original question with some more details. But now it looks like the route from GCP to `169.254.0.2` isn't going through. At this point, I'm not sure if this has to do with the main issue of `GCP_VM` not being able to reach `10.100.0.200` though... – Ryota Dec 24 '19 at 19:07
If the firewalls are configured properly, check the routing tables at `Router` and `HA_VPN`. Why did you decide to use link-local IP addresses for peering? – Serhii Rohoza Dec 24 '19 at 21:18
Both routing tables show each other (GCP side with VPC Routes, `Router` with `ip route`). I used the link-local IP addresses by following the [GCP doc for HA VPN setup](https://cloud.google.com/vpn/docs/how-to/creating-ha-vpn). At this point, my hypothesis is that `OP_VM` (the one I don't have control over) rejects incoming packet from `10.0.0.25`, and ping from `Router` to `GCP_VM` is a separate problem (probably with GCP routes not understanding the link-local IP). As to the former problem, I probably need to NAT to use `10.100.0.0/24` range. I'm not sure about the latter. – Ryota Dec 24 '19 at 21:38
I think that `OP_VM` sends its replies to `DC Router Gateway` because it doesn't have a route to `10.0.0.0/24` and tries to reach it via the default gateway, and at `DC Router Gateway` they are dropped because there's no route there either. You should add a static route to `10.0.0.0/24` at `OP_VM` or at `DC Router Gateway` to solve it. – Serhii Rohoza Dec 24 '19 at 21:47
That may well be the case. Neither `OP_VM` nor `DC Router Gateway` are under my control, so I will need to have a third party vendor to look into them. While I wait for the help from that end, I will see if NAT is going to help. – Ryota Dec 24 '19 at 21:54
Well, NAT at `Router` could solve the problem, but static route, in my opinion, is the best way to do it. – Serhii Rohoza Dec 24 '19 at 22:16
Did you solve the problem? Please mark my post as "accepted" if my solution was helpful. – Serhii Rohoza Dec 27 '19 at 22:06
I have been working on this day in day out, but still haven't solved the problem yet. There is still a couple of pieces not working around NAT and routing, which is still a mystery. Your support was certainly helpful and I appreciate it, but I will hold off from accepting answer until the problem is fully resolved, as I'm also desperate for a proper solution. – Ryota Dec 27 '19 at 23:39

score 1 · Accepted Answer · answered Jan 15 '20 at 13:05

After much testing and debugging, I have resolved the connection issue between GCP and on-prem. Below are the steps taken, along with the considerations made while pinpointing the problem.

Analyse Traffic from Both Directions

What I had been missing was an analysis of the traffic in both directions. This means breaking down how each packet travels between source and destination, which gives a clear view of where the root cause could lie.

Packet from GCP to on-prem

1. Packet sent from GCP_VM (10.0.0.25) to HA_VPN (169.254.0.1/30)
2. Packet sent from GCP_VM (10.0.0.25) to Router (AS65002) (10.100.0.50)
3. Packet sent from GCP_VM (10.0.0.25) to DC Router Gateway (10.100.0.80)
4. Packet sent from GCP_VM (10.0.0.25) to OP_VM (10.100.0.200)
5. Packet sent from Router (AS65002) (10.100.0.50) to OP_VM (10.100.0.200)

Packets from on-prem to GCP (return route)

6. Packet sent from OP_VM (10.100.0.200) to Router (AS65002) (10.100.0.50)
7. Packet sent from OP_VM (10.100.0.200) to GCP_VM (10.0.0.25)
8. Packet sent from Router (AS65002) (10.100.0.50) to GCP_VM (10.0.0.25)

Given the above checkpoints, the status of each was as follows:

1. This was not tested, as the Cloud VPN endpoint (169.254.0.1/30) was not part of the routes in the VPC
2. I could confirm the ping hitting Router (AS65002) (10.100.0.50), and the response being returned (this means the corresponding #8 is also confirmed)
3. I could NOT confirm the ping hitting DC Router Gateway (10.100.0.80), as the ping did not receive a response
4. I could NOT confirm the ping hitting OP_VM (10.100.0.200), as the ping did not receive a response
5. I could confirm the ping hitting OP_VM (10.100.0.200), and the response being returned (this means the corresponding #6 is also confirmed)
6. As mentioned, #5 confirmed this traffic as well
7. No traffic matches this case
8. As mentioned, #2 confirmed this traffic as well

The below is the diagram to describe the situation

    GCP_VM     HA_VPN (AS65001)     Router (AS65002)   (DC Router Gateway)    OP_VM
 10.0.0.25                          10.100.0.50        10.100.0.80            10.100.0.200

1. NA
2.    +--------------------------------> OK (response returned)
3.    +----------------------------------------------------x NG?
4.    +---------------------------------------------------------------------------x NG?
5.                                      +-----------------------------------------> OK (response returned)
6.                                   OK <-----------------------------------------+
7. No matching traffic
8. OK <--------------------------------+

This shows that any traffic initiated from GCP_VM leaves two possible issues:

Possibility A. Packet is not reaching OP_VM (10.100.0.200)
Possibility B. Packet is reaching OP_VM (10.100.0.200), but response is not getting back to Router (AS65002) (10.100.0.50)

With #5 and #6 above, I could confirm that the packet does reach OP_VM (10.100.0.200) when initiated at Router (AS65002) (10.100.0.50). That makes Possibility A unlikely: the routing towards OP_VM is working as it should.

This meant there was a high chance that the ping response was lost and never made it back to Router (AS65002) (10.100.0.50). In my specific case, Possibility B was indeed the root cause of the problem.

I confirmed this by creating a mock network mimicking the same setup as above and using Wireshark to listen at each point. The diagram below shows what was actually happening.

    GCP_VM     HA_VPN (AS65001)     Router (AS65002)   (DC Router Gateway)    OP_VM
 10.0.0.25                          10.100.0.50        10.100.0.80            10.100.0.200

1. NA
2.    +--------------------------------> OK (response returned)
3.    +----------------------------------------------------> OK
4.    +---------------------------------------------------------------------------> OK
5.                                      +-----------------------------------------> OK (response returned)
6.                                   OK <-----------------------------------------+
7.               Where should I send packet to??  LOST ---------------------------+
8. OK <--------------------------------+
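
The asymmetry in the diagram can be reproduced with a toy hop-by-hop trace. This is a rough model rather than the real configuration: every routing table below is an assumption (in particular, that DC Router Gateway has no route to 10.0.0.0/24 and that Router delivers directly onto the LAN), and `OWNER`, `TABLES` and `trace` are hypothetical names.

```python
import ipaddress

GCP_NET = ipaddress.ip_network("10.0.0.0/24")
LAN_NET = ipaddress.ip_network("10.100.0.0/24")
DEFAULT = ipaddress.ip_network("0.0.0.0/0")

# Which node owns which address (addresses taken from the question).
OWNER = {
    "10.0.0.25": "GCP_VM",
    "10.100.0.50": "Router",
    "10.100.0.80": "DC_GW",
    "10.100.0.200": "OP_VM",
}

# Assumed tables: (prefix, next node); None means "deliver directly on-link".
TABLES = {
    "GCP_VM": [(LAN_NET, "HA_VPN")],
    "HA_VPN": [(LAN_NET, "Router"), (GCP_NET, None)],
    "Router": [(GCP_NET, "HA_VPN"), (LAN_NET, None)],
    "OP_VM": [(LAN_NET, None), (DEFAULT, "DC_GW")],
    "DC_GW": [(LAN_NET, None)],  # assumption: no route to the GCP subnet
}


def trace(src_ip, dst_ip, max_hops=8):
    """Follow longest-prefix-match routes from src to dst; return (path, result)."""
    node = OWNER[src_ip]
    dst = ipaddress.ip_address(dst_ip)
    path = [node]
    for _ in range(max_hops):
        if node == OWNER[dst_ip]:
            return path, "delivered"
        matches = [(net, nxt) for net, nxt in TABLES[node] if dst in net]
        if not matches:
            return path, "lost"
        _, nxt = max(matches, key=lambda m: m[0].prefixlen)
        node = OWNER[dst_ip] if nxt is None else nxt
        path.append(node)
    return path, "loop"


print(trace("10.0.0.25", "10.100.0.200"))    # request from GCP_VM
print(trace("10.100.0.200", "10.0.0.25"))    # reply from OP_VM
print(trace("10.100.0.200", "10.100.0.50"))  # reply to Router itself
```

Under these assumptions the request is delivered, the reply addressed to 10.0.0.25 is lost after leaving OP_VM, and a reply addressed to 10.100.0.50 gets back to Router, which is the behaviour rows 4, 6 and 7 describe.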

Solution

In my case, when the traffic is initiated at GCP_VM, the source IP is set to 10.0.0.25. This meant that when OP_VM tried to send the traffic back, it had no way to find where 10.0.0.25 is, and the packet was lost.

I had to add a static NAT entry at Router (AS65002) to map the source IP 10.0.0.25 to 10.100.0.50 when the packet leaves Router (AS65002), so that OP_VM can properly route the traffic back to Router (AS65002). When the response is received, NAT takes effect again: Router (AS65002) replaces 10.100.0.50 with 10.0.0.25 and sends the packet back to GCP_VM.
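
The translation described above can be sketched as follows. This is a toy model, not Yamaha router configuration: `Packet`, `RouterNat`, `to_lan` and `from_lan` are hypothetical names, and the session tracking (so that Router's own traffic is left untouched) is an assumption about how the reverse mapping is scoped.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Packet:
    src: str
    dst: str


class RouterNat:
    """Toy source NAT: rewrites one remote source address to the router's LAN address."""

    def __init__(self, original_src, translated_src):
        self.original_src = original_src
        self.translated_src = translated_src
        self.sessions = set()  # LAN peers that were contacted through the translation

    def to_lan(self, pkt):
        """Packet leaving Router towards the LAN: rewrite the source address."""
        if pkt.src == self.original_src:
            self.sessions.add(pkt.dst)
            return replace(pkt, src=self.translated_src)
        return pkt

    def from_lan(self, pkt):
        """Reply arriving from the LAN: restore the original destination."""
        if pkt.dst == self.translated_src and pkt.src in self.sessions:
            return replace(pkt, dst=self.original_src)
        return pkt


nat = RouterNat(original_src="10.0.0.25", translated_src="10.100.0.50")

# Request from GCP_VM as OP_VM sees it: the source is now Router's LAN address.
request = nat.to_lan(Packet(src="10.0.0.25", dst="10.100.0.200"))

# OP_VM replies to that source; Router maps the destination back to GCP_VM.
reply = nat.from_lan(Packet(src="10.100.0.200", dst=request.src))

print(request)
print(reply)
```

Because OP_VM only ever sees 10.100.0.50, which is on its own LAN, it needs no route to 10.0.0.0/24 for the reply to find its way back.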