I am developing a product feature which allows a user to publish a VIP from our product to their own infrastructure via BGP. To achieve this we run FRR on all control plane servers. FRR is configured to peer with the user's upstream router(s) as appropriate. The VIP is published via anycast and we expect the upstream router(s) to balance traffic across all servers with ECMP. This all works fine.

However, I am looking for a way to detect a hypothetical BGP misconfiguration during a rolling upgrade of the control plane. In this scenario we have 3 servers: A, B, and C. They are all running a service. They all have the service VIP 1.2.3.4 attached to a local dummy interface on the server. They are all publishing the VIP upstream. Upstream routers are balancing requests to 1.2.3.4 across all 3 servers.

A hypothetical configuration change is introduced which doesn't affect the service, but prevents it from being successfully published to the upstream router.

The change is applied on server A. Health checks on server A succeed: when run from server A, health checks to 1.2.3.4 are handled by the local service on the local interface. External health checks also succeed, but only because the service is still available on servers B and C. Upstream routers are no longer routing traffic to server A, but we don't detect this.

With all health checks passing, we roll out the configuration change to server B, and then to server C. However, as soon as we apply the configuration to server C, we remove the last upstream route and experience an outage.

I would like to detect the failure when the change is applied to server A and halt the rollout before it results in an outage. Bear in mind that we have no control over the make or configuration of the upstream routers, beyond requiring that they are configured to accept our routes. Is there any protocol which would allow me to independently verify that an upstream router has my route? Alternatively, is there any well-tested pattern for detecting failures like this in active-active services?

Matthew Booth

1 Answer

On a Juniper router, BGP sends a route back to the peer it was learned from (that peer will of course not install it in its routing table, since doing so could create a routing loop). So in that case a policy-triggered log could be used.

On a Cisco router, BGP does not send a route back to the peer it was learned from. In that case SNMP (or some other management-plane protocol) could be used instead.
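Whichever way the router's view is obtained, the rollout gate itself can be small. The following is only a sketch: `missing_publishers` and `may_continue_rollout` are hypothetical names, the addresses are documentation placeholders, and it assumes you can already collect (via SNMP, a read-back route, or something else) the set of next hops the upstream router currently holds for the VIP. How to collect them depends on the router and is deliberately left out.

```python
from typing import Iterable, Set


def missing_publishers(expected_servers: Iterable[str],
                       observed_next_hops: Iterable[str]) -> Set[str]:
    """Return the servers that should be publishing the VIP but that the
    upstream router does not list as a next hop for it."""
    return set(expected_servers) - set(observed_next_hops)


def may_continue_rollout(expected_servers: Iterable[str],
                         observed_next_hops: Iterable[str]) -> bool:
    """Allow the next rollout step only if every expected server is still
    visible upstream. This is stricter than an end-to-end health check,
    which passes as long as at least one server is still publishing."""
    return not missing_publishers(expected_servers, observed_next_hops)


# Hypothetical addresses for servers A, B and C.
servers = ["192.0.2.11", "192.0.2.12", "192.0.2.13"]

# After the bad change on server A, the router only lists B and C.
observed = ["192.0.2.12", "192.0.2.13"]

if not may_continue_rollout(servers, observed):
    print("halt rollout, not published:",
          sorted(missing_publishers(servers, observed)))
```

The point of comparing per-server next hops, rather than checking that the VIP is reachable, is that the check fails on server A instead of on server C.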

Andrew