
We are testing out the Ambassador Edge Stack and started with a brand-new private GKE cluster in Autopilot mode.

We installed from scratch, following the quick start tour to get a feel for it, and ended up with the following error:

Error from server: error when creating "mapping-test.yaml": conversion webhook for getambassador.io/v3alpha1, Kind=Mapping failed: Post "https://emissary-apiext.emissary-system.svc:443/webhooks/crd-convert?timeout=30s": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

We ran a few rounds of DNS testing and deployed test pods in several namespaces to validate that kube-dns is working properly. Everything looks good on that end, and resolv.conf looks good too.

Ambassador is using the hostname emissary-apiext.emissary-system.svc:443 (without cluster.local), which should resolve fine. A lookup with the FQDN (with cluster.local) works fine, by the way.
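
To illustrate why the short name should resolve from inside a pod, here is a rough sketch of how a resolver expands a name using the `search` list and `ndots` option from resolv.conf. The search list below is an assumption (the usual defaults for a pod in the `default` namespace, with `ndots:5`); the function name is made up for illustration, and real GKE pods may carry additional search domains.

```python
def candidate_names(name, search, ndots=5):
    """Return the names a resolver would try, in order, for `name`.

    A sketch of the standard search-list behavior: names with fewer
    than `ndots` dots are tried with each search domain first, then as-is.
    """
    if name.endswith("."):
        # A trailing dot marks the name as fully qualified already.
        return [name]
    expanded = [f"{name}.{domain}" for domain in search]
    if name.count(".") >= ndots:
        return [name] + expanded
    return expanded + [name]


# Assumed search list for a pod in the "default" namespace.
SEARCH = ["default.svc.cluster.local", "svc.cluster.local", "cluster.local"]

for candidate in candidate_names("emissary-apiext.emissary-system.svc", SEARCH):
    print(candidate)
```

Under these assumptions the third candidate is the FQDN emissary-apiext.emissary-system.svc.cluster.local, which is the name that resolved in our tests.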

Any clues?

Thanks a lot and take care.

Sebastian

2 Answers


I think I found the solution. Posting it here in case someone comes across this later on.

I followed this to deploy Ambassador Edge Stack in an Autopilot private cluster. I was getting the same error when I tried to deploy the Mapping object (step 2.2).

The issue is that the control plane (API server) is trying to call emissary-apiext.emissary-system.svc:443, but the pods behind that Service are listening on port 8443 (I figured that out by describing the Service).
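
A minimal sketch of that check, assuming the Service has been dumped to a dict (for example from its JSON output). The sample Service is a hypothetical stand-in that uses the port numbers described above, not the actual Ambassador manifest, and the function name is made up for illustration.

```python
def port_mismatches(service):
    """Return (port, targetPort) pairs where the Service port differs
    from the port the pods actually listen on."""
    mismatches = []
    for entry in service["spec"]["ports"]:
        # When targetPort is omitted, Kubernetes uses the value of port.
        # A named targetPort (a string) is also reported here, since it
        # has to be looked up on the pod to learn the real number.
        target = entry.get("targetPort", entry["port"])
        if target != entry["port"]:
            mismatches.append((entry["port"], target))
    return mismatches


# Hypothetical sample mirroring what describing the Service showed.
SAMPLE_SERVICE = {
    "metadata": {"name": "emissary-apiext", "namespace": "emissary-system"},
    "spec": {"ports": [{"name": "https", "port": 443, "targetPort": 8443}]},
}

print(port_mismatches(SAMPLE_SERVICE))  # [(443, 8443)]
```

As far as I understand, the control plane of a private cluster connects to the pod port rather than the Service port, so the target port is the one the firewall has to allow.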

So I updated the firewall rule that allows the GKE control plane to talk to the nodes so that it also covers port 8443.

The firewall rule in question is called gke-gke-ap-xxxxx-master. The xxxxx part is the cluster hash and is different for each cluster. To make sure you are editing the right rule, double-check that its source IP range matches the "Control plane address range" on the cluster details page and that its name ends with master.

Just edit that rule and add 8443 to the TCP ports. It should work after that.
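
The selection logic above can be sketched in a few lines of Python, assuming the firewall rules have been exported as a list of dicts with the `name`, `sourceRanges` and `allowed` fields of the Compute Engine firewall resource. The rule names, the CIDR and the existing port list are hypothetical sample data, and the function names are made up for illustration.

```python
def find_master_rule(rules, control_plane_cidr):
    """Return the one rule whose name ends with '-master' and whose
    source range matches the control plane address range."""
    matches = [
        rule for rule in rules
        if rule["name"].endswith("-master")
        and control_plane_cidr in rule.get("sourceRanges", [])
    ]
    if len(matches) != 1:
        raise LookupError(f"expected exactly one matching rule, found {len(matches)}")
    return matches[0]


def tcp_ports_with(rule, port):
    """Return the rule's TCP ports with `port` appended if it is missing."""
    ports = []
    for entry in rule.get("allowed", []):
        if entry.get("IPProtocol") == "tcp":
            ports.extend(entry.get("ports", []))
    if str(port) not in ports:
        ports.append(str(port))
    return ports


# Hypothetical sample data; real names, ranges and ports differ per cluster.
SAMPLE_RULES = [
    {"name": "gke-gke-ap-xxxxx-vms", "sourceRanges": ["10.0.0.0/8"],
     "allowed": [{"IPProtocol": "tcp", "ports": ["1-65535"]}]},
    {"name": "gke-gke-ap-xxxxx-master", "sourceRanges": ["172.16.0.0/28"],
     "allowed": [{"IPProtocol": "tcp", "ports": ["10250", "443"]}]},
]

rule = find_master_rule(SAMPLE_RULES, "172.16.0.0/28")
print(rule["name"], tcp_ports_with(rule, 8443))
```

The sketch only computes the new port list; it does not change anything in the project. Applying the result is still a manual edit of the rule.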

boredabdel

That sounds like an issue related to the webhook limitations in GKE Autopilot.

Which version of GKE are you on?

There is also a limitation on which resources and namespaces we allow webhooks to intercept:

Additionally, webhooks which specify one or more of following resources (and any of their sub-resources) in the rules, will be rejected:

  • group: "" resource: nodes
  • group: "" resource: persistentvolumes
  • group: certificates.k8s.io resource: certificatesigningrequests
  • group: authentication.k8s.io resource: tokenreviews

You probably have to check the manifests of Ambassador Edge Stack to figure this out.
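
If it helps, here is a rough sketch of such a check, assuming the webhook configuration from the manifest has been loaded into a dict with the usual `webhooks[].rules[]` layout (`apiGroups` and `resources` lists). The sample configuration and the function name are hypothetical, and wildcard entries (`*`) are not handled.

```python
# Resources from the list above, as (apiGroup, resource) pairs.
RESTRICTED = {
    ("", "nodes"),
    ("", "persistentvolumes"),
    ("certificates.k8s.io", "certificatesigningrequests"),
    ("authentication.k8s.io", "tokenreviews"),
}


def restricted_matches(webhook_config):
    """Return (webhook name, apiGroup, resource) for every rule entry
    that targets a restricted resource or one of its sub-resources."""
    hits = []
    for webhook in webhook_config.get("webhooks", []):
        for rule in webhook.get("rules", []):
            for group in rule.get("apiGroups", []):
                for resource in rule.get("resources", []):
                    base = resource.split("/")[0]  # "nodes/status" -> "nodes"
                    if (group, base) in RESTRICTED:
                        hits.append((webhook.get("name"), group, resource))
    return hits


# Hypothetical sample; not taken from the Ambassador manifests.
SAMPLE_CONFIG = {
    "webhooks": [
        {
            "name": "example.hypothetical.io",
            "rules": [{"apiGroups": [""], "resources": ["pods", "nodes/status"]}],
        }
    ]
}

print(restricted_matches(SAMPLE_CONFIG))  # [('example.hypothetical.io', '', 'nodes/status')]
```

An empty result for the real manifests would suggest that this particular limitation is not the cause.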

boredabdel
  • Thanks @boredabdel, I had never heard about these limitations, to be honest. The version is `1.21.5-gke.1302`. I will also check the manifest to get more details and come back to you once I've checked it. – Sebastian Jan 05 '22 at 17:58
  • Hey @boredabdel, just did a quick test with a new standard cluster and it worked out well. I'm not super familiar with CRDs and did a quick search in their manifest `https://app.getambassador.io/yaml/edge-stack/latest/aes-crds.yaml` for those api-groups without any luck. Any further ideas? :) – Sebastian Jan 05 '22 at 18:28
  • Hey @boredabdel, seems you're right – at least I did a successful install on a classic cluster. I also had contact with the ambassador dev team and it seems that they aren't aware of that issue. I will file a bug with them as soon as I know what is going on there. – Sebastian Jan 25 '22 at 21:08