GKE NEG Readiness Gate Failing with Windows Containers and Readiness Probe

Question

I can't get a health check to succeed for a .NET app running in an IIS container when using Container Native Load Balancing (CNLB).

I have a Network Endpoint Group (NEG) created by an Ingress resource definition in GKE, on a VPC-native cluster.

When I circumvent CNLB by either exposing the NodePort or making a service of type LoadBalancer, the site resolves without issue.

All the pod conditions from a describe look good: pod readiness

The network endpoints show up when running describe endpoints: ready addresses

This is the health check that is generated by the load balancer: GCP Health Check

When hitting these endpoints from other containers or VMs in the same VPC, /health.htm responds with a 200. Here's from a container in the same namespace, though I have reproduced this with a Linux VM, not in the cluster but in the same VPC: endpoint responds

But in spite of it all, the health check is reporting the pods in my NEG unhealthy: Unhealthy Endpoints

The Stackdriver logs confirm the requests are timing out, but I don't understand why the endpoints respond to other instances and not to the LB: Stackdriver Health Check Log

And I confirmed that GKE created what looks like the correct firewall rule that should allow traffic from the LB to the pods: firewall

Here is the YAML I'm working with:

Deployment:

apiVersion: apps/v1                                                  
kind: Deployment                                                     
metadata:                                                            
  labels:                                                            
    app: subdomain.domain.tld                                       
  name: subdomain-domain-tld                                       
  namespace: subdomain-domain-tld
spec:                                                                
  replicas: 3                                                        
  selector:                                                          
    matchLabels:                                                     
      app: subdomain.domain.tld                                     
  template:                                                          
    metadata:                                                        
      labels:                                                        
        app: subdomain.domain.tld
    spec:                                                            
      containers:                                                    
      - image: gcr.io/ourrepo/ourimage
        name: subdomain-domain-tld
        ports:                                                       
        - containerPort: 80                                          
        readinessProbe:                                              
          httpGet:                                                   
            path: /health.htm                                        
            port: 80                                                 
          initialDelaySeconds: 60                                    
          periodSeconds: 60                                          
          timeoutSeconds: 10                                         
        volumeMounts:                                                
        - mountPath: C:\some-secrets                                      
          name: some-secrets
      nodeSelector:                                                  
        kubernetes.io/os: windows                                    
      volumes:                                                       
      - name: some-secrets                                    
        secret:                                                      
          secretName: some-secrets

Service:

apiVersion: v1                                                       
kind: Service                                                        
metadata:                                                            
  labels:                                                            
    app: subdomain.domain.tld                                     
  name: subdomain-domain-tld-service
  namespace: subdomain-domain-tld
spec:                                                                
  ports:                                                             
  - port: 80                                                         
    targetPort: 80                                                   
  selector:                                                          
    app: subdomain.domain.tld                                       
  type: NodePort

The Ingress is extremely basic, as we have no real need for multiple routes on this site. However, I suspect that whatever issue we're having is here.

apiVersion: extensions/v1beta1                                       
kind: Ingress                                                        
metadata:                                                            
  annotations:                                                       
    kubernetes.io/ingress.class: gce
  labels:                                                            
    app: subdomain.domain.tld                                       
  name: subdomain-domain-tld-ingress
  namespace: subdomain-domain-tld
spec:                                                                
  backend:                                                           
    serviceName: subdomain-domain-tld-service
    servicePort: 80

One last, somewhat relevant detail: I tried the steps in this documentation and they worked, but the setup is not identical to mine, as it uses neither Windows containers nor readiness probes: https://cloud.google.com/kubernetes-engine/docs/how-to/container-native-load-balancing#using-pod-readiness-feedback

Any suggestions would be greatly appreciated. I've spent two days stuck on this and I'm sure it's obvious but I just can't see the problem.

Is it possible to switch to a Linux container? If so, we can give you a solution. — Abdennour TOUMI, Jul 14 '20 at 23:23
Are you allowing ingress/egress everywhere? All firewalls and Kubernetes network policies? Also allowing both on the cluster and to/from the load-balancer? — Rico, Jul 15 '20 at 01:39
Unfortunately I can't switch to a Linux container, as the app we're running is ASP.NET rather than .NET Core and we're unable to port it to .NET Core @AbdennourTOUMI — 210rain, Jul 15 '20 at 15:23
@Rico Yes, the cluster it's on is used purely for looking into the feasibility of running our ASP.NET sites in GKE, so I haven't configured any network policies. I've allowed all traffic on all ports to any instance in my VPC from 35.191.0.0/16 and 130.211.0.0/22, which are the IP ranges Google load balancers send traffic from per the documentation on this page: https://cloud.google.com/load-balancing/docs/health-checks I can also confirm there are no other firewall rules that would take priority and deny the traffic. — 210rain, Jul 15 '20 at 16:07
Must be some firewall rule somewhere. You can always check with GKE support. — Rico, Jul 15 '20 at 16:11

score 2 · Answer 1 · answered Jul 28 '20 at 19:35

Apparently it's not documented, but this functionality doesn't work with Windows containers at the time of writing. I was able to get in touch with a GCP engineer, who provided the following:

After further investigation, I have found that Windows containers using a LoadBalancer service work, but Windows containers using Ingress with NEGs are a limitation, so I have opened an internal case for updating the public documentation [1].

Since Ingress + NEG will not work (per the limitation), I suggest you use either of the options you mentioned: exposing the NodePort or making a service of type LoadBalancer.
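For reference, here is a minimal sketch of the LoadBalancer option, assuming the same names, labels, and ports as the Service in the question. The only change from the original manifest is the `type` field. This is an illustration of the workaround rather than a configuration confirmed by the engineer.

```yaml
apiVersion: v1
kind: Service
metadata:
  labels:
    app: subdomain.domain.tld
  name: subdomain-domain-tld-service
  namespace: subdomain-domain-tld
spec:
  # LoadBalancer instead of NodePort: traffic reaches the pods through
  # the nodes rather than through NEG endpoints created by an Ingress.
  type: LoadBalancer
  ports:
  - port: 80
    targetPort: 80
  selector:
    app: subdomain.domain.tld
```

With this Service in place the Ingress resource is no longer part of the traffic path, so the NEG health check does not come into play.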

score 0 · Answer 2 · answered Jun 15 '21 at 08:17

When you create an Ingress, the generated health check probes default to the same serving port and path as the app, in this case port 80 on path /.

It seems your app reports its health on port 80 but on the /health.htm path.

You will need to add a custom health check via the BackendConfig CRD. Have a look at this link [1]. The same page explains how to associate the BackendConfig with the Ingress.
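As a rough sketch of what that could look like, assuming the names and namespace from the question: the BackendConfig name `subdomain-domain-tld-backendconfig` and the interval and timeout values are made up for illustration, and the `cloud.google.com/v1` API version assumes a reasonably recent GKE release. Note that the BackendConfig is attached through an annotation on the Service that the Ingress points at.

```yaml
apiVersion: cloud.google.com/v1
kind: BackendConfig
metadata:
  name: subdomain-domain-tld-backendconfig   # hypothetical name
  namespace: subdomain-domain-tld
spec:
  healthCheck:
    type: HTTP
    requestPath: /health.htm   # path the app answers with a 200
    port: 80                   # container port when using NEGs
    checkIntervalSec: 15       # illustrative value
    timeoutSec: 10             # illustrative value, must not exceed the interval
---
apiVersion: v1
kind: Service
metadata:
  name: subdomain-domain-tld-service
  namespace: subdomain-domain-tld
  annotations:
    # Associates the BackendConfig above with every port of this Service.
    cloud.google.com/backend-config: '{"default": "subdomain-domain-tld-backendconfig"}'
spec:
  type: NodePort
  ports:
  - port: 80
    targetPort: 80
  selector:
    app: subdomain.domain.tld
```

This only changes what the load balancer probes. If the probes time out for another reason, a custom path will not fix it.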

What version of GKE are you on? It seems like an old version, judging from the Ingress API you use.

[1] https://cloud.google.com/kubernetes-engine/docs/how-to/ingress-features#direct_health

score 0 · Answer 3 · answered Mar 16 '22 at 05:54

You can refer to this GCP document. Note: This feature is not supported with Windows Server node pools.

Feature limitations

There are some Kubernetes features that are not yet supported for Windows Server containers. In addition, some features are Linux-specific and do not work for Windows. For the complete list of supported and unsupported Kubernetes features, see the Kubernetes documentation.

In addition to the unsupported Kubernetes features, there are some GKE features that are not supported.

For GKE clusters, the following features are not supported with Windows Server node pools:

- Cloud TPUs (--enable-tpu)
- Image streaming
- Ingress with Network Endpoint Groups
- Intranode visibility (--enable-intra-node-visibility)
- IP masquerade agent
- Kubernetes alpha cluster (--enable-kubernetes-alpha)
- Node Local DNS cache
- Private use of Class E IP addresses
- Private use of public IP addresses
- Network policy logging
- Kubernetes service.spec.sessionAffinity
- Spot VMs
- GPUs (--accelerator)

https://cloud.google.com/kubernetes-engine/docs/concepts/windows-server-gke
https://cloud.google.com/kubernetes-engine/docs/concepts/ingress#container-native_load_balancing
