
I am creating an Amazon EKS cluster using jenkins-x with:

jx create cluster eks -n demo --node-type=t3.xlarge --nodes=1 --nodes-max=5 --nodes-min=1 --skip-installation

After that, I add the cluster-autoscaler IAM policy for auto-discovery and add the required tags to the autoscaling group and the created instance, according to this guide.
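Roughly, the IAM policy attached to the node instance role looks like this (a sketch of what the guide prescribes; the auto-discovery tags themselves are the two tag keys the guide lists, applied to the autoscaling group):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "autoscaling:DescribeAutoScalingGroups",
        "autoscaling:DescribeAutoScalingInstances",
        "autoscaling:DescribeLaunchConfigurations",
        "autoscaling:DescribeTags",
        "autoscaling:SetDesiredCapacity",
        "autoscaling:TerminateInstanceInAutoScalingGroup"
      ],
      "Resource": "*"
    }
  ]
}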

I add the RBAC service accounts and cluster role bindings for Tiller and the autoscaler with this file (kubectl create -f rbac-config.yaml):

apiVersion: v1
kind: ServiceAccount
metadata:
  name: tiller
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: tiller
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin
subjects:
  - kind: ServiceAccount
    name: tiller
    namespace: kube-system
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: autoscaler
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: autoscaler
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin
subjects:
  - kind: ServiceAccount
    name: autoscaler
    namespace: kube-system

I install Tiller:

helm init --service-account tiller
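Before installing any charts, I wait for Tiller to come up (tiller-deploy is the deployment name helm init creates by default):

kubectl -n kube-system rollout status deployment/tiller-deploy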

Then I install the cluster autoscaler:

helm install stable/cluster-autoscaler -f cluster-autoscaler-values.yaml --name cluster-autoscaler --namespace kube-system
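The cluster-autoscaler-values.yaml is roughly the following (a sketch; the value names are those of the stable/cluster-autoscaler chart as I recall them, the cluster name matches the -n demo flag above, and the region is a placeholder):

autoDiscovery:
  clusterName: demo
awsRegion: eu-west-1           # placeholder, use the cluster's region
rbac:
  create: false
  serviceAccountName: autoscaler   # service account created in rbac-config.yaml above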

Then I install the jenkins-x system:

jx install --provider=eks --domain=mydomain.com --default-environment-prefix=demo --skip-setup-tiller

I just accept all the defaults on the questions (nginx-ingress is created for me).
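The nginx-ingress controller is exposed through a LoadBalancer service, which is what provisions the AWS load balancer behind the domains; it can be found with (the exact service name depends on the release name jx uses):

kubectl get svc --all-namespaces | grep LoadBalancer
kubectl -n kube-system describe svc <nginx-ingress-controller-service>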

Then I create a default spring-boot-rest-prometheus app:

jx create quickstart

again, accepting all the defaults. This works fine: the application is picked up by Jenkins and compiled, which I can see at:

http://jenkins.jx.mydomain.com

and I can reach the app through:

http://spring-boot-rest-prometheus.jx-staging.mydomain.com

Then I run a test to see if the autoscaler is working correctly, so I open charts/spring-boot-rest-prometheus/values.yaml and change replicaCount: 1 to replicaCount: 8. Commit and push. This kicks off the Jenkins pipeline and spins up a new node, because the autoscaler sees that there are not enough CPU resources on the first node.
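While the pipeline runs, I watch the nodes and the autoscaler to confirm the scale-up (the pod name is a placeholder; look it up first):

kubectl get nodes -w
kubectl -n kube-system get pods | grep autoscaler
kubectl -n kube-system logs <cluster-autoscaler-pod> --tail=50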

After the second node has come up, I can no longer reach Jenkins or the app via the domain names. So for some reason, my ingress is not working anymore.

I have played around with this a lot, manually changing the desired number of nodes directly in EC2: with an even number of nodes the domains are not reachable, and with an odd number of nodes they are.

I do not think this is related to the autoscaler, because scale-up and scale-down work fine, and the problem also occurs when I manually change the desired number of nodes.

What causes the ingress to fail for an even number of nodes? How can I investigate this issue further?

Logs and descriptors for all ingress parts are posted here.
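They were collected roughly like this (placeholders for the controller pod and service names in my cluster):

kubectl get ingress --all-namespaces
kubectl -n kube-system describe svc <nginx-ingress-controller-service>
kubectl -n kube-system logs <nginx-ingress-controller-pod>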

Martijn Burger
  • Odd! Pun intended :) `kubectl describe`'ing your ingress to see what's going on out there would be a good start. – Clorichel Dec 03 '18 at 18:08
  • Posted log from ingress controller and describe of all parts of the ingress controller here: https://gist.github.com/martijnburger/1e842e6c1d018044e4f01f9054f9bfb6 – Martijn Burger Dec 03 '18 at 22:38

2 Answers


You can debug this by looking at the AWS ASG (AutoScaling Group) and the load balancer (ELB) target instances.

You can see that the instances are being added to the ASG:

[screenshot: instances added to the ASG]

Then you can see in your load balancer that the instances are in service:

[screenshot: instances in service in the load balancer]

It could be that, with an even number of instances, some of them are not in service. Do they happen to be in a different availability zone? Are the 'odd'-numbered ones being removed from the ELB? Is traffic not being forwarded to them?
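A quick way to check both from the command line, assuming a classic ELB fronts the ingress service (for an NLB or ALB, use aws elbv2 describe-target-health with the target group ARN instead):

aws autoscaling describe-auto-scaling-groups --auto-scaling-group-names <asg-name>
aws elb describe-instance-health --load-balancer-name <elb-name>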

Rico
  • Ah! Triggered by the question 'Do they happen to be in a different availability zone?' Yes. They absolutely are. Is that a problem for the ELB? – Martijn Burger Dec 03 '18 at 23:49
  • Here is how to enable cross-zone load balancing. I'll look into whether this is the problem tomorrow. https://docs.aws.amazon.com/elasticloadbalancing/latest/network/network-load-balancers.html#cross-zone-load-balancing – Martijn Burger Dec 03 '18 at 23:55
  • It could only be that traffic is not being forwarded to the ones in a different availability zone; are they in service? It should be supported. Note that AWS charges an arm and a leg for traffic between availability zones. – Rico Dec 03 '18 at 23:56
  • I think I found it, just need to verify, but this bug report and open PR say it all I guess: https://github.com/kubernetes/ingress-nginx/issues/3254 https://github.com/kubernetes/kubernetes/pull/61064 – Martijn Burger Dec 04 '18 at 00:02
  • It's the nginx-ingress helm chart that is doing that for me. :) – Martijn Burger Dec 04 '18 at 00:08
  • Making the NLB multi-zone did not solve the problem. I do however see unhealthy targets after scaling to two nodes. After removing the ingress service and recreating it, the targets become healthy again, but that also needs an update of the CNAME in Route 53. I am getting closer, but still no white smoke. – Martijn Burger Dec 04 '18 at 14:00
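For reference, the cross-zone setting discussed in these comments can be applied to the ingress controller's Service through an annotation, e.g. via the nginx-ingress chart values (a sketch, assuming the chart's controller.service.annotations value; the annotation only takes effect on Kubernetes versions that include the linked PR):

controller:
  service:
    annotations:
      service.beta.kubernetes.io/aws-load-balancer-cross-zone-load-balancing-enabled: "true"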

FWIW, I seem to have run into this issue:

https://github.com/kubernetes/kubernetes/issues/64148

Still checking with AWS Support whether that is also the case for EKS, but it seems very plausible.

Martijn Burger