
Backstory First:

We have a deployment running that encounters intermittent 502s when we try to load test it with something like JMeter. It's a container that logs POST data to a MySQL DB on another container. It handles around 85 requests per second pretty well, with minimal to no errors in JMeter; however, once this number starts increasing, the error rate starts to increase too. The errors come back as 502 Bad Gateway responses in JMeter:

<html>
<head><title>502 Bad Gateway</title></head>
<body>
<center><h1>502 Bad Gateway</h1></center>
<hr><center>nginx</center>
</body>
</html>

Now the interesting - or rather confusing - part here is that this appears to be an NGINX error, yet we don't use NGINX for our ingress at all. It's all through IBM Cloud (Bluemix) etc.

We've deduced so far that these 502 errors occur when the request from JMeter that returns the error never actually hits our main.py script running on the container - there's no log of these errors at the pod level (using kubectl logs -n <namespace> deployment/<name>). Is there any way to intercept/catch errors for requests that basically don't make it into the pod, so we can at least control what message a client gets back in case of these failures?

krzychostal
  • I see here https://cloud.ibm.com/docs/containers?topic=containers-ingress-types that IBM Cloud k8s service uses the nginx ingress controller. Did you install your own ingress controller and use a custom ingress class? If not, then you are using nginx. Do you have any restarts on your pods? Maybe after 85 rps you hit memory or CPU limits; do you have liveness/readiness probes? You can get intermittent 502s if your service loses its backend pods. Do a describe on your service; does it have endpoints when you see the 502 in JMeter? – acristu Mar 10 '22 at 17:38
  • So, no restarts on the pods, no errors logged concerning 502s. When we run a describe there's also no indication of any errors. The only indication of drops is the HTTP Request in JMeter coming back with 502s. I'm not sure what you mean about "does it have endpoints" - like somewhere in the JMeter error response? – krzychostal Mar 10 '22 at 21:40
  • I've added more details in an answer to have enough room...hope it helps... – acristu Mar 11 '22 at 08:04

1 Answer


I assume the setup is Ingress --> Service --> Deployment. From here https://cloud.ibm.com/docs/containers?topic=containers-ingress-types I conclude you are using the nginx ingress controller, since there is no mention of a custom ingress controller/ingress class being used.
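
If you want to double-check which controller actually serves the traffic, something along these lines should show it (the kube-system namespace and the "alb" grep pattern are assumptions based on how IBM Cloud usually names its managed ALB pods):

    # list Ingress resources across namespaces; on recent clusters the CLASS column shows the ingress class in use
    kubectl get ingress -A
    # list the ingress classes registered in the cluster
    kubectl get ingressclass
    # IBM Cloud's managed ALB (nginx-based) pods usually run in kube-system
    kubectl get pods -n kube-system | grep alb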

The 502s appear only above 85 req/sec, so the Ingress/Service/Deployment k8s resources are configured correctly; there should be no need to re-check your service endpoints and ingress configuration for basic mistakes.

Below are some troubleshooting tips for intermittent 502 errors from the ingress controller:

  • the Pods may not cope with the increased load (this might not apply to you since 85 req/sec is pretty low, and you said kubectl get pods shows 0 RESTARTS, but it may be useful to others):
    • the pods hit memory/CPU limits if you have them configured; check for pod status OOMKilled, for example in kubectl get pods. Also do a kubectl describe on your pods/deployment/replicaset and check for any errors
    • the pods may not respond to the liveness probe, in which case they get restarted and you will see 502s; do a kubectl describe svc <your service> | grep Endpoints and check whether you have any Ready backend pods for your service (see the command sketch after this list)
    • the pods may not respond to the readiness probe, in which case they will not be eligible as backend pods for your Service; again, when you start seeing the 502s, check whether there are any Endpoints for the Service
  • Missing readiness probe: your pod will be considered Ready and become available as an Endpoint for your Service even though the application has not started yet. But this would mean getting 502s only at the beginning of your JMeter test... so I guess this does not apply to your use case
    • Are you scaling automatically? When the load increases, does another pod start, perhaps without a readiness probe?
  • Are you using Keep-Alive in JMeter? You may run out of file descriptors if you are creating too many connections. I wouldn't expect this to result in 502s, but it is still worth checking...
  • The ingress controller itself may not be able to handle the traffic (at 85 req/sec this is hard to imagine, but adding it for the sake of completeness):
    • If you have enough permissions, do a kubectl get ns and look for the namespace containing the ingress controller, ingress-nginx or something similar. Look for pod restarts or other events in that namespace (see the command sketch after this list).
  • If none of the above points helps, continue your investigation, try other things, and look for clues:
    • Try to better isolate the issue: use kubectl port-forward instead of going through ingress (see the port-forward sketch after this list). Can you inject more than 85 req/sec? If yes, then your Pods can handle the load and you have isolated the issue to the ingress controller.
    • Try to start more replicas of your Pods
    • Use the JMeter Throughput Shaping Timer plugin and increase the load gradually; by monitoring what happens to your Service and Pods as the load increases, you may find the exact trigger for the 502s and get more clues about the root cause.
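
As a rough sketch of the checks above (all the <...> names are placeholders for your own namespace, service, pod, and deployment):

    # any restarts or OOMKilled pods?
    kubectl get pods -n <namespace>
    kubectl describe pod <pod-name> -n <namespace>
    # does the Service still have Ready backend pods while the 502s are happening?
    kubectl describe svc <your-service> -n <namespace> | grep Endpoints
    # recent events in the namespace (probe failures, restarts, scheduling problems)
    kubectl get events -n <namespace> --sort-by=.lastTimestamp
    # find the namespace running the ingress controller, then check its pods for restarts
    kubectl get ns
    kubectl get pods -n <ingress-namespace>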
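
And a minimal port-forward setup for the isolation test in the last bullet (port 8080 and the deployment name are assumptions; point JMeter at localhost:8080 while the port-forward is running):

    # forward local port 8080 to one pod behind the deployment, bypassing ingress and the Service
    kubectl port-forward deployment/<your-deployment> 8080:8080 -n <namespace>
    # in another terminal, a quick smoke test before re-running the JMeter plan against localhost:8080
    curl -v -X POST -H "Content-Type: application/json" -d '{"key":"value"}' http://localhost:8080/

Note that kubectl port-forward only targets a single pod, so this measures what one pod can handle rather than the whole Service.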
acristu
  • Hey, thanks for all the suggestions. So when local-hosting into the pod, instead of intermittent 502s (since we're not connecting through ingress) we now get: org.apache.http.NoHttpResponseException: localhost:8080 failed to respond. This happens even when using "ENV WORKER_CLASS=uvicorn.workers.UvicornH11Worker" and monkey-patching gevent in the main.py code. – krzychostal Mar 14 '22 at 16:30
  • Try using a simple `curl` command to troubleshoot. By "local-hosting" do you mean `kubectl port-forward`? – acristu Mar 15 '22 at 17:46
  • Yes exactly, with port forwarding. It seems it eventually hits a limit of concurrent requests. It's still intermittent (let's say a failure rate of less than 1%), but a failure nonetheless. – krzychostal Mar 18 '22 at 16:47