
I have two APIs that are publicly exposed, say xyz.com/apiA and xyz.com/apiB.

Both APIs run Node and are Dockerized services deployed as individual pods in the same namespace of a Kubernetes cluster.

Now, apiA calls apiB internally as part of its code logic. apiA makes a POST call to apiB with a somewhat large payload in its request body. This POST request times out whenever the body payload is more than 30 kB.
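For context, a minimal sketch of the kind of call that fails, assuming apiA uses Node's built-in `fetch` (Node 18+). The URL and payload shape here are illustrative placeholders, not the actual application code.

```javascript
// Build a JSON payload of roughly `sizeKb` kilobytes of data.
function makePayload(sizeKb) {
  return { data: 'x'.repeat(sizeKb * 1024) };
}

// POST the payload to apiB; in this setup the request times out
// whenever the serialized body exceeds roughly 30 kB.
async function postToApiB(url, payload) {
  const res = await fetch(url, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(payload),
  });
  return res.status;
}

// Example (not executed here): postToApiB('https://xyz.com/apiB', makePayload(50));
// Log the serialized size of a 50 kB payload, well over the 30 kB threshold.
console.log(Buffer.byteLength(JSON.stringify(makePayload(50)), 'utf8'));
```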

We have checked the server logs, and that POST request does not appear in them.

The error shows a connection timeout to 20.xx.xx.xx, which is the public IP address of xyz.com.

I'm new to Kubernetes and would appreciate your help.

So far I have tried this, but it didn't help.

Please let me know if more information is needed.

Edit: kubectl client and server versions are 1.22.0.

ams
  • Have you tried decreasing the payload for testing purposes to see if that works without a timeout, so that we can think of a way to increase the call delay? – Chandra Sekar Aug 13 '21 at 09:13
  • Yes. As I've mentioned in the question, payloads below 30 kB work fine. For payloads of 50 kB, 70 kB, 100 kB and above, we're getting a timeout. – ams Aug 13 '21 at 09:25
  • If the timeout is happening at the API level of the application, I guess it needs to be checked at the app level to increase the timeout values for the app that is accepting the larger payload. – Chandra Sekar Aug 13 '21 at 09:32
  • We've set all the maximum possible values at the API level. Also, during debugging we found that apiB executes completely and successfully and sends a 200 response. However, this response isn't finding its way back to the originating pod. I think it's [this](https://github.com/kubernetes/kubernetes/issues/74839) issue, which was seen in version 1.14 and has reoccurred in version 1.22. – ams Aug 13 '21 at 11:53
  • Do you use a distributed tracing system like [Jaeger](https://github.com/jaegertracing/jaeger)? This may help with root cause analysis. Could you please provide a [reproducible example](https://stackoverflow.com/help/minimal-reproducible-example)? – matt_j Aug 13 '21 at 15:15

1 Answer


To update the kind folks who took the time to understand the problem and suggest solutions: the issue was due to bad routing. Internal APIs (apiB in the example above) should not be called using the full domain name xyz.com/apiB; instead, they can be reached directly through the cluster DNS name of the Service as

http://service_name.namespace.svc.cluster.local/apiB

This ensures internal calls are routed through Kubernetes DNS and don't have to go through the load balancer and nginx, which improves response time significantly.
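A minimal sketch of the fix on the apiA side. The Service name (`apib-service`) and namespace (`default`) are placeholder assumptions; substitute the actual Service name and namespace from your cluster.

```javascript
// Build the in-cluster URL for a Service via Kubernetes DNS
// (<service>.<namespace>.svc.cluster.local). Traffic to this name
// stays on the cluster network and skips the load balancer/ingress.
function internalUrl(serviceName, namespace, path) {
  return `http://${serviceName}.${namespace}.svc.cluster.local${path}`;
}

// Before: external route through the public load balancer and nginx.
const externalUrl = 'https://xyz.com/apiB';

// After: internal route, resolved by the cluster DNS (CoreDNS/kube-dns).
const clusterUrl = internalUrl('apib-service', 'default', '/apiB');
console.log(clusterUrl); // http://apib-service.default.svc.cluster.local/apiB
```

apiA would then POST to `clusterUrl` instead of the public hostname; nothing else in the request logic needs to change.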

Every call made to apiA was creating a domino effect, spawning hundreds of calls to apiB and overloading the server, which caused it to fail only after a few thousand requests.

Lesson learned: route all internal calls over the cluster's internal network.

ams