I am running a Kotlin Spring Boot service in a Kubernetes cluster that connects to a PostgreSQL database. Each request makes around 3-5 database calls, which partially run in parallel via Kotlin coroutines (on a thread-pool-backed coroutine context).
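For context, the per-request database access looks roughly like this. It is a minimal, self-contained sketch, not our real code: the repository names, the pool size of 16 and the queries are made up for illustration.

```kotlin
import kotlinx.coroutines.asCoroutineDispatcher
import kotlinx.coroutines.async
import kotlinx.coroutines.coroutineScope
import kotlinx.coroutines.runBlocking
import java.util.concurrent.Executors

// Illustrative stand-ins for our real repositories (names are made up).
data class Customer(val id: Long)
data class Order(val id: Long)
class CustomerRepository { fun find(id: Long) = Customer(id) }            // blocking JDBC call in reality
class OrderRepository { fun findByCustomer(id: Long) = listOf(Order(1)) } // blocking JDBC call in reality

// Thread-pool-backed coroutine context the blocking DB calls run on.
// The pool size of 16 is illustrative, not our real setting.
private val dbDispatcher = Executors.newFixedThreadPool(16).asCoroutineDispatcher()

private val customers = CustomerRepository()
private val orders = OrderRepository()

// Roughly what one request does: a handful of queries, partially in parallel.
suspend fun handleRequest(id: Long): Pair<Customer, List<Order>> = coroutineScope {
    val customer = async(dbDispatcher) { customers.find(id) }
    val customerOrders = async(dbDispatcher) { orders.findByCustomer(id) }
    customer.await() to customerOrders.await()
}

fun main() = runBlocking {
    println(handleRequest(42))
    dbDispatcher.close()
}
```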
No matter the configuration, this service gets throttled heavily when it is hit by real traffic right after starting up. The slowness sometimes persists for 2-3 minutes and often affects only some of the fresh pods, not all of them.
I am looking for new avenues to analyze the problem - here is a succinct list of the circumstances and what I am already doing:
- The usual response time of my service is around 7-20ms while serving 300-400 requests / second per pod
- New / autoscaled instances warm themselves up by sending 15000 HTTP requests against themselves; the readiness probe does not report "up" before this process finishes (see the warmup sketch after this list)
- We currently set a CPU request and limit of 2000m; changing this to 3000m reduces the issue, but latency still spikes to around 300-400ms, which is not acceptable (at most 100ms would be great, 50ms ideal)
- The memory limit is set to 2 GB; raising it to 3 GB has no significant impact
- The pods allocate 200-300 MB/s during peak load; the GC activity does not look abnormal to me
- Switching between GCs (G1 and ZGC) has no impact
- We are experiencing pod throttling of around 25-50% (calculated via Kubernetes metrics) while the pod CPU usage is around 40-50%
- New pods struggle to handle 200-300 requests / sec even though we warm them up; curiously, some pods suffer for long periods. All external factors have been analyzed, and disabling most baggage has no impact (this includes testing with tracing and metric collection disabled, disabling the Kafka integration, and verifying that our database is not maxed out - it sits at around 20-30% CPU usage while its network and memory usage are far lower)
- The throttling is also observed in custom load tests that replicate the warmup requests described above
- Connecting with VisualVM during the load tests and checking where CPU time is spent reveals no striking issues
- This all runs on managed Kubernetes from AWS
- All nodes in our cluster are of the same type (AWS c5.2xlarge)
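For reference, the warmup mentioned above works roughly like the following sketch. This is not our actual code: the `/api/example` path, port 8080, and the combination of a plain `java.net.http.HttpClient` loop with a hand-rolled `/ready` endpoint are illustrative assumptions. The point is only that readiness flips to "up" after the self-requests complete.

```kotlin
import org.springframework.boot.ApplicationArguments
import org.springframework.boot.ApplicationRunner
import org.springframework.http.ResponseEntity
import org.springframework.stereotype.Component
import org.springframework.web.bind.annotation.GetMapping
import org.springframework.web.bind.annotation.RestController
import java.net.URI
import java.net.http.HttpClient
import java.net.http.HttpRequest
import java.net.http.HttpResponse
import java.util.concurrent.atomic.AtomicBoolean

// Flipped to true only once the self-warmup loop has finished;
// the Kubernetes readiness probe points at /ready below.
val warmedUp = AtomicBoolean(false)

@Component
class Warmup : ApplicationRunner {
    private val client = HttpClient.newHttpClient()

    // Runs after the embedded server is listening, but before the pod reports ready.
    override fun run(args: ApplicationArguments) {
        // Path, port and request count are placeholders for our real warmup target.
        val request = HttpRequest.newBuilder(URI.create("http://localhost:8080/api/example")).build()
        repeat(15_000) {
            client.send(request, HttpResponse.BodyHandlers.discarding())
        }
        warmedUp.set(true)
    }
}

@RestController
class ReadinessController {
    @GetMapping("/ready")
    fun ready(): ResponseEntity<String> =
        if (warmedUp.get()) ResponseEntity.ok("warmed up")
        else ResponseEntity.status(503).body("warming up")
}
```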
Any tools or avenues to investigate are appreciated - thank you! I am still puzzled as to why my service is getting throttled even though its CPU usage is well below 100%. Our nodes are also not affected by the old kernel CFS bug from before kernel 5.6 (I am not entirely sure in which version it was fixed, but our nodes' kernel version is very recent).
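For completeness, this is roughly how I cross-check the throttling counters and the CPU count the JVM sees from inside a pod, next to the Kubernetes metrics. It is a minimal sketch assuming cgroup v2 with a cgroup v1 fallback; the exact file locations can differ between node images.

```kotlin
import java.nio.file.Files
import java.nio.file.Path

fun main() {
    // How many CPUs the JVM believes it has - with a 2000m limit this is typically 2,
    // which also caps the default sizes of GC, JIT and common fork-join thread pools.
    println("availableProcessors = ${Runtime.getRuntime().availableProcessors()}")

    // cgroup v2 exposes throttling counters in cpu.stat (nr_periods, nr_throttled, throttled_usec);
    // cgroup v1 has similar counters (nr_throttled, throttled_time) under the cpu controller.
    val candidates = listOf(
        Path.of("/sys/fs/cgroup/cpu.stat"),            // cgroup v2
        Path.of("/sys/fs/cgroup/cpu/cpu.stat"),        // cgroup v1
        Path.of("/sys/fs/cgroup/cpu,cpuacct/cpu.stat") // cgroup v1, alternate mount name
    )
    val statFile = candidates.firstOrNull { Files.exists(it) }
    if (statFile == null) {
        println("no cpu.stat found - paths differ per node image")
        return
    }
    Files.readAllLines(statFile)
        .filter { it.startsWith("nr_") || it.startsWith("throttled") }
        .forEach(::println)
}
```

Sampling nr_throttled / throttled_usec every few seconds during a latency spike shows whether the CFS quota is really kicking in, independently of the Kubernetes metrics pipeline.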