
We are running 3 replicas of a Java/Spring Boot microservice in a Docker/Kubernetes environment. After days of running without any issues, one of the 3 becomes very slow due to extreme CPU usage and very long GC pauses. One specific odd metric that caught my eye was that the JVM (for whatever reason) decided to shrink the Eden space and enlarge the old generation (by a factor of approximately 6).

Even after restarting the pod/container, the odd Eden/old gen ratio and the extreme CPU usage reappear. The three replicas serve the same requests, yet only one of the three shows this behavior.

What could cause such behavior? (The JVM args are -Xmx5900m -Xms5g.)

(In this screenshot the issue occurs between 14:15 and 14:20.)
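For context, a minimal sketch (not part of our service; the class name is just for illustration and the pool names assume G1) of how the used and committed sizes behind these charts can be read from inside the JVM via the standard `MemoryPoolMXBean` API:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

// Illustrative only: prints used/committed/max for every memory pool.
// With G1 the interesting names are "G1 Eden Space", "G1 Survivor Space" and "G1 Old Gen".
public class HeapPoolLogger {
    public static void main(String[] args) {
        long mb = 1024 * 1024;
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage u = pool.getUsage();
            // getMax() returns -1 when the pool has no defined maximum
            System.out.printf("%-22s used=%5d MB  committed=%5d MB  max=%5d MB%n",
                    pool.getName(), u.getUsed() / mb, u.getCommitted() / mb, u.getMax() / mb);
        }
    }
}
```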

Paul7979
  • Resizing regions is normal. When your application generates little garbage (or it can be reclaimed really fast) and `MaxGCPauseMillis` can be met with fewer young regions, the young generation will be shrunk. You can look at the GC logs (an example [here](https://stackoverflow.com/questions/60567878/how-to-properly-read-some-parts-of-the-logging-from-gc-with-xlogheap-debug)) and see that yourself, I guess. There are options to disable this (one possible set of flags is sketched after this thread). – Eugene Jun 01 '21 at 18:22
  • What puzzles me: this application is under full load during the day when the resizing suddenly happens, yet during night hours, when almost no load is generated, it doesn't happen. – Paul7979 Jun 01 '21 at 18:28
  • You said `k8s`; what are the values of `requests` and `limits`? Also, you need to look at the GC logs, they might be able to tell a lot more about what is going on. – Eugene Jun 01 '21 at 18:42
  • Resource requests and limits are set to match the heap size + overhead. What exactly should I look out for in the GC logs? – Paul7979 Jun 01 '21 at 20:03
  • How can we know what happens to *your* service if we see no code, no GC logs, not even a JVM version? – apangin Jun 01 '21 at 21:35
  • I don't think it'd be helpful if I posted our 120k-line service, sorry. Java version: AdoptOpenJDK jre-11.0.7_10-alpine. Unfortunately I don't have GC logs for that time period. I wanted to know what to look out for when something like this happens. (I know that nobody will be able to provide an easy fix, but maybe a direction to look at.) – Paul7979 Jun 02 '21 at 05:46
  • Your question sounds like this is a recurring event, so if you don't have GC logs yet, you can turn on logging *now*. Besides that, it's striking that the *used* size of the old generation is constantly growing, from 600 MB to 2500 MB within half an hour, for an application you say has been running for days; that growth is likely the cause of the region resizing. It's not the resizing you should worry about. – Holger Jun 02 '21 at 08:04
  • Yes, I turned it on now (one possible flag set is sketched below) and will come back with further information. Yeah, that growth caught my eye too, but it happens because the old gen committed space is so huge after the resizing that no major GC occurs. Looking at the old gen space prior to the resizing, it stayed between 400 and 800 MB for 2 days. – Paul7979 Jun 02 '21 at 09:17
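For reference, a possible launch line for the logging Paul mentions turning on (a sketch only: the log path, the percentage values, and `service.jar` are placeholders, and it assumes G1 on JDK 11 with unified logging). `-Xlog:gc*,gc+heap=debug` records the collections and the heap-sizing activity the comments refer to; the experimental `G1NewSizePercent`/`G1MaxNewSizePercent` options are one way to bound how far G1 shrinks the young generation, along the lines of Eugene's remark that the resizing can be disabled.

```
# Sketch only: the log path, the percentages, and service.jar are placeholders.
# -Xlog:gc*,gc+heap=debug          -> full GC log plus heap-sizing details (JDK 9+ unified logging)
# G1NewSizePercent/G1MaxNewSizePercent -> experimental bounds on the young-generation size
java -Xms5g -Xmx5900m \
     -Xlog:gc*,gc+heap=debug:file=/var/log/app/gc.log:time,uptime,level,tags \
     -XX:+UnlockExperimentalVMOptions \
     -XX:G1NewSizePercent=20 -XX:G1MaxNewSizePercent=60 \
     -jar service.jar
```

Comparing such logs between a healthy replica and the misbehaving one should show whether the steady old-generation growth Holger points out precedes the resizing.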

0 Answers