I have a Java/Spring data transfer service that reads records in from a .csv file, parses them, collates them in memory, and then loads them into a database.

Each run parses a file that contains ~800k records.

The service is deployed in a Kubernetes container. Neither the production environment nor the application design is ideal for this workload, but I have to operate within some restrictions.

After two or three runs, the service crashes with an out-of-memory error. We are seeing long garbage collection times, and assume that garbage collection is not keeping pace.

We are using the G1 garbage collector, and I want to tune the collector to prioritize memory over speed. I don't care how efficient or fast the service is; it only has to perform this data transfer a few times.

What settings will accomplish this?

FerdTurgusen
  • This should not happen unless you have a memory leak. Use a profiler to investigate. VisualVM is a good start. – Thorbjørn Ravn Andersen Mar 09 '22 at 00:21
  • Depending on the kind of collation you actually do, it _may_ be faster to just load the CSV into the database and transform the data via some SQL statements. – maio290 Mar 09 '22 at 00:26
  • @ThorbjørnRavnAndersen No, you're right. I should have noted that this is happening within a Kubernetes container, and I'm not able to reproduce this on my local machine. It's possible that our production environment isn't suited for a container with these memory requirements, but I have limited ability to adjust the container environment. – FerdTurgusen Mar 09 '22 at 00:27
  • Then investigate there. Java Flight Recorder might be relevant too. – Thorbjørn Ravn Andersen Mar 09 '22 at 00:29
  • And the less you hint the JVM, the more freedom it has to make better choices, e.g. of the GC. – Thorbjørn Ravn Andersen Mar 09 '22 at 00:30
  • Could you please provide your Xmx configuration and pod memory limits? Maybe that is the reason you cannot reproduce it in your local environment. It can be the heap, but memory is also allocated for native code, JIT, classloaders... Also, the complete OOM error will be helpful, as there are many different OOM errors. – usuario Mar 09 '22 at 07:33

1 Answer

"We are seeing long garbage collection times, and assume that garbage collection is not keeping pace."

Long GC times are a symptom of the problem rather than its root cause. If the GC is simply not keeping up, that by itself should not cause OOMEs.

(It is possible that heavy use of finalizers, Reference objects, or similar makes it harder for the GC to keep up, but that is still a symptom. It seems unlikely that this is relevant in your use-case.)

My theory is that the real cause of the long collection times is that your heap is too small. As the heap gets close to full, the GC has to run more and more often and is able to reclaim less and less space each time. That leads to long collection times. Then, finally, you get an OOME: either you run out of heap space entirely, or you hit the GC overhead threshold.
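To test this theory, you could enable GC logging and watch how much space each full collection actually reclaims, and capture a heap dump when the OOME occurs. A minimal sketch of the launcher flags (the -Xmx value, dump path, and jar name are placeholders; -Xlog requires JDK 9 or later):

    java -Xmx2g \
         -Xlog:gc*:file=gc.log:time,uptime \
         -XX:+HeapDumpOnOutOfMemoryError \
         -XX:HeapDumpPath=/tmp/heap.hprof \
         -jar transfer-service.jar

If the log shows full collections running back to back while reclaiming almost nothing, the heap is too small (or something is leaking); the heap dump, opened in a tool like VisualVM or Eclipse MAT, will show you what is holding on to the memory.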

Another possibility is that your heap is too big for the available RAM ... and you are getting virtual memory thrashing.

In either case, simply tweaking the GC settings is not going to help. You need to identify the root cause of the problem before you can fix it.

My take is that either you have a memory leak, you don't have enough RAM, or there is a problem with your application's design.

On the design side, rather than reading / parsing the entire file as an in-memory data structure, use a streaming / event-based parser. Read records one at a time, process them and then discard them ... keeping as little information about them in memory as you can get away with. In other words, make the application less memory hungry.
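For illustration, here is a minimal sketch of that shape using plain JDBC batch inserts (the table name, column names, and batch size are hypothetical; you would substitute your real parsing and collation logic):

    import java.io.BufferedReader;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import javax.sql.DataSource;

    public class StreamingCsvLoader {

        private static final int BATCH_SIZE = 1_000;

        public void load(Path csvFile, DataSource dataSource) throws Exception {
            try (BufferedReader reader = Files.newBufferedReader(csvFile);
                 Connection conn = dataSource.getConnection();
                 PreparedStatement insert = conn.prepareStatement(
                         "INSERT INTO records (col_a, col_b) VALUES (?, ?)")) {

                conn.setAutoCommit(false);
                String line;
                int pending = 0;
                while ((line = reader.readLine()) != null) {
                    // Parse one record; nothing else from the file is in memory.
                    String[] fields = line.split(",");
                    insert.setString(1, fields[0]);
                    insert.setString(2, fields[1]);
                    insert.addBatch();

                    // Flush periodically so only BATCH_SIZE parsed records
                    // are ever retained at once.
                    if (++pending == BATCH_SIZE) {
                        insert.executeBatch();
                        conn.commit();
                        pending = 0;
                    }
                }
                insert.executeBatch();  // flush the final partial batch
                conn.commit();
            }
        }
    }

With this structure, at most BATCH_SIZE parsed records are reachable at any one time, so the heap requirement stays flat regardless of how many records the file contains. If the collation genuinely needs to see all ~800k records at once, consider pushing that step into the database and doing it with SQL, as suggested in the comments above.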

Stephen C