
I'm using the JVM for a scientific application. The first step in my process is to load a lot of data into little double[] arrays (48-element arrays for each node in a large graph). Long before I get to the point where I find out if I have enough memory to load all of them, Java slows down asymptotically, and jvisualvm tells me that this is because nearly all of the CPU time is spent in garbage collection:

[jvisualvm screenshots: the left plot shows the fraction of CPU time spent in garbage collection; the right plot shows heap size (orange) and used heap (blue) over time.]
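
For concreteness, the loading pattern is roughly the following sketch (nodeIds and fillFromInput are placeholders for my real graph traversal and parser, not actual names from my code):

val features = scala.collection.mutable.Map.empty[Long, Array[Double]]
for (nodeId <- nodeIds) {            // placeholder for the real graph traversal
  val values = new Array[Double](48) // one small, long-lived array per node
  fillFromInput(nodeId, values)      // placeholder for the real parsing code
  features(nodeId) = values          // kept alive for the whole analysis
}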

The first minute or so is fine: "used heap" (right plot) jumps up and down as it grows because some objects are temporary (I wrote this in Scala) and some objects are permanent. After that, however, the data-loading grinds to a halt because the garbage collector is apparently checking the same objects over and over (left plot). It must be expecting them to go out of scope, but I'm keeping them in scope because I want to use them for my analysis.

I know that the garbage collector puts objects in different generations, based on their likelihood of survival. The first generation contains objects that were recently created and are likely to die soon; later generations are progressively more likely to be long-lived. If my objects are wrongly in the first generation, is there any way to tell the garbage collector that they ought to be in a later generation? I know that I'll be keeping them; how can I tell the garbage collector?

Although I'd like these objects to be in a more permanent generation, PermGen would be too far: they will die eventually, after tens of minutes of processing. (I want to use this in a Hadoop reducer, which might work on a different chunk of data after this one without starting a new JVM.)

Note: I'm using the Sun HotSpot VM:

% java -version
java version "1.6.0_45"
Java(TM) SE Runtime Environment (build 1.6.0_45-b06)
Java HotSpot(TM) 64-Bit Server VM (build 20.45-b01, mixed mode)

Correction (to a previous edit): Changing the -Xmx does change the saturation point, but Java ignores the -Xmx command line argument if it is passed after the -jar argument, because everything after the jar file name is handed to the application's main method rather than to the JVM. That is, do

java -Xmx2048m -jar MyJarFile.jar

rather than

java -jar MyJarFile.jar -Xmx2048m

Because of this, I had been misdiagnosing the behavior with respect to the maximum heap; all of the answers pointing to the -Xmx flag are valid.

The saturation point I describe happens when the "heap size" (orange on right plot) reaches the chosen -Xmx limit, and the "heap size" is always about 1.6 times the "used heap" (blue on right plot) unless you explicitly set the size of the "Old" generation with -XX:NewRatio or -XX:OldSize. These also need to be before the -jar argument, and they provide a lot of control.
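
For example, a full command line combining these might look like the following (the values are illustrative starting points, not recommendations):

java -Xmx2048m -XX:NewRatio=3 -jar MyJarFile.jar

Here -XX:NewRatio=3 makes the old generation three times the size of the new generation; note that all of these flags come before -jar.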

Jim Pivarski
  • Can you not re-use the objects in any way? – exussum Sep 25 '13 at 15:48
  • "Re-use the objects"? How do you mean? I'm loading them into memory to perform an analysis on all of them. They represent different data. – Jim Pivarski Sep 25 '13 at 15:52
  • They may represent different data, but they are likely the same kind of object. a = 12 and b = 1534545 are different, but if I'm not using them at the same time, I could re-use a instead of creating a new b and having GC invoked on a. – exussum Sep 25 '13 at 16:00
  • I intend to use all of these objects at the same time (for k-means clustering in partitions to be determined from the structure of the graph), so they must all be loaded into memory at the same time. – Jim Pivarski Sep 25 '13 at 16:51

3 Answers

The GC should not be invoking itself in a spiral unless your heap is approaching saturation. You need to increase your maximum heap size (-Xmx); start with something approaching 2x your expected retention. You can also use the CMS collector, which can improve the situation with a large tenured set. You will likely need to tune your new generation manually as well, since the old generation should not need to be swept on a regular basis.
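
Combined, that might look something like the following sketch (the sizes are placeholders to be tuned against your actual retained set):

java -Xmx4g -XX:+UseConcMarkSweepGC -XX:NewSize=256m -XX:MaxNewSize=256m -jar MyJarFile.jar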

You can also consider using NIO direct ByteBuffers. While they are designed for more efficient I/O, they can be a reasonable choice for very long-lived, large blocks of memory, because their storage lives outside the garbage-collected heap.
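
A minimal Scala sketch of that idea, packing every node's 48 doubles into one direct buffer instead of many small arrays (numNodes and the two accessors are illustrative names, not an established API):

import java.nio.ByteBuffer

val doublesPerNode = 48
val numNodes = 1000000  // placeholder: the real node count goes here
// One direct (off-heap) buffer replaces numNodes small arrays;
// the garbage collector never scans its contents.
val doubles = ByteBuffer.allocateDirect(numNodes * doublesPerNode * 8).asDoubleBuffer

def put(node: Int, slot: Int, value: Double): Unit =
  doubles.put(node * doublesPerNode + slot, value)

def get(node: Int, slot: Int): Double =
  doubles.get(node * doublesPerNode + slot)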

Yann Ramin

I think you should check it using the VisualGC plugin of JVisualVM, so that you can see how the different generations are used. Based on the screenshots, it seems that the old generation is filled up (the heap is not completely full, yet the GC is working hard), so the GC is having a hard time freeing up memory. You should either increase the heap or tune the sizes of the generations with -XX:NewRatio, and you can try adjusting the tenuring threshold as well, to control when an object is considered "old".
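
For example (the values are illustrative only; -XX:+PrintTenuringDistribution prints object ages in the survivor spaces so you can watch when objects get promoted):

java -Xmx2048m -XX:NewRatio=3 -XX:MaxTenuringThreshold=4 -XX:+PrintTenuringDistribution -jar MyJarFile.jar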

Katona
  • This is in fact what was happening: the heap space was not full, but the "old" generation was. (The GC was correctly labeling my data as "old", but "old" was full. I don't understand, then, why the GC was working hard rather than failing.) The VisualGC plugin was very helpful for diagnosing the situation, and the `-XX:NewRatio` and its relatives (e.g. `-XX:OldSize`) are particularly useful for tuning the GC for a big in-memory calculation. Thanks! – Jim Pivarski Sep 25 '13 at 16:43

Objects aren't garbage collected while they are still referenced, so just keep a reference to the objects until you want them to be collected.
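
In Scala terms, that could be as simple as the following sketch (inputRemains and loadNext stand in for the real loading loop):

import scala.collection.mutable.ArrayBuffer

// As long as 'retained' itself stays reachable, none of the arrays
// added to it can be collected, however long the processing runs.
val retained = ArrayBuffer.empty[Array[Double]]
while (inputRemains) retained += loadNext()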

jhocking