2

We recently made a change to our application's persistence mechanism, replacing a slow serialization via reflection to create strings of SQL statements that were executed against a MySQL database with serialization via OpenHFT Chronicle, with hard-coded appends of the various object's fields to the excerpt. As expected this decreased the amount of time it took to persist the transaction substantially and reduced our latency. But we noticed something puzzling about the garbage collection pattern before and after the change. While the we were generating less garbage, the amount of time each garbage collection took went way up, and we're having trouble seeing why.

Here's an example of two runs, before and after the change, with the same VM args running with Java 1.8.0_74 on RedHat 4.4.7CentOS 6.7. The "After" run appears to consume less young generation, and less total at the start of each GC, approximately the same heap at the end of the GC, but takes much more time to actually run the GC, and to run much less frequently (about 300s between GCs instead of 70-80s).

We're trying to figure out how this is happening, and whether there's anything we can do to get the GC pauses back down to where they were when we we generating garbage a lot faster. Is there something about the Chronicle and the way it uses off-heap memory-mapped files that causes this? It seems unlikely, but it seems to be implicated unless it really is just something about the pattern of how the garbage is being generated is causing it. FWIW, as an experiment we tried both G1 and CMS and, at least with default params, they show the same pattern and if anything worse times.

VM args:

-Xms75776M
-Xmx75776M
-XX:+PrintGCDetails
-XX:+PrintGCTimeStamps
-Dsun.rmi.dgc.server.gcInterval=54000000
-Dsun.rmi.dgc.client.gcinterval=54000000
-XX:+DisableExplicitGC
-XX:MaxNewSize=19000M
-XX:+PrintGCApplicationStoppedTime
-XX:InitialSurvivorRatio=3

Before:

15009.601: [GC (Allocation Failure) [PSYoungGen: 19226640K->142320K(19273216K)] 22796282K->3814635K(77411840K), 0.0596127 secs] [Times: user=0.53 sys=0.05, real=0.06 secs]
15080.410: [GC (Allocation Failure) [PSYoungGen: 19235824K->137920K(19282432K)] 22908139K->3913367K(77421056K), 0.0570436 secs] [Times: user=0.50 sys=0.06, real=0.06 secs]
15154.485: [GC (Allocation Failure) [PSYoungGen: 19243712K->129840K(19279360K)] 23019159K->4007451K(77417984K), 0.0605860 secs] [Times: user=0.55 sys=0.05, real=0.06 secs]
.
.
bunch of lines omitted to get to approximately the same end-point
.
16070.663: [GC (Allocation Failure) [PSYoungGen: 19195664K->146464K(19262976K)] 24408408K->5460932K(77401600K), 0.0616580 secs] [Times: user=0.55 sys=0.06, real=0.06 secs]

After:

15313.256: [GC (Allocation Failure) [PSYoungGen: 15092194K->2230066K(15050752K)] 16440428K->3981360K(73189376K), 0.3563350 secs] [Times: user=3.40 sys=0.15, real=0.36 secs]
15770.413: [GC (Allocation Failure) [PSYoungGen: 15050546K->2922738K(15321600K)] 16801840K->4674033K(73460224K), 0.3584541 secs] [Times: user=3.56 sys=0.00, real=0.36 secs]
16090.016: [GC (Allocation Failure) [PSYoungGen: 14896370K->3571346K(15244288K)] 16647665K->5322641K(73382912K), 0.4276084 secs] [Times: user=4.25 sys=0.00, real=0.42 secs]

Update: So I performed the following experiment based on some of the suggestions below (this is on a UAT machine, not the PROD machine, so it's a little different... e.g. the linux version is CentOS 7.2 instead of 6.7). We limit the size of the heap to 32G and set an explicit GC pause goal of 200ms, as well as turning on more diagnostic info. If anything the results are even worse. It starts out quite reasonable but blows right by the suggested pause time up to 1.94 seconds.

java -XX:MaxGCPauseMillis=200 -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintHeapAtGC -XX:+PrintTenuringDistribution -Dsun.rmi.dgc.server.gcInterval=54000000 -Dsun.rmi.dgc.client.gcinterval=54000000 -XX:+DisableExplicitGC -Xms32000M -Xmx32000M


{Heap before GC invocations=1 (full 0):
 PSYoungGen      total 9557504K, used 4096257K [0x00007f58fe580000, 0x00007f5b99000000, 0x00007f5b99000000)
  eden space 8192512K, 50% used  [0x00007f58fe580000,0x00007f59f85c0458,0x00007f5af2600000)
  from space 1364992K, 0% used [0x00007f5b45b00000,0x00007f5b45b00000,0x00007f5b99000000)
  to   space 1364992K, 0% used [0x00007f5af2600000,0x00007f5af2600000,0x00007f5b45b00000)
 ParOldGen       total 21845504K, used 0K [0x00007f53c9000000, 0x00007f58fe580000, 0x00007f58fe580000)
  object space 21845504K, 0% used [0x00007f53c9000000,0x00007f53c9000000,0x00007f58fe580000)
 Metaspace       used 20877K, capacity 21118K, committed 21296K, reserved 1069056K
  class space    used 2499K, capacity 2599K, committed 2608K, reserved 1048576K
2.420: [GC (Metadata GC Threshold) 
Desired survivor size 1397751808 bytes, new threshold 7 (max 15)
[PSYoungGen: 4096257K->13270K(9557504K)] 4096257K->13366K(31403008K), 0.0181755 secs] [Times: user=0.11 sys=0.05, real=0.02 secs] 
Heap after GC invocations=1 (full 0):
 PSYoungGen      total 9557504K, used 13270K [0x00007f58fe580000, 0x00007f5b99000000, 0x00007f5b99000000)
  eden space 8192512K, 0% used [0x00007f58fe580000,0x00007f58fe580000,0x00007f5af2600000)
  from space 1364992K, 0% used [0x00007f5af2600000,0x00007f5af32f59a8,0x00007f5b45b00000)
  to   space 1364992K, 0% used [0x00007f5b45b00000,0x00007f5b45b00000,0x00007f5b99000000)
 ParOldGen       total 21845504K, used 96K [0x00007f53c9000000, 0x00007f58fe580000, 0x00007f58fe580000)
  object space 21845504K, 0% used [0x00007f53c9000000,0x00007f53c9018010,0x00007f58fe580000)
 Metaspace       used 20877K, capacity 21118K, committed 21296K, reserved 1069056K
  class space    used 2499K, capacity 2599K, committed 2608K, reserved 1048576K
}
{Heap before GC invocations=2 (full 1):
 PSYoungGen      total 9557504K, used 13270K [0x00007f58fe580000, 0x00007f5b99000000, 0x00007f5b99000000)
  eden space 8192512K, 0% used [0x00007f58fe580000,0x00007f58fe580000,0x00007f5af2600000)
  from space 1364992K, 0% used [0x00007f5af2600000,0x00007f5af32f59a8,0x00007f5b45b00000)
  to   space 1364992K, 0% used [0x00007f5b45b00000,0x00007f5b45b00000,0x00007f5b99000000)
 ParOldGen       total 21845504K, used 96K [0x00007f53c9000000, 0x00007f58fe580000, 0x00007f58fe580000)
  object space 21845504K, 0% used [0x00007f53c9000000,0x00007f53c9018010,0x00007f58fe580000)
 Metaspace       used 20877K, capacity 21118K, committed 21296K, reserved 1069056K
  class space    used 2499K, capacity 2599K, committed 2608K, reserved 1048576K
2.438: [Full GC (Metadata GC Threshold) [PSYoungGen: 13270K->0K(9557504K)] [ParOldGen: 96K->12747K(21845504K)] 13366K->12747K(31403008K), [Metaspace: 20877K->20877K(1069056K)], 0.0530698 secs] [Times: user=0.27 sys=0.05, real=0.05 secs] 
Heap after GC invocations=2 (full 1):
 PSYoungGen      total 9557504K, used 0K [0x00007f58fe580000, 0x00007f5b99000000, 0x00007f5b99000000)
  eden space 8192512K, 0% used [0x00007f58fe580000,0x00007f58fe580000,0x00007f5af2600000)
  from space 1364992K, 0% used [0x00007f5af2600000,0x00007f5af2600000,0x00007f5b45b00000)
  to   space 1364992K, 0% used [0x00007f5b45b00000,0x00007f5b45b00000,0x00007f5b99000000)
 ParOldGen       total 21845504K, used 12747K [0x00007f53c9000000, 0x00007f58fe580000, 0x00007f58fe580000)
  object space 21845504K, 0% used [0x00007f53c9000000,0x00007f53c9c72e90,0x00007f58fe580000)
 Metaspace       used 20877K, capacity 21118K, committed 21296K, reserved 1069056K
  class space    used 2499K, capacity 2599K, committed 2608K, reserved 1048576K
}
{Heap before GC invocations=3 (full 1):
 PSYoungGen      total 9557504K, used 8192512K [0x00007f58fe580000, 0x00007f5b99000000, 0x00007f5b99000000)
  eden space 8192512K, 100% used [0x00007f58fe580000,0x00007f5af2600000,0x00007f5af2600000)
  from space 1364992K, 0% used [0x00007f5af2600000,0x00007f5af2600000,0x00007f5b45b00000)
  to   space 1364992K, 0% used [0x00007f5b45b00000,0x00007f5b45b00000,0x00007f5b99000000)
 ParOldGen       total 21845504K, used 12747K [0x00007f53c9000000, 0x00007f58fe580000, 0x00007f58fe580000)
  object space 21845504K, 0% used [0x00007f53c9000000,0x00007f53c9c72e90,0x00007f58fe580000)
 Metaspace       used 22891K, capacity 23188K, committed 23472K, reserved 1071104K
  class space    used 2602K, capacity 2688K, committed 2736K, reserved 1048576K
1332.241: [GC (Allocation Failure) 
Desired survivor size 1397751808 bytes, new threshold 7 (max 15)
[PSYoungGen: 8192512K->291885K(9557504K)] 8205259K->304649K(31403008K), 0.2443282 secs] [Times: user=0.48 sys=3.83, real=0.25 secs] 
Heap after GC invocations=3 (full 1):
 PSYoungGen      total 9557504K, used 291885K [0x00007f58fe580000, 0x00007f5b99000000, 0x00007f5b99000000)
  eden space 8192512K, 0% used [0x00007f58fe580000,0x00007f58fe580000,0x00007f5af2600000)
  from space 1364992K, 21% used [0x00007f5b45b00000,0x00007f5b5780b650,0x00007f5b99000000)
  to   space 1364992K, 0% used [0x00007f5af2600000,0x00007f5af2600000,0x00007f5b45b00000)
 ParOldGen       total 21845504K, used 12763K [0x00007f53c9000000, 0x00007f58fe580000, 0x00007f58fe580000)
  object space 21845504K, 0% used [0x00007f53c9000000,0x00007f53c9c76e90,0x00007f58fe580000)
 Metaspace       used 22891K, capacity 23188K, committed 23472K, reserved 1071104K
  class space    used 2602K, capacity 2688K, committed 2736K, reserved 1048576K
}
{Heap before GC invocations=4 (full 1):
 PSYoungGen      total 9557504K, used 8484397K [0x00007f58fe580000, 0x00007f5b99000000, 0x00007f5b99000000)
  eden space 8192512K, 100% used [0x00007f58fe580000,0x00007f5af2600000,0x00007f5af2600000)
  from space 1364992K, 21% used [0x00007f5b45b00000,0x00007f5b5780b650,0x00007f5b99000000)
  to   space 1364992K, 0% used [0x00007f5af2600000,0x00007f5af2600000,0x00007f5b45b00000)
 ParOldGen       total 21845504K, used 12763K [0x00007f53c9000000, 0x00007f58fe580000, 0x00007f58fe580000)
  object space 21845504K, 0% used [0x00007f53c9000000,0x00007f53c9c76e90,0x00007f58fe580000)
 Metaspace       used 22939K, capacity 23252K, committed 23472K, reserved 1071104K
  class space    used 2602K, capacity 2688K, committed 2736K, reserved 1048576K
2858.320: [GC (Allocation Failure) 
Desired survivor size 1397751808 bytes, new threshold 7 (max 15)
[PSYoungGen: 8484397K->677978K(9557504K)] 8497161K->690750K(31403008K), 0.5290410 secs] [Times: user=0.74 sys=8.70, real=0.53 secs] 
Heap after GC invocations=4 (full 1):
 PSYoungGen      total 9557504K, used 677978K [0x00007f58fe580000, 0x00007f5b99000000, 0x00007f5b99000000)
  eden space 8192512K, 0% used [0x00007f58fe580000,0x00007f58fe580000,0x00007f5af2600000)
  from space 1364992K, 49% used [0x00007f5af2600000,0x00007f5b1bc16a80,0x00007f5b45b00000)
  to   space 1364992K, 0% used [0x00007f5b45b00000,0x00007f5b45b00000,0x00007f5b99000000)
 ParOldGen       total 21845504K, used 12771K [0x00007f53c9000000, 0x00007f58fe580000, 0x00007f58fe580000)
  object space 21845504K, 0% used [0x00007f53c9000000,0x00007f53c9c78e90,0x00007f58fe580000)
 Metaspace       used 22939K, capacity 23252K, committed 23472K, reserved 1071104K
  class space    used 2602K, capacity 2688K, committed 2736K, reserved 1048576K
}
{Heap before GC invocations=5 (full 1):
 PSYoungGen      total 9557504K, used 8870490K [0x00007f58fe580000, 0x00007f5b99000000, 0x00007f5b99000000)
  eden space 8192512K, 100% used [0x00007f58fe580000,0x00007f5af2600000,0x00007f5af2600000)
  from space 1364992K, 49% used [0x00007f5af2600000,0x00007f5b1bc16a80,0x00007f5b45b00000)
  to   space 1364992K, 0% used [0x00007f5b45b00000,0x00007f5b45b00000,0x00007f5b99000000)
 ParOldGen       total 21845504K, used 12771K [0x00007f53c9000000, 0x00007f58fe580000, 0x00007f58fe580000)
  object space 21845504K, 0% used [0x00007f53c9000000,0x00007f53c9c78e90,0x00007f58fe580000)
 Metaspace       used 22946K, capacity 23252K, committed 23472K, reserved 1071104K
  class space    used 2602K, capacity 2688K, committed 2736K, reserved 1048576K
4467.385: [GC (Allocation Failure) 
Desired survivor size 1397751808 bytes, new threshold 7 (max 15)
[PSYoungGen: 8870490K->1096779K(9557504K)] 8883262K->1109558K(31403008K), 0.9701847 secs] [Times: user=1.96 sys=15.40, real=0.97 secs] 
Heap after GC invocations=5 (full 1):
 PSYoungGen      total 9557504K, used 1096779K [0x00007f58fe580000, 0x00007f5b99000000, 0x00007f5b99000000)
  eden space 8192512K, 0% used [0x00007f58fe580000,0x00007f58fe580000,0x00007f5af2600000)
  from space 1364992K, 80% used [0x00007f5b45b00000,0x00007f5b88a12d00,0x00007f5b99000000)
  to   space 1364992K, 0% used [0x00007f5af2600000,0x00007f5af2600000,0x00007f5b45b00000)
 ParOldGen       total 21845504K, used 12779K [0x00007f53c9000000, 0x00007f58fe580000, 0x00007f58fe580000)
  object space 21845504K, 0% used [0x00007f53c9000000,0x00007f53c9c7ae90,0x00007f58fe580000)
 Metaspace       used 22946K, capacity 23252K, committed 23472K, reserved 1071104K
  class space    used 2602K, capacity 2688K, committed 2736K, reserved 1048576K
}
{Heap before GC invocations=6 (full 1):
 PSYoungGen      total 9557504K, used 9289291K [0x00007f58fe580000, 0x00007f5b99000000, 0x00007f5b99000000)
  eden space 8192512K, 100% used [0x00007f58fe580000,0x00007f5af2600000,0x00007f5af2600000)
  from space 1364992K, 80% used [0x00007f5b45b00000,0x00007f5b88a12d00,0x00007f5b99000000)
  to   space 1364992K, 0% used [0x00007f5af2600000,0x00007f5af2600000,0x00007f5b45b00000)
 ParOldGen       total 21845504K, used 12779K [0x00007f53c9000000, 0x00007f58fe580000, 0x00007f58fe580000)
  object space 21845504K, 0% used [0x00007f53c9000000,0x00007f53c9c7ae90,0x00007f58fe580000)
 Metaspace       used 22951K, capacity 23252K, committed 23472K, reserved 1071104K
  class space    used 2602K, capacity 2688K, committed 2736K, reserved 1048576K
6153.256: [GC (Allocation Failure) 
Desired survivor size 2375548928 bytes, new threshold 6 (max 15)
[PSYoungGen: 9289291K->1364987K(9557504K)] 9302070K->2238887K(31403008K), 1.9590450 secs] [Times: user=5.38 sys=29.68, real=1.96 secs] 
Heap after GC invocations=6 (full 1):
 PSYoungGen      total 9557504K, used 1364987K [0x00007f58fe580000, 0x00007f5b99000000, 0x00007f5b99000000)
  eden space 8192512K, 0% used [0x00007f58fe580000,0x00007f58fe580000,0x00007f5af2600000)
  from space 1364992K, 99% used [0x00007f5af2600000,0x00007f5b45afed70,0x00007f5b45b00000)
  to   space 1364992K, 0% used [0x00007f5b45b00000,0x00007f5b45b00000,0x00007f5b99000000)
 ParOldGen       total 21845504K, used 873899K [0x00007f53c9000000, 0x00007f58fe580000, 0x00007f58fe580000)
  object space 21845504K, 4% used [0x00007f53c9000000,0x00007f53fe56ae90,0x00007f58fe580000)
 Metaspace       used 22951K, capacity 23252K, committed 23472K, reserved 1071104K
  class space    used 2602K, capacity 2688K, committed 2736K, reserved 1048576K
}

3 Answers3

1

You need some more detailed stats, consider something like visualgc to watch what is happening in the different parts of the heap as the load develops & flags like -XX:+PrintTenuringDistribution for delving into what is going on in the eden related spaces.

The fact that the total heap size is going up and up in the 2nd case suggests that you are gradually bleeding into tenured so the increased time may be the cost of the survivors pinging between the from & to survivor spaces (and then eventually being promoted to tenured). Note that this is a supposition based on the data available, we need more detailed stats to really understand what is going on (or an understanding of how chronicle behaves which I don't personally have).

This may mean that you have a memory leak or that eden is now simply too small (whether in terms of maxtenuringthreshold or just the sheer size of eden, generally speaking some tuning of eden may be required to fit your new heap usage pattern).

Matt
  • 8,367
  • 4
  • 31
  • 61
1

You're concerned about pause times, yet you're not setting a pause time goal (MaxGCPauseMillis). The default collector (ParallelGC) has no default pause time goal, so you have to specify one explicitly.

When set the GC will try to adjust the young gen size and tenuring thresholds to meet those goals.

FWIW, as an experiment we tried both G1 and CMS and, at least with default params, they show the same pattern and if anything worse times.

G1 does have a default pause goal of 200ms, so pause times going up into that range wouldn't be unexpected.

the8472
  • 40,999
  • 5
  • 70
  • 122
  • We can try it again with explicit pause time goals, but the first experiment with G1 and the default pause of 200ms showed GC pauses well in excess of that, in the 500-600ms range. Still, I wasn't aware that the Parallel Collector even tried to honor pause time goals, so it's worth a shot. – Joshua Macy Jul 29 '16 at 12:27
1

I would look at keeping the heap below 32 GB so you can use compressed oops.

By having more collections you may be spreading the work over time. I would try skrinking the new space as well.

Peter Lawrey
  • 525,659
  • 79
  • 751
  • 1,130
  • Hi Peter -- I was working with Joshua on this also. We tried shrinking Young Gen and it didn't make any difference. We cut it to 1/2 the original size (shown above). What happened is that the GCs occurred twice as often, but still took the same amount of time as when the Young Gen was twice as big. – Sam Goldberg Jul 29 '16 at 13:29
  • @SamGoldberg In that case, I guess you have a lot of medium lived objects which are being copied between survivor spaces. If you use 32-bit references it should reduce GC time and speed up the application. – Peter Lawrey Jul 29 '16 at 17:13
  • Actually, that's not the whole story. Yes, cutting the Young Gen in half didn't make much of a difference, but at least in test cutting it even further (down to 2G while limiting the heap to 32G for compressed oops) reduced the average pause from 0.4s to 0.2s while leaving the max pause at around 0.88s, which is still about 7 times higher than PROD before the change. – Joshua Macy Aug 12 '16 at 12:46
  • Are you sure this is the only change? Can you revert it and try again? – Peter Lawrey Aug 12 '16 at 22:32