
I set ParallelGCThreads=1 and use the G1 collector; all other JVM settings are at their defaults. I run PageRank on Spark 1.5.1 on two EC2 nodes with a 100 GB heap each.
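
For reference, the GC flags go to the executors through spark.executor.extraJavaOptions, roughly like this (the submit command below is simplified and illustrative; the full option string I actually use is quoted in the comments further down):

    spark-submit \
      --executor-memory 100g \
      --conf "spark.executor.extraJavaOptions=-XX:+UseG1GC -XX:ParallelGCThreads=1 -verbose:gc -XX:+PrintGCDetails -XX:+PrintAdaptiveSizePolicy" \
      ...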

My heap usage graph is below (red area: young generation, black area: old generation). All young GCs are small; then, all of a sudden, there is one young GC that collects 60 GB, after which the young GCs become small again. My GC log shows no mixed GCs, no full GCs, one concurrent marking cycle, and tens of young GCs. Why does that huge young GC happen?

[Heap usage graph: red area = young generation, black area = old generation]

A part of my GC log is below. The huge young GC is the one with "Heap: 84.1G".

2015-12-30T06:59:02.488+0000: 245.088: [GC pause (young) 245.089: [G1Ergonomics (CSet Construction) start choosing CSet, _pending_cards: 1727, predicted base time: 24.64 ms, remaining time: 175.36 ms, target pause time: 200.00 ms]
 245.089: [G1Ergonomics (CSet Construction) add young regions to CSet, eden: 206 regions, survivors: 3 regions, predicted young region time: 148.87 ms]
 245.089: [G1Ergonomics (CSet Construction) finish choosing CSet, eden: 206 regions, survivors: 3 regions, old: 0 regions, predicted pause time: 173.51 ms, target pause time: 200.00 ms]
2015-12-30T06:59:02.531+0000: 245.131: [SoftReference, 0 refs, 0.0000520 secs]2015-12-30T06:59:02.531+0000: 245.131: [WeakReference, 21 refs, 0.0000160 secs]2015-12-30T06:59:02.531+0000: 245.131: [FinalReference, 9759 refs, 0.0084720 secs]2015-12-30T06:59:02.539+0000: 245.140: [PhantomReference, 0 refs, 14 refs, 0.0000190 secs]2015-12-30T06:59:02.539+0000: 245.140: [JNI Weak Reference, 0.0000130 secs] 245.142: [G1Ergonomics (Heap Sizing) attempt heap expansion, reason: recent GC overhead higher than threshold after GC, recent GC overhead: 12.51 %, threshold: 10.00 %, uncommitted: 0 bytes, calculated expansion amount: 0 bytes (20.00 %)]
, 0.0534140 secs]
   [Parallel Time: 42.3 ms, GC Workers: 1]
      [GC Worker Start (ms):  245088.6]
      [Ext Root Scanning (ms):  14.4]
      [Update RS (ms):  1.9]
         [Processed Buffers:  34]
      [Scan RS (ms):  0.4]
      [Code Root Scanning (ms):  0.0]
      [Object Copy (ms):  25.5]
      [Termination (ms):  0.0]
      [GC Worker Other (ms):  0.0]
      [GC Worker Total (ms):  42.3]
      [GC Worker End (ms):  245130.9]
   [Code Root Fixup: 0.0 ms]
   [Code Root Migration: 0.0 ms]
   [Clear CT: 1.6 ms]
   [Other: 9.5 ms]
      [Choose CSet: 0.0 ms]
      [Ref Proc: 8.6 ms]
      [Ref Enq: 0.2 ms]
      [Free CSet: 0.4 ms]
   [Eden: 6592.0M(6592.0M)->0.0B(58.8G) Survivors: 96.0M->128.0M Heap: 30.6G(100.0G)->24.2G(100.0G)]
 [Times: user=0.05 sys=0.00, real=0.06 secs] 
2015-12-30T06:59:43.451+0000: 286.051: [GC pause (young) 286.054: [G1Ergonomics (CSet Construction) start choosing CSet, _pending_cards: 392599, predicted base time: 367.03 ms, remaining time: 0.00 ms, target pause time: 200.00 ms]
 286.054: [G1Ergonomics (CSet Construction) add young regions to CSet, eden: 1884 regions, survivors: 4 regions, predicted young region time: 150.18 ms]
 286.054: [G1Ergonomics (CSet Construction) finish choosing CSet, eden: 1884 regions, survivors: 4 regions, old: 0 regions, predicted pause time: 517.21 ms, target pause time: 200.00 ms]
2015-12-30T06:59:47.767+0000: 290.368: [SoftReference, 0 refs, 0.0000570 secs]2015-12-30T06:59:47.768+0000: 290.368: [WeakReference, 350 refs, 0.0000640 secs]2015-12-30T06:59:47.768+0000: 290.368: [FinalReference, 99336 refs, 0.3781120 secs]2015-12-30T06:59:48.146+0000: 290.746: [PhantomReference, 0 refs, 1 refs, 0.0000290 secs]2015-12-30T06:59:48.146+0000: 290.746: [JNI Weak Reference, 0.0000140 secs] 290.767: [G1Ergonomics (Heap Sizing) attempt heap expansion, reason: recent GC overhead higher than threshold after GC, recent GC overhead: 11.74 %, threshold: 10.00 %, uncommitted: 0 bytes, calculated expansion amount: 0 bytes (20.00 %)]
, 4.7153740 secs]
   [Parallel Time: 4313.9 ms, GC Workers: 1]
      [GC Worker Start (ms):  286053.9]
      [Ext Root Scanning (ms):  15.2]
      [Update RS (ms):  86.3]
         [Processed Buffers:  1557]
      [Scan RS (ms):  4.1]
      [Code Root Scanning (ms):  0.2]
      [Object Copy (ms):  4208.1]
      [Termination (ms):  0.0]
      [GC Worker Other (ms):  0.0]
      [GC Worker Total (ms):  4313.9]
      [GC Worker End (ms):  290367.8]
   [Code Root Fixup: 0.0 ms]
   [Code Root Migration: 0.3 ms]
   [Clear CT: 15.1 ms]
   [Other: 386.0 ms]
      [Choose CSet: 0.0 ms]
      [Ref Proc: 378.4 ms]
      [Ref Enq: 1.7 ms]
      [Free CSet: 3.3 ms]
   [Eden: 58.9G(58.8G)->0.0B(3456.0M) Survivors: 128.0M->1664.0M Heap: 84.1G(100.0G)->26.7G(100.0G)]
 [Times: user=0.78 sys=3.94, real=4.71 secs] 
  • Why do you have only one GC worker? How many CPUs do you have? – Peter Lawrey Jan 03 '16 at 14:10
  • Can you post your GC options? – the8472 Jan 04 '16 at 00:50
  • @PeterLawrey Thanks! I have only one GC worker because I am trying to study GC behavior. I have 16 CPUs, 8 cores each – user2616566 Jan 04 '16 at 02:44
  • @the8472 Thanks! My GC options: "spark.executor.extraJavaOptions=-XX:+PrintFlagsFinal -XX:+PrintReferenceGC -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+PrintAdaptiveSizePolicy -XX:+UnlockDiagnosticVMOptions -XX:+G1SummarizeConcMark -XX:+UseG1GC -XX:ParallelGCThreads=1" – user2616566 Jan 04 '16 at 02:45

1 Answer


attempt heap expansion, reason: recent GC overhead higher than threshold after GC, recent GC overhead: 11.74 %, threshold: 10.00 %

My guess is that this is what is driving G1's decisions. You can relax the throughput goal by setting -XX:GCTimeRatio=4, which allows GC to take up to 20% of CPU cycles relative to application time instead of 10%.
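
As a rough sketch of how that flag maps to an overhead target (this is my understanding of G1's throughput goal; the exact bookkeeping may differ):

    allowed GC overhead ≈ 1 / (1 + GCTimeRatio)

    GCTimeRatio=9 (G1 default)  ->  1/10 = 10%   (the "threshold: 10.00 %" in your log)
    GCTimeRatio=4               ->  1/5  = 20%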

If that's too much, you should do one of the following (a combined sketch follows after the list):

  • allow it to use more CPU cores - that makes it easier to meet its pause time goals, which in turn means it can defer collections for longer, making it easier to meet the throughput goal. Yes, this does mean that using more cores can actually consume fewer CPU cycles overall.
  • relax the pause time goal so it has to collect less often
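
A sketch of how those options could be combined (the flag values here are illustrative, not tuned for your workload):

    -XX:+UseG1GC
    -XX:ParallelGCThreads=8      # let G1 use more of the available cores
    -XX:MaxGCPauseMillis=500     # relax the pause time goal (the default is 200 ms, as seen in your log)
    -XX:GCTimeRatio=4            # accept up to ~20% GC overhead instead of 10%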
  • Thanks! I am trying the options you suggested. As I am studying G1's behavior, can you see why G1 suddenly decided to expand Eden so much? After all, the other collections were so small. And do you know which formula G1 follows to decide the young generation size? – user2616566 Jan 04 '16 at 02:53
  • I only have a rough idea of how it works and would have to google the details myself. The AdaptiveSizePolicy- and (obviously) G1-related flags in the `PrintFlagsFinal` output are a good starting point; some of them affect how aggressive the resizing is. But you can clearly see that it is failing its throughput goal and getting close to failing its pause time goal, so it is fairly obvious that G1 has difficulties with the constraints it is given. – the8472 Jan 04 '16 at 04:31
  • Correction: the predicted pause times are close to exceeding the goal; the actual pause times seem to be smaller. So the predictor is off the mark for some reason. – the8472 Jan 04 '16 at 04:38