0

I suspect a slow memory leak caused by JNI code. I'm seeing the GC Roots # in the summary of consecutive heap dumps climbing. After two hours it showed 470,000 GC roots, after six hours, almost a million more GC roots, after 33 hours over 7 million GC roots.

However, when I look at the Head Dump that says there are 7 million GC roots, and I choose the "Objects" view and the "GC Roots" preset, I select Aggregation "Types". This list shows a total count of less than 15,000 objects:

  • JNI global - count 7857
  • JNI local - count 5
  • Java frame - count 983
  • monitor used - count 7
  • sticky class - count 3596
  • thread object - count 145

Where are those 7 million roots?

swpalmer
  • 3,890
  • 2
  • 23
  • 31

1 Answers1

1

Different GC roots can reffer to the same object instance. This explains the difference, since count from "GC root" view display number of unique instances. You can find more details using OQL. First let's display number of GC roots for particular GC type:

printHistogram()

function printHistogram() {
  var roots = heap.roots()
  var histoMap = new Array();
  var result = new Array();
  var cnt = 0;
 
  while (roots.hasMoreElements()) {
    var root = roots.nextElement();
    var type = root.type;
    if (histoMap[type] == undefined) {
      histoMap[type] = 1;
    } else {
      histoMap[type]++;
    }
  }
  for (var key in histoMap){
    if (histoMap.hasOwnProperty(key)) {
      result[cnt++] = { key: key, count: histoMap[key] };
  }
  return map(sort(result, "rhs.count - lhs.count"), '"Root count: "+it.count+" for type: "+it.key');
}

Running this query against your heap dump produces:

Root count: 12878043 for type: JNI local
Root count: 7858 for type: JNI global
Root count: 3599 for type: sticky class
Root count: 1631 for type: Java frame
Root count: 146 for type: thread object
Root count: 7 for type: monitor used

We can see that the majority of GC roots are of "JNI local" type. Let us see how many "JNI local" roots point to the same object instance. We can modify above query to:

printHistogram()

function printHistogram() {
  var roots = heap.roots()
  var histoMap = new Array();
  var rootMap = new Array();
  var result = new Array();
  var cnt = 0;

  while (roots.hasMoreElements()) {
    var root = roots.nextElement();
    if (root.type == "JNI local") {
      var objid = root.id;
      if (histoMap[objid] == undefined) {
        histoMap[objid] = 1;
        rootMap[objid] = root.referrer;
      } else {
        histoMap[objid]++;
      }
    }
  }
  for (var key in histoMap){
    if (histoMap.hasOwnProperty(key)) {
      result[cnt++] = { key: rootMap[key], count: histoMap[key] };
    }
  }

  return map(sort(result, "rhs.count - lhs.count"), '"Root count: "+it.count+" for object: "+toHtml(it.key)');
}

The result is below:

Root count: 6439020 for object: java.lang.String#44429
Root count: 6439020 for object: java.lang.String#55081
Root count: 1 for object: java.nio.DirectByteBuffer#9
Root count: 1 for object: java.util.ArrayList#22281
Root count: 1 for object: java.lang.String#71518

We can see that two strings java.lang.String#44429 and java.lang.String#55081 are responsible for that huge number of GC roots. They have 6,5 million GC roots each.

Those "JNI local" GC root are referenced from MessageDispatherThread-1 (tid=216) from frame ca.digitalrapids.kayak.jni.KayakNativeWorkerThread.runNative (Native Method). See screenshots below:

stacktrace

locals

Tomas Hurka
  • 6,723
  • 29
  • 38
  • Thanks very much for this analysis! Now I think I understand what is happening. There are native threads calling in to Java at around 30 times per second (video framerate callbacks), since those threads never "return to Java" there can be no automatic clean up of the local references. There must be a missing call to DeleteLocalRef. – swpalmer Jan 03 '23 at 14:09
  • Yes, see my update. I added two screenshots with thread and affecting frame. – Tomas Hurka Jan 03 '23 at 14:16
  • I tried your second script and after some time it failed: javax.script.ScriptException: org.graalvm.polyglot.PolyglotException: TypeError: invokeMember (findThing) on org.graalvm.visualvm.lib.profiler.oql.engine.api.impl.Snapshot@233b05de failed due to: Cannot convert '4038369040'(language: Java, type: java.lang.String) to Java type 'long': Invalid or lossy primitive coercion. – swpalmer Jan 03 '23 at 15:01
  • Please run VisualVM on JDK 11 or JDK 8, which bundles Nashorn javascript engine. – Tomas Hurka Jan 03 '23 at 17:56
  • 1
    I updated the second script to run also on GraalJS (it is used as javascript engine on JDK 17+) – Tomas Hurka Jan 04 '23 at 09:58