
While moving some of our applications from CMS to G1, I noticed that one of them suffered from a startup time extended by a factor of 4. Application stop time due to GC cycles is not the cause. On comparing application behaviour, I discovered that this one carries a whopping 250 million live objects after startup (in a heap of 12G). Further investigation showed that the application ran at normal speed during the first 5 million allocations, but performance degraded more and more as the pool of live objects grew larger.

Further experiments indicated that once a certain threshold of live objects is reached, allocation of new objects does indeed slow down when using G1. I found that doubling the number of live objects puts a factor of around 2.5 on the time needed for an allocation; with other GC engines, the factor is only 2. Compounded over the roughly 5-6 doublings between 5 million and 250 million live objects, that ratio of 2.5/2 = 1.25 per doubling works out to about 1.25^5.6 ≈ 3.5 relative to the other collectors, which would indeed explain a slowdown in the region of the observed factor 4.

There are two issues which make me doubt that conclusion, though:

  • The threshold of roughly 5 million live instances seems to be related to the heap as a whole. With G1, I would have expected any such degradation threshold to be related to a region, not to the heap as a whole.
  • I looked around for documents on the web explaining (or at least stating) this behaviour, but found none. I did not even find recommendations of the sort 'Having more than xxx live objects is evil'.

So: it would be great if someone could either confirm that my observations are correct, and perhaps point me to a document explaining this or to some recommendations regarding this area, or alternatively tell me what I'm doing wrong. :)

Here is a short test case (run multiple times, take mean values, deduct the displayed garbage collection times):

import java.util.HashMap;

/**
 * Allocator demonstrates the dependency between the number of live objects
 * and allocation speed, using various GC algorithms.
 * Call it using, e.g.:
 *   java -Xmx12g -Xms12g -XX:+PrintGCApplicationStoppedTime -XX:+UseG1GC Allocator
 *   java -Xmx12g -Xms12g -XX:+PrintGCApplicationStoppedTime Allocator
 * Deduct the reported stopped times from the execution time.
 */
public class Allocator {

    public static void main(String[] args) {
        timer(2000000, true);
        for (int i = 1000000; i <= 32000000; i *= 2) {
            timer(i, false);
        }
        for (int i = 32000000; i >= 1000000; i /= 2) {
            timer(i, false);
        }
    }

    private static void timer(int num, boolean warmup) {
        long before = System.currentTimeMillis();
        Allocator a = new Allocator();
        int size = a.allocate(num);
        long after = System.currentTimeMillis();
        if (!warmup) {
            System.out.println("Time needed for " + num + " allocations: "
                + (after - before) + " millis. Map size = " + size);
        }
    }

    private int allocate(int numElements) {
        HashMap<Integer, String> map = new HashMap<>(2 * numElements);
        for (int i = 0; i < numElements; i++) {
            map.put(i, Integer.toString(i));
        }
        return map.size();
    }

}
malamut
  • GC logs (via `PrintGCDetails`) and mentioning the java version you're using would be helpful – the8472 May 25 '15 at 09:44
  • Full GC logs of the actual application are far too large. :) But anyway, the above test case exhibits the behaviour. Please note that the problem here is *not* the GC pauses. - I use Java 8 Update 45. Behaviour is identical with Windows and Linux. – malamut May 25 '15 at 09:57
  • the question is whether your test case causes the same problems as your actual workload. having a single hashmap holding onto millions of objects will create a very large reference array inside the hash map, which will probably require a region for itself, thus making most of the references from the hashmap table to its nodes cross-region, which requires more book-keeping. – the8472 May 25 '15 at 10:31
  • Good idea, but it seems the single large array is not the problem either. The original application holds all those instances in a ConcurrentHashMap. My first version of the test case did the same. Using a ConcurrentHashMap creates a lot of (small) additional instances, so the problem shows up even more strongly. Replacing the single HashMap in the test case with 200 HashMaps, all of which are filled using map[i % 200].put(), also makes the effect stronger. – malamut May 25 '15 at 11:28
  • Re my previous comment: if I do not create the HashMaps all at once, but fill each HashMap right after creation (so that elements are constructed close to their storage place), it seems to help G1 a bit as compared to the single HashMap. The effect is still strongly visible, however. – malamut May 25 '15 at 11:46
  • can you update the test-case? and one thing you could try is playing with the region size. or simply forcing a full GC after startup to clear out all the regions that G1's heuristics would be reluctant to touch due to all the cross-region references. – the8472 May 25 '15 at 13:24
  • the8472, it seems you were right all along. The more I optimize the test case towards getting rid of cross-region references, the closer G1's performance gets to standard performance. I'm now down to creating more instances than the original application, in less than standard time (counting in GC time), and there are still quite a few cross-references left. I consider this issue solved. Please leave a formal answer here, so that I can formally close it. – malamut May 25 '15 at 14:15

1 Answer


As discussed in the comments above:

Your test case pre-allocates very large reference arrays that are long-lived and essentially occupy a region of their own (they probably end up in the old gen or humongous regions), and then populates them with millions of additional objects that are likely to live in a different region.

This creates lots of cross-region references, which G1 can deal with in moderate amounts, but not in the millions per region.

Highly interconnected regions are also considered expensive to collect by G1's heuristics and are thus less likely to get collected, even if they consist entirely of garbage.

Allocating objects together reduces cross-region references.
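
A minimal sketch of that idea, following the experiment described in the comments (the shard count of 200 mirrors that experiment and is not a tuned value): each small map is filled right after it is constructed, so its table array and the entries it references are allocated close together and are more likely to end up in the same region.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;

public class ShardedAllocator {

    // Creates 'shards' small maps and fills each one immediately after
    // construction, instead of building one huge pre-sized map, so that a
    // map's table array and its entries are allocated near each other.
    static List<HashMap<Integer, String>> allocateSharded(int numElements, int shards) {
        List<HashMap<Integer, String>> maps = new ArrayList<>(shards);
        int perShard = numElements / shards;
        for (int s = 0; s < shards; s++) {
            HashMap<Integer, String> map = new HashMap<>(2 * perShard);
            int base = s * perShard;
            for (int i = 0; i < perShard; i++) {
                map.put(base + i, Integer.toString(base + i));
            }
            maps.add(map);
        }
        return maps;
    }

    public static void main(String[] args) {
        // Mirrors the upper bound of the original test case; needs a large heap (e.g. -Xmx12g).
        List<HashMap<Integer, String>> maps = allocateSharded(32_000_000, 200);
        System.out.println("Shards allocated: " + maps.size());
    }
}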

Not artificially extending their lifetime too much (e.g. by putting them into some sort of cache) also allows them to die during a young-generation GC; young regions are much easier to collect than old regions, which by their nature accumulate objects referenced from other regions.

So all in all your test-case was rather hostile to the region-based nature of G1.
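
For the other experiments suggested in the comments (changing the region size, or forcing a full GC once startup is done), a rough sketch of how they could be tried; the 32m region size is only an example value, not a tuned recommendation:

java -Xmx12g -Xms12g -XX:+UseG1GC -XX:G1HeapRegionSize=32m -XX:+PrintGCApplicationStoppedTime Allocator

A full GC after startup can be forced with a System.gc() call at the end of initialization (honoured unless -XX:+DisableExplicitGC is set), or externally via jcmd <pid> GC.run.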

the8472
  • Yes, the test case is hostile, but the actual application is hostile in the exact same way, so the test case did its job. :) I have learned the lesson that, with G1, it must be ensured that the number of cross-region references does not grow far beyond a million. The easiest way to achieve this will often be to make sure that the total number of live objects does not grow into the multi-million range. – malamut May 28 '15 at 09:53