11

I'm trying to read a large text corpus into memory with Java. At some point it hits a wall and just garbage collects interminably. I'd like to know if anyone has experience beating Java's GC into submission with large data sets.

I'm reading an 8 GB file of English text, in UTF-8, with one sentence to a line. I want to split() each line on whitespace and store the resulting String arrays in an ArrayList<String[]> for further processing. Here's a simplified program that exhibits the problem:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

/** Load whitespace-delimited tokens from stdin into memory. */
public class LoadTokens {
    private static final int INITIAL_SENTENCES = 66000000;

    public static void main(String[] args) throws IOException {
        List<String[]> sentences = new ArrayList<String[]>(INITIAL_SENTENCES);
        BufferedReader stdin = new BufferedReader(new InputStreamReader(System.in));
        long numTokens = 0;
        String line;

        while ((line = stdin.readLine()) != null) {
            String[] sentence = line.split("\\s+");
            if (sentence.length > 0) {
                sentences.add(sentence);
                numTokens += sentence.length;
            }
        }
        System.out.println("Read " + sentences.size() + " sentences, " + numTokens + " tokens.");
    }
}

Seems pretty cut-and-dried, right? You'll notice I even pre-size my ArrayList; I have a little less than 66 million sentences and 1.3 billion tokens. Now if you whip out your Java object sizes reference and your pencil, you'll find that should require about:

  • 66e6 String[] references @ 8 bytes ea = 0.5 GB
  • 66e6 String[] objects @ 32 bytes ea = 2 GB
  • 66e6 char[] objects @ 32 bytes ea = 2 GB
  • 1.3e9 String references @ 8 bytes ea = 10 GB
  • 1.3e9 Strings @ 44 bytes ea = 53 GB
  • 8e9 chars @ 2 bytes ea = 15 GB

83 GB. (You'll notice I really do need to use 64-bit object sizes, since Compressed OOPs can't help me with > 32 GB heap.) We're fortunate to have a RedHat 6 machine with 128 GB RAM, so I fire up my Java HotSpot(TM) 64-bit Server VM (build 20.4-b02, mixed mode) from my Java SE 1.6.0_29 kit with pv giant-file.txt | java -Xmx96G -Xms96G LoadTokens just to be safe, and kick back while I watch top.

Somewhere less than halfway through the input, at about 50-60 GB RSS, the parallel garbage collector kicks up to 1300% CPU (16 proc box) and read progress stops. Then it goes a few more GB, then progress stops for even longer. It fills up 96 GB and ain't done yet. I've let it go for an hour and a half, and it's just burning ~90% system time doing GC. That seems extreme.

To make sure I wasn't crazy, I whipped up the equivalent Python (all two lines ;) and it ran to completion in about 12 minutes and 70 GB RSS.

So: am I doing something dumb? (Aside from the generally inefficient way things are being stored, which I can't really help -- and even if my data structures are fat, as long as they fit, Java shouldn't just suffocate.) Is there magic GC advice for really large heaps? I did try -XX:+UseParNewGC and it seems even worse.

Jay Hacker
  • Where are the `char[]` objects backing the strings? – Jon Skeet Mar 06 '12 at 23:32
  • In the `String` objects: 24-byte object header + 8-byte `char[]` pointer + 4 bytes each for start, offset, and hashcode, if my calculations are correct. – Jay Hacker Mar 07 '12 at 14:44
  • That's the `char[]` *reference* - but what about the `char[]` *objects* themselves? A `char[]` array has an object overhead too... – Jon Skeet Mar 07 '12 at 14:46
  • Ah, you are right! I added it in. But that's still chump change in the scheme of things, and far less memory than I've got -- what gives?? – Jay Hacker Mar 07 '12 at 15:43
  • @Jay: where are you from (your location isn't set)? At http://www.nosid.org/java-set-integer-memory-overhead.html you can find a **German blog entry** about beating Java's GC into submission with large data sets, a solution (Jon Skeet's Idea 2), and some performance measures. The main message and performance should also be understandable for non-Germans from the given code and numbers... – DaveFar Mar 07 '12 at 15:52
  • @DaveBall: Between Google and my co-worker, we figured out he's talking about using arrays of primitives instead of wrapper types, which is great advice I have used elsewhere. Unfortunately, in this case I really am storing Objects (pointers), so I can't use it. (Also, it's sad that Java forced this poor gent to use log(n) binary searches for "set" membership just to save space!) If you're suggesting I just store my input in one big `char[]` (or `byte[]`), I run into another patch of Java suck: arrays can only hold 2 billion things! – Jay Hacker Mar 07 '12 at 16:34

4 Answers

4

-XX:+UseConcMarkSweepGC: finishes in 78 GB and ~12 minutes. (Almost as good as Python!) Thanks for everyone's help.
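
For reference, that is just the question's command line with the collector flag added, something like:

pv giant-file.txt | java -Xmx96G -Xms96G -XX:+UseConcMarkSweepGC LoadTokens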

Jay Hacker
  • I often use CMS for Java servers with large heaps to reduce the GC impact on response time. I was not convinced changing the collection policy would help your code for such a task. I guess using CMS has changed the way the heap is split into parts, so your JVM gets a larger OldGen. – Yves Martin Mar 08 '12 at 07:07
2

Idea 1

Start by considering this:

while ((line = stdin.readLine()) != null) {

It at least used to be the case that readLine would return a String with a backing char[] of at least 80 characters. Whether or not that becomes a problem depends on what the next line does:

String[] sentence = line.split("\\s+");

You should determine whether the strings returned by split keep the same backing char[].

If they do (and assuming your lines are often shorter than 80 characters) you should use:

line = new String(line);

This will create a copy of the string with a "right-sized" backing char[].

If they don't, then you should potentially work out some way of creating the same behaviour but changing it so they do use the same backing char[] (i.e. they're substrings of the original line) - and do the same cloning operation, of course. You don't want a separate char[] per word, as that'll waste far more memory than the spaces.
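
As a rough sketch of the first case (where split's results share the line's backing char[]), the copy would slot into the question's loop like this:

while ((line = stdin.readLine()) != null) {
    // Copy the line first so its backing char[] is exactly the right size;
    // the substrings produced by split then point into that right-sized array.
    line = new String(line);
    String[] sentence = line.split("\\s+");
    if (sentence.length > 0) {
        sentences.add(sentence);
        numTokens += sentence.length;
    }
}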

Idea 2

Your title talks about the poor performance of lists - but of course you can easily take the list out of the equation here by simply creating a String[][], at least for test purposes. It looks like you already know the size of the file - and if you don't, you could run it through wc to check beforehand. Just to see if you can avoid that problem to start with.
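
A minimal sketch of that test, reusing INITIAL_SENTENCES and stdin from the question:

// A plain pre-sized array instead of an ArrayList, just to rule the list out.
String[][] sentences = new String[INITIAL_SENTENCES][];
int count = 0;
String line;
while ((line = stdin.readLine()) != null) {
    sentences[count++] = line.split("\\s+");
}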

Idea 3

How many distinct words are there in your corpus? Have you considered keeping a HashSet<String> and adding each word to it as you come across it? That way you're likely to end up with far fewer strings. At this point you would probably want to abandon the "single backing char[] per line" from the first idea - you'd want each string to be backed by its own char array, as otherwise a line with a single new word in it is still going to require a lot of characters. (Alternatively, for real fine-tuning, you could see how many "new words" there are in a line and clone each string or not.)
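
One way to sketch that: use a HashMap as the pool, so the canonical instance can be looked up again (a bare HashSet doesn't hand back the stored reference). Inside the question's reading loop, something like:

Map<String, String> pool = new HashMap<String, String>();

// Inside the reading loop: replace each token with its canonical instance.
for (int i = 0; i < sentence.length; i++) {
    String canonical = pool.get(sentence[i]);
    if (canonical == null) {
        canonical = new String(sentence[i]);  // give each new word its own right-sized char[]
        pool.put(canonical, canonical);
    }
    sentence[i] = canonical;
}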

Jon Skeet
  • Re: Idea 3, might you consider using `String.intern()`? – Louis Wasserman Mar 06 '12 at 23:45
  • @LouisWasserman: Potentially - but only if the process wasn't going to do anything else. I generally prefer to have my own interning set, to avoid "polluting" the process-wide one. (Although there may be funky things to mean that's not a problem these days. It just *feels* cleaner.) – Jon Skeet Mar 06 '12 at 23:53
  • Hmmm. Alternate suggestion -- Guava's [`Interners.newWeakInterner`](http://docs.guava-libraries.googlecode.com/git-history/release/javadoc/com/google/common/collect/Interners.html#newWeakInterner()) to do it with weak references, just so the interned strings can get GC'd when you're done. – Louis Wasserman Mar 06 '12 at 23:57
  • @LouisWasserman: Right, that would be appropriate, certainly :) – Jon Skeet Mar 06 '12 at 23:58
  • The majority of my lines are longer than 80 characters. `String.split()` ultimately calls [`String.substring`](http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/6-b14/java/lang/String.java#String.substring%28int%2Cint%29), which just returns a pointer into the same backing `char[]`. `ArrayList` really is just an `Object[]`, and in the general case I do need to resize it. Keeping my own set of unique strings might net some significant savings -- but all of this is just tweaking to get the memory usage down. If I have the memory, shouldn't it just _work_? – Jay Hacker Mar 07 '12 at 14:54
  • @JayHacker: Without having both the code and data file to examine why it's not working, it's very hard to tell what's going wrong. But if your lines are more than 80 characters long, it's entirely possible you're ending up with 160-character arrays backing the strings. Just *try* a `line = new String(line)` and see what happens... – Jon Skeet Mar 07 '12 at 15:07
  • I have measured that this code alone (only the reading) consumes 10 times the volume of the input data. Java makes many copies of objects; the rule is to release references as soon as possible (locally scoped variables). If you are doing more work after reading, of course you may need even more memory. – Yves Martin Mar 07 '12 at 15:42
  • Made a mistake: with Java 1.6.0_26, a 6 MB file (128,358 sentences, 1,130,229 tokens) requires 480 MB of heap! – Yves Martin Mar 07 '12 at 16:02
  • Jon Skeet's wish is my command... unfortunately, no dice. Same speed and memory usage, however this time my process just randomly died after 43 minutes. Anybody know what exit status 9 from java means? ;) – Jay Hacker Mar 07 '12 at 16:36
2

You should use the following tricks:

  • Help the JVM to collapse identical tokens into a single String reference by interning each token before you store it; see String.intern for details and the sketch after this list. As far as I know, it should also have the effect Jon Skeet spoke about, cutting the big per-line char arrays into small per-word pieces.

  • Use the experimental HotSpot options that compact the String and char[] representations:

    -XX:+UseCompressedStrings -XX:+UseStringCache -XX:+OptimizeStringConcat
    

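A minimal sketch of the interning idea from the first bullet, applied to the question's loop:

while ((line = stdin.readLine()) != null) {
    String[] sentence = line.split("\\s+");
    // Replace every token with its canonical, interned instance so that
    // repeated words share a single String across the whole corpus.
    for (int i = 0; i < sentence.length; i++) {
        sentence[i] = sentence[i].intern();
    }
    sentences.add(sentence);
    numTokens += sentence.length;
}
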
With that much memory, you should configure your system and the JVM to use large pages.
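
On the JVM side that is a single flag; the operating system also needs huge pages reserved beforehand (for example via vm.nr_hugepages on Linux), which is left out here:

java -Xmx96G -Xms96G -XX:+UseLargePages LoadTokens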

It is really difficult to improve performance by more than about 5% with GC tuning alone. You should first reduce your application's memory consumption through profiling.

By the way, I wonder if you really need to keep the full content of the corpus in memory - I do not know what your code does next with all the sentences, but you should consider an alternative such as the Lucene indexing tool for counting words or extracting any other information from your text.

Yves Martin
  • Thanks for the suggestions. I've tried String interning in previous apps; it gets very slow with a lot of data, and it requires a huge PermGen, which really confuses GC. I tried your String optimization options, and it might have decreased memory usage a bit, but it still eventually fills up memory and borks. The large pages idea is a good one; unfortunately, you really have to reboot to get enough contiguous free memory (what is this, DOS? ;), and that memory can't be used for anything else. I'm reading up on GC tuning, and I think I'm going to try the concurrent collector next. – Jay Hacker Mar 07 '12 at 15:37
0

You should check how your heap space is split into parts (PermGen, OldGen, Eden and Survivors) using VisualGC, which is now a plugin for VisualVM.

In your case, you probably want to reduce Eden and the Survivor spaces to increase the OldGen, so that your GC does not spin collecting an almost-full OldGen...

To do so, you have to use advanced options like:

-XX:NewRatio=2 -XX:SurvivorRatio=8

Beware: these zones and their default sizing policy depend on the collector you use, so change one parameter at a time and check again.

If all those Strings have to live in memory for the whole lifetime of the JVM, it is a good idea to intern them into a PermGen made large enough with -XX:MaxPermSize, and to avoid collection of that zone with -Xnoclassgc.
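
For instance (the value is only a placeholder; the PermGen would have to be sized to hold all the interned strings):

-XX:MaxPermSize=8g -Xnoclassgc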

I recommend enabling these debugging options (no overhead expected) and then posting the GC log so that we can get an idea of your GC activity:

-XX:+PrintGC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:verbosegc.log
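
Put together with the question's command line, that might look like:

pv giant-file.txt | java -Xmx96G -Xms96G -XX:NewRatio=2 -XX:SurvivorRatio=8 \
    -XX:+PrintGC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:verbosegc.log LoadTokens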
Yves Martin