5

I'm working in Java. I have a requirement to essentially compare the results of two database queries. To do this, I take each row of the result set and put it into a Hashtable with the field name as the key and the data in the field as the value. I then group the entire result set of Hashtables into a single Vector, just as a container. So essentially, to compare two queries I'm really iterating through two Vectors of Hashtables.
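For concreteness, here is a minimal sketch of the structure I'm describing (the loader method and its names are just illustrative):

import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.sql.SQLException;
import java.util.Hashtable;
import java.util.Vector;

// One Hashtable per row, keyed by column name; all rows collected in a Vector.
static Vector<Hashtable<String, String>> load(ResultSet rs) throws SQLException {
    Vector<Hashtable<String, String>> rows = new Vector<Hashtable<String, String>>();
    ResultSetMetaData meta = rs.getMetaData();
    while (rs.next()) {
        Hashtable<String, String> row = new Hashtable<String, String>();
        for (int i = 1; i <= meta.getColumnCount(); i++) {
            String value = rs.getString(i);
            // Hashtable rejects null values, so substitute an empty string
            row.put(meta.getColumnName(i), value == null ? "" : value);
        }
        rows.add(row);
    }
    return rows;
}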

I've found that this approach works really well for me but requires a lot of memory. Because of other design requirements, I have to do this comparison via a Vector-of-Hashtables-like structure, and not some DB-side procedure.

Does anyone have any suggestions for optimization? The optimal solution would be one that is somewhat similar to what I am doing now as most of the code is already designed around it.

Thanks

Tyler
  • If it is not the same database, this can't be done in SQL without merging it all back into yet another database. – Daniel Ribeiro Aug 24 '10 at 20:56
  • I'd also use HashMap and ArrayList instead, to at least get rid of all of the synchronization overhead of the old collection classes... – John Gardner Aug 24 '10 at 20:59
  • Thanks for all the suggestions. So far it seems like most people agree with the ArrayList-of-HashMaps approach. My HashMaps typically have about 8 key-value pairs; keys are about 10 characters long and values no longer than 30-50 characters. It's not uncommon to compare two ArrayLists with 10,000 HashMaps each, so that's 20,000 HashMaps in memory. Is that too much? Would it make more sense to just load up half of each ArrayList, then dispose of it and load the other half? The only problem is I'd have to dispose and load the halves about 4 times to compare them all. – Tyler Aug 24 '10 at 21:33

7 Answers

6

Specify the same ORDER BY clause (based on the "key") for both result sets. Then you only need to hold one record from each result set in memory at a time.

For example, say your results are res1 and res2.

If the key field of res1 is less than the key field of res2, res2 is missing some records; iterate res1 until its key field is equal to or greater than the key of res2.

Likewise, if the key field of res1 is greater than the key field of res2, res1 is missing some records; iterate res2 instead.

If the key fields of the current records are equal, you can compare their values, then iterate both result sets.

You can see, in this manner, that only one record from each result is required to be held in memory at a given time.
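In code, the streaming comparison might look something like this (a sketch, assuming both result sets are ordered by a single string column, here called "key", and compared on an illustrative "value" column; null handling is omitted):

import java.sql.ResultSet;
import java.sql.SQLException;

// Walk two result sets that share the same ORDER BY key, holding only
// the current record of each in memory.
static void compare(ResultSet res1, ResultSet res2) throws SQLException {
    boolean has1 = res1.next();
    boolean has2 = res2.next();
    while (has1 && has2) {
        int cmp = res1.getString("key").compareTo(res2.getString("key"));
        if (cmp < 0) {
            // res2 is missing this record; advance res1
            System.out.println("Only in res1: " + res1.getString("key"));
            has1 = res1.next();
        } else if (cmp > 0) {
            // res1 is missing this record; advance res2
            System.out.println("Only in res2: " + res2.getString("key"));
            has2 = res2.next();
        } else {
            // Keys match: compare the remaining fields, then advance both
            if (!res1.getString("value").equals(res2.getString("value"))) {
                System.out.println("Differs at key: " + res1.getString("key"));
            }
            has1 = res1.next();
            has2 = res2.next();
        }
    }
    // Whatever remains on either side has no counterpart on the other
    while (has1) { System.out.println("Only in res1: " + res1.getString("key")); has1 = res1.next(); }
    while (has2) { System.out.println("Only in res2: " + res2.getString("key")); has2 = res2.next(); }
}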

erickson
  • I'd go with something like this, a "streaming" comparison, presuming you can ORDER BY to get them in the same order. – John Gardner Aug 24 '10 at 21:00
  • The problem is I'm not actually pulling from two databases but rather a database-like XML document that I have no control over the order of. – Tyler Aug 24 '10 at 21:23
  • Then do the sorting yourself, writing the results to temporary files. If "main memory" isn't big enough, you fall back to mass storage. You know, all those great algorithms developed in the 50s and 60s for working with paper and magnetic tapes... – erickson Aug 24 '10 at 21:27
3

Have you looked at the Flyweight Pattern? Do you have lots of equal objects?

This pattern might be appropriate for your 'key', as I imagine the field names are going to be repeated for each row. If they're Strings, you can call intern() so that they'll share the same memory location with other equal Strings; this is safe because Strings are immutable.
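As a rough sketch of the intern() idea (the variable names are illustrative):

import java.util.HashMap;
import java.util.Map;

Map<String, String> row = new HashMap<String, String>();
// intern() returns the canonical instance from the String pool, so every
// row shares one String object per column name instead of its own copy.
row.put(fieldName.intern(), fieldValue);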

Another possible optimization - for speed rather than memory - if concurrency is not an issue, would be to use an ArrayList rather than a Vector: ArrayList is not synchronized, so accesses should be a little faster. Similarly, HashMap isn't synchronized and Hashtable is, so using the former might be faster too.

Noel M
  • Careful with intern() - you might overflow that area. – Thorbjørn Ravn Andersen Aug 24 '10 at 20:49
  • @Thorbjørn: I had that problem once, and increasing the perm gen solved it in my specific case: http://stackoverflow.com/questions/3094925/trying-to-solve-15-puzzle-outofmemoryerror/3095101#3095101 – OscarRyz Aug 24 '10 at 20:53
  • Is that a permanent solution, or will the perm gen increment need to be adjusted if the result sets grow even larger? – Thorbjørn Ravn Andersen Aug 24 '10 at 20:56
  • There are typically a lot of equal objects. I'll check out that pattern. The key names do indeed repeat a lot, and often the values as well. The Hashtables that end up matching are never in the same memory location, so I end up comparing key-values directly to find matching tables. – Tyler Aug 24 '10 at 21:23
  • @Thorbjørn Works most of the time (well, I have only needed it 3 times, fortunately! :P). But, of course, you're still limited by your physical RAM capacity. So, if you have 1 GB for perm gen and you create 2 GB of strings, you'll get out of memory anyway (which you would have hit way sooner without interning). See the graphs. In that particular case the strings repeated A LOT, so 25 MB of perm gen sufficed where 2 GB of non-interned strings wasn't enough before. For strings that differ too much, this doesn't work (just like you point out). – OscarRyz Aug 24 '10 at 21:38
2

You don't specify what kind of comparison you need, but I would reduce the amount of data held by the HashMap/Vector by transforming each row's information into a single hash number.

Something like this:

import java.sql.ResultSet;
import java.sql.SQLException;

class RowHash {
    private final int id;       // the row id
    private final int hashCode; // summary of the whole row's contents

    public RowHash(ResultSet rs) throws SQLException {
        this.id = rs.getInt("id");
        // Concatenate the strings from all the fields and hash the result
        this.hashCode = new StringBuilder()
                .append(rs.getString("field1"))
                .append(rs.getString("field2"))
                .append(rs.getString("fieldN"))
                .toString().hashCode();
    }

    @Override
    public boolean equals(Object other) {
        // Note: equal hashes do not guarantee equal rows (collisions are possible)
        return other instanceof RowHash
                && this.hashCode == ((RowHash) other).hashCode;
    }

    @Override
    public int hashCode() {
        return hashCode;
    }
}

And then store the instances in an ArrayList, which, unlike Vector, is not synchronized.

 ...
 ResultSet rs = ...
 List<RowHash> arrayList = new ArrayList<RowHash>();
 while (rs.next()) {
     arrayList.add(new RowHash(rs));
 }

Well, that's the idea (and it depends on the comparison you need): compute a number representing the whole record, and then use that single number to see whether the other query has it.

Bear in mind that this is just a concept; you'll have to modify it to suit your needs.
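For instance, one way to use it (a sketch, assuming the RowHash class above and two lists already loaded):

import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Membership test on the hashes instead of on full row contents.
Set<RowHash> set1 = new HashSet<RowHash>(list1);
for (RowHash row : list2) {
    if (!set1.contains(row)) {
        System.out.println("Row not found in the first query");
    }
}
// Since equal hashes can collide, a match here is only probable;
// the id field lets you re-fetch and verify any suspect rows.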

Another (probably simpler) way to reduce the amount of memory used by a program that handles a lot of strings is to call intern().

See this answer to compare the impact, but really it depends on your data.

Here's a before/after comparison using intern() on that answer (screenshots omitted): the area in blue is memory used; around 2 GB before interning, under 25 MB after.

OscarRyz
  • Cool. I think parts of this could be helpful, but two Hashtables are considered the same if certain key fields match, so I think hashing the whole contents like that wouldn't work for me. Furthermore, I reference the keys in other parts of the program, so I'd need the hash you suggest PLUS all the regular keys. – Tyler Aug 24 '10 at 21:27
  • Yeap, it is very hard to answer without specific requirements. Probably you could do this first, without a Map (HashMap), and get the ids of the differing rows; so instead of holding gigabytes in memory, you'll only hold the ids and then re-fetch only those records. Again, it depends on your particular needs. Good luck. – OscarRyz Aug 24 '10 at 21:31
1

If you can sort both query results, you should adapt the sort-merge join algorithm.

Skarab
1

You could encapsulate each row in your own object, for instance a 'MyRecord' class, which is smaller than a HashMap; then you would work with a List of 'MyRecord' (see the sketch below).
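A sketch of what such a class might look like (field names are illustrative; a real version would mirror your actual columns):

// One plain object per row instead of a Hashtable: no per-entry
// Map.Entry objects and no key strings stored per row.
class MyRecord {
    final String field1;
    final String field2;
    // ... about 8 fields in total, one per column

    MyRecord(String field1, String field2 /* , ... */) {
        this.field1 = field1;
        this.field2 = field2;
    }
}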

If you have to use HashMap, use new HashMap(7, 1) instead of the default constructor; that could save memory, since you said each map holds a fixed 8 key-value pairs (an initial capacity of 7 rounds up to 8, and a load factor of 1 means no resize until a 9th entry is added).

卢声远 Shengyuan Lu
0

If you do not have the memory, you will need external storage backing your data structure, which is hard to do correctly (maps of weak references to your data, which all need to be rolled out to disk, etc.), and you will probably still end up with bad performance when scaling.

If you really have lots and lots of data, I would suggest embedding a SQL database. Then you can generate two tables containing your data and ask the database to find out any differences, and drop the tables afterwards. I've previously played with Derby, which I found nice, but others exist.
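A sketch of the idea with embedded Derby (the table names, columns, and in-memory URL are assumptions; the in-memory mode needs Derby 10.5 or later, and derby.jar must be on the classpath):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

static void diffInDerby() throws SQLException {
    // An in-memory database that disappears when the JVM exits
    Connection conn = DriverManager.getConnection("jdbc:derby:memory:diffdb;create=true");
    Statement st = conn.createStatement();
    st.execute("CREATE TABLE query1 (id INT, field1 VARCHAR(50), field2 VARCHAR(50))");
    st.execute("CREATE TABLE query2 (id INT, field1 VARCHAR(50), field2 VARCHAR(50))");

    // ... insert the rows of each source via PreparedStatement batches ...

    // Rows in query1 with no exact counterpart in query2; swap the
    // table names for the opposite direction.
    ResultSet diff = st.executeQuery(
            "SELECT * FROM query1 EXCEPT SELECT * FROM query2");
    while (diff.next()) {
        System.out.println("Only in query1: id=" + diff.getInt("id"));
    }
}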

Thorbjørn Ravn Andersen
0

If your dataset does not fit into memory, then do an external sort followed by a sort-merge join, as already pointed out in another answer.

If your dataset does fit into memory, then just use a lot of memory - it's fastest that way.

Or, if you are interested in specific optimizations that just do what you already do a little bit better - I can't help you.

Nakedible