Java data structure of 500 million (double) values?

Question

I am generating random edges for a complete graph with 32678 vertices, so 500 million+ values.

I am using a HashMap with the edges as keys and the random edge weights as values. I keep encountering:

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at java.lang.StringBuilder.toString(StringBuilder.java:430)
    at pa1.Graph.<init>(Graph.java:60)
    at pa1.Main.main(Main.java:19)

This graph will then be used to construct a Minimum Spanning Tree.

Any ideas on a better data-structure or approach?

I know there are overrides to allocate more memory, but I would prefer a solution that works as-is.

I've thought of that-- it is a HW Problem, and I don't think that a DB is the correct route. Just feel like I may be missing something here... — quannabe, Mar 02 '13 at 06:28
umm, correct me if I'm wrong, but `500 000 000 * 32 bits = 1.86264515 gigabytes`, and the default Xmx for Java is no more than 128 MB. — Denis Tulskiy, Mar 02 '13 at 06:30
Sounds like you need to consider adjacency matrix for such a densely populated graph, rather than HashMap. Although that's still several GB for the one map. — Billy ONeal, Mar 02 '13 at 06:30
(Also note that a double is 64 bits, not 32. So that'd be 64 * 32768 * 32768 = 68719476736 bits = 8 589 934 592 bytes (8.5 GB). A hash table is going to be at least 20% worse than this, and probably more) — Billy ONeal, Mar 02 '13 at 06:32
@BillyONeal That's pretty huge! Like I mentioned, this is a HW problem-- we generate (large) complete graphs and then find the Minimum Spanning Tree. It is obviously do-able, just having trouble getting started! — quannabe, Mar 02 '13 at 06:36
@quannabe: You can generate large graphs without making them fully connected. — Billy ONeal, Mar 02 '13 at 06:37
@BillyONeal my mistake, I should have been more clear-- they are required to be complete graphs with 32678 vertices :) — quannabe, Mar 02 '13 at 06:39
@BillyONeal: It's presumably an undirected graph, so we'd be looking at (slightly under) 4GiB and not 8GiB. — Nabb, Mar 02 '13 at 06:41
@Nabb: That's an interesting presumption. (Yes, MST is usually an undirected thing; but the OP didn't mention MST at all when I had posted my comment) But even 4GB is pretty large. — Billy ONeal, Mar 02 '13 at 06:43
Does all this data need to be available at once to construct the MST? If there is a way to write it so that it can be [parallelized](http://www.hipc.org/hipc2009/documents/HIPCSS09Papers/1569250351.pdf) (first Google hit, I really have no clue), maybe something similar can be done for [resident memory] space? — , Mar 02 '13 at 09:37
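
The sizes quoted in the comments above can be checked with a few lines of arithmetic. This is only a sketch: it assumes 32768 vertices (the figure the commenters use), counts raw primitive values only, and ignores array headers and any HashMap overhead. The class and method names are made up for illustration.

```java
// Back-of-the-envelope sizes for a complete graph (illustrative names only).
class GraphSizeEstimate {
    // Number of undirected edges in a complete graph on n vertices: n(n-1)/2.
    static long undirectedEdges(long n) {
        return n * (n - 1) / 2;
    }

    // Bytes needed for `count` primitive values of `bytesPerValue` bytes each.
    static long rawBytes(long count, int bytesPerValue) {
        return count * bytesPerValue;
    }

    public static void main(String[] args) {
        long n = 32768; // assumption: vertex count as used in the comments
        long edges = undirectedEdges(n);
        System.out.println("undirected edges:        " + edges);
        System.out.println("full n*n matrix, double: " + rawBytes(n * n, Double.BYTES) + " bytes");
        System.out.println("one triangle, double:    " + rawBytes(edges, Double.BYTES) + " bytes");
        System.out.println("one triangle, float:     " + rawBytes(edges, Float.BYTES) + " bytes");
    }
}
```

With these assumptions the full matrix of doubles comes to 8,589,934,592 bytes, and storing each undirected edge once as a double comes to 4,294,836,224 bytes, which is slightly under 4 GiB.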

user949300 · Answer 1 · 2013-03-02T21:22:51.703

A HashMap will be very large because it will contain Doubles (with a capital D), which are significantly larger than 8 bytes (not to mention the Entry objects). It depends on the implementation and the CPU, but I think it's at least 16 bytes each, and probably more.

I think you should consider keeping the primary data in a huge double[] (or, if you can spare some accuracy, a float[]). That cuts memory usage by an easy 2x or 4x (500M floats is a "mere" 2 GB). Then use integer indexes into this array to implement your edges and vertices. For example, an edge could be an int[2]. This is far from O-O, and there's some serious hand-waving here (and I don't understand all the nuances of what you are trying to do).

Very "old fashioned" in style, but requires a lot less memory.

Correction - I think an edge might be int[4], a vertex an int[2]. But you get the idea. Actually, for edges and vertices, you will have a smaller number of Objects and for them you can probably use "real" Objects, Maps, etc...
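
As a rough illustration of the primitive-array idea, here is a minimal sketch that packs one weight per unordered vertex pair into a single float[]. The class `PackedWeights`, its methods, and the lower-triangle slot layout are my own assumptions for the example, not something specified above.

```java
import java.util.Random;

// Hypothetical helper: stores the weights of a complete undirected graph
// in one primitive float[] instead of a HashMap of boxed values.
class PackedWeights {
    private final float[] w; // one slot per unordered pair {i, j}, i != j

    PackedWeights(int n) {
        // n(n-1)/2 slots; toIntExact fails loudly if that exceeds the int range.
        this.w = new float[Math.toIntExact((long) n * (n - 1) / 2)];
    }

    // Slot for the pair {i, j} (0-based, i != j), laid out as rows of a lower triangle.
    private int slot(int i, int j) {
        int hi = Math.max(i, j), lo = Math.min(i, j);
        return (int) ((long) hi * (hi - 1) / 2 + lo);
    }

    float get(int i, int j) { return w[slot(i, j)]; }

    void set(int i, int j, float v) { w[slot(i, j)] = v; }

    int size() { return w.length; }

    // Fills every slot with a pseudo-random weight in [0, 1).
    void fillRandom(long seed) {
        Random rnd = new Random(seed);
        for (int k = 0; k < w.length; k++) w[k] = rnd.nextFloat();
    }

    public static void main(String[] args) {
        PackedWeights g = new PackedWeights(1000); // small demo size
        g.fillRandom(42L);
        System.out.println(g.size() + " weights, w(3,7) = " + g.get(3, 7));
    }
}
```

For 32768 vertices the array would need 536,854,528 slots, which still fits in a single Java array but needs roughly 2 GB of heap for floats, so the default heap size would have to be raised.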

vijay · Answer 2 · 2013-03-02T06:51:01.003


Since it is a complete graph, there is no doubt about what the edges are. How about storing the labels (weights) for those edges in a simple list that is ordered in a fixed manner? E.g., if you have 5 nodes, the weights would be stored in the following edge order: {1,2}, {1,3}, {1,4}, {1,5}, {2,3}, {2,4}, {2,5}, {3,4}, {3,5}, {4,5}.
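
With that ordering, the position of any edge in the list can be computed directly from its endpoints, so the edges themselves never need to be stored. A small sketch, assuming 1-based vertex labels as in the example (the class and method names are illustrative):

```java
// Illustrative helper: maps an edge {i, j} to its position in the ordered list.
class EdgeOrder {
    // 0-based position of edge {i, j} (1-based labels, i != j, n vertices) in the
    // ordering {1,2}, {1,3}, ..., {1,n}, {2,3}, ..., {n-1,n}.
    static long position(int i, int j, int n) {
        if (i > j) { int t = i; i = j; j = t; } // treat the edge as undirected
        // Edges contributed by rows 1 .. i-1: (n-1) + (n-2) + ... + (n-i+1).
        long rowsBefore = (long) (i - 1) * n - (long) i * (i - 1) / 2;
        return rowsBefore + (j - i - 1);
    }

    public static void main(String[] args) {
        int n = 5;
        for (int i = 1; i <= n; i++)
            for (int j = i + 1; j <= n; j++)
                System.out.println("{" + i + "," + j + "} -> " + position(i, j, n));
    }
}
```

For 5 nodes this assigns positions 0 through 9 in exactly the order listed above, e.g. {2,3} lands at position 4 and {4,5} at position 9.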

However, as pointed out by @BillyO'Neal, this might still take up 8 GB of space. You might want to split this list into multiple files and maintain an index of these files recording where one set of weights ends in one file and where the next set begins.

Additionally, given that you are finding the MST for the graph, you might want to have a look at the following paper as well: http://cvit.iiit.ac.in/papers/Vibhav09Fast.pdf. The paper seems to be based on Borůvka's algorithm (http://en.wikipedia.org/wiki/Bor%C5%AFvka's_algorithm; http://iss.ices.utexas.edu/?p=projects/galois/benchmarks/mst).

edited Mar 02 '13 at 06:51

answered Mar 02 '13 at 06:31


This is functionally equivalent to my adjacency matrix suggestion. But that still requires 8+GB of memory for a fully connected graph with 32768 nodes – Billy ONeal Mar 02 '13 at 06:34
Since I am building a minimum spanning tree from this generated graph-- is there a way I can approach this that will let me somehow build the MST as I am generating the edges? Or does that not make sense? – quannabe Mar 02 '13 at 06:38
@quannabe this was not part of your original question. it might help to add the MST requirement/goal in the question. it will help you get better answers. – vijay Mar 02 '13 at 06:45
@quannabe: Both Prim's algorithm and Kruskal's algorithm (the two common ways of calculating MSTs) require building a set of all edges in the graph. – Billy ONeal Mar 02 '13 at 06:46
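
On the question of building the MST while generating the edges: one possible approach, sketched here under assumptions, is the array-based form of Prim's algorithm. On a complete graph it asks for each edge weight exactly once (when the first of its two endpoints joins the tree), so if the weights are independent random values they can be drawn on demand rather than stored. Whether the assignment allows this is an assumption; `LazyPrim` and its `Weights` interface are hypothetical names for this example.

```java
import java.util.Arrays;
import java.util.Random;

class LazyPrim {
    // Supplies the weight of edge {u, v}; hypothetical interface for this sketch.
    interface Weights {
        double weight(int u, int v);
    }

    // Array-based Prim's algorithm on a complete graph with n vertices (0-based).
    // Returns the total MST weight using O(n) memory; edges are never stored.
    static double mstWeight(int n, Weights w) {
        double[] best = new double[n];     // cheapest known edge from the tree to each vertex
        boolean[] inTree = new boolean[n];
        Arrays.fill(best, Double.POSITIVE_INFINITY);
        best[0] = 0.0;                     // start the tree at vertex 0
        double total = 0.0;
        for (int step = 0; step < n; step++) {
            // Pick the vertex outside the tree that is cheapest to attach.
            int u = -1;
            for (int v = 0; v < n; v++)
                if (!inTree[v] && (u == -1 || best[v] < best[u])) u = v;
            inTree[u] = true;
            total += best[u];
            // Each edge {u, v} is requested once: when its first endpoint joins the tree.
            for (int v = 0; v < n; v++) {
                if (!inTree[v]) {
                    double x = w.weight(u, v);
                    if (x < best[v]) best[v] = x;
                }
            }
        }
        return total;
    }

    public static void main(String[] args) {
        Random rnd = new Random(1L);
        // Weights are drawn lazily; no edge list or matrix is ever built.
        System.out.println(mstWeight(2000, (u, v) -> rnd.nextDouble()));
    }
}
```

This variant does quadratic work in the number of vertices, which matches the number of edges in a complete graph, and it only returns the total weight; recording a parent array alongside `best` would recover the tree edges.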
