
I guess I'm looking for a sparse array implementation, but I really need it to be memory-efficient. One peculiarity of my data that an implementation could take advantage of is that the populated indices are clustered: if index i has a value present, then indices i-1 and i+1 are also likely to have values present, and similarly, if index i has no value, then i-1 and i+1 are likely to be empty as well.

I'm working in Java, and I need the index type to be long rather than the more usual int, if this makes a difference. I have approximately 50 million objects that will need to be stored. I've looked into Trove4J's TLongObjectHashMap, but unfortunately it would require around 1.6 GB for the hash table alone, and I really need to improve on that.

Can anyone point me towards something that can optimize for long runs of sequentially allocated identifiers? Logarithmic performance of insert/get is acceptable to me, so perhaps something tree-based?
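
For concreteness, the sort of structure I imagine could exploit this clustering is a paged layout, where the long index is split into a page number and an offset, and a page is only allocated once one of its indices is populated. A rough, untested sketch (the class and constant names are just illustrative):

    import java.util.HashMap;
    import java.util.Map;

    // Sketch of a paged sparse array: the long index is split into a page
    // number and an offset, and a page is only allocated once one of its
    // indices is populated, so clustered indices share pages and empty
    // regions of the index space cost nothing.
    class PagedSparseArray<V> {
        private static final int PAGE_BITS = 10;              // 1024 slots per page
        private static final int PAGE_SIZE = 1 << PAGE_BITS;
        private static final int OFFSET_MASK = PAGE_SIZE - 1;

        private final Map<Long, Object[]> pages = new HashMap<Long, Object[]>();

        public void put(long index, V value) {
            long pageNo = index >>> PAGE_BITS;
            Object[] page = pages.get(pageNo);
            if (page == null) {
                page = new Object[PAGE_SIZE];
                pages.put(pageNo, page);
            }
            page[(int) (index & OFFSET_MASK)] = value;
        }

        @SuppressWarnings("unchecked")
        public V get(long index) {
            Object[] page = pages.get(index >>> PAGE_BITS);
            return page == null ? null : (V) page[(int) (index & OFFSET_MASK)];
        }
    }

With mostly-full pages, the per-entry cost approaches a single 8-byte reference plus a small share of the page-map overhead, well under the ~32 bytes per entry that the 1.6 GB hash table works out to.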

Jules
  • I'm not familiar with Trove4J; where do those 1.6 GiB for the hash table come from? With a load factor of 80% and 64-bit references, an open-addressing hash table should fit into 915 MiB (50 million * 1.2 * (64 + 64) bit). If you explicitly store all keys and references (which seems necessary for good performance), the information-theoretic minimum is 50 million * (64 + 64) bit = 762 MiB (this arithmetic is written out in the snippet after the comments). –  Sep 01 '13 at 16:26
  • What operations are required? What runtime complexity? – Ron Sep 01 '13 at 18:24
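
For reference, the first comment's arithmetic written out; the 80% load factor and 64-bit reference width are the commenter's assumptions:

    // Sanity check of the memory estimates from the comment above.
    public class MemoryEstimate {
        public static void main(String[] args) {
            long entries = 50000000L;
            long bytesPerEntry = 8 + 8;   // 64-bit key + 64-bit reference

            // the comment's factor of 1.2x slots per entry (strictly,
            // 80% load implies 1/0.8 = 1.25x) gives ~915 MiB
            double tableMiB = entries * 1.2 * bytesPerEntry / (1024.0 * 1024.0);
            // exactly one key and one reference per entry: ~762 MiB
            double minMiB = entries * bytesPerEntry / (1024.0 * 1024.0);

            System.out.printf("table at ~80%% load: %.0f MiB%n", tableMiB);
            System.out.printf("bare keys + refs:   %.0f MiB%n", minMiB);
        }
    }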

2 Answers


Maybe you could use a database instead of an array? An in-memory embedded database like h2sql!
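
A minimal, untested sketch, assuming the H2 driver is on the classpath; the table layout, names, and use of VARCHAR for the payload are just illustrative:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    // In-memory H2 database used as a long -> value store.
    // DB_CLOSE_DELAY=-1 keeps the database alive until the JVM exits.
    public class H2SparseStore {
        public static void main(String[] args) throws Exception {
            Connection conn = DriverManager.getConnection(
                    "jdbc:h2:mem:sparse;DB_CLOSE_DELAY=-1");
            conn.createStatement().execute(
                    "CREATE TABLE entries(id BIGINT PRIMARY KEY, payload VARCHAR)");

            PreparedStatement put = conn.prepareStatement(
                    "INSERT INTO entries(id, payload) VALUES (?, ?)");
            put.setLong(1, 123456789L);
            put.setString(2, "some value");
            put.executeUpdate();

            PreparedStatement get = conn.prepareStatement(
                    "SELECT payload FROM entries WHERE id = ?");
            get.setLong(1, 123456789L);
            ResultSet rs = get.executeQuery();
            if (rs.next()) {
                System.out.println(rs.getString(1));
            }
            conn.close();
        }
    }

Note that an in-memory database still keeps everything on the heap, which is what the first comment below is pointing out.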

GerritCap
  • A database doesn't have a smaller memory footprint than a dedicated data structure (especially since the database uses such structures itself)! – usamec Sep 01 '13 at 15:48
  • I haven't benchmarked it, but my suspicion is that serialization/deserialization would add significant overhead to the calculation I need to perform with this data (each item will be accessed randomly, and potentially many times, during the process), possibly adding hours or even days to the total running time. That said, it is something I would consider as a last resort. Do you know if h2sql is happy with databases that exceed 4GB in size? – Jules Sep 01 '13 at 16:04

B-trees have quite a small memory overhead, so I would try those.
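
To make the overhead claim concrete: in a B-tree the entries live in wide nodes as parallel arrays, so per entry you pay little more than the raw 8-byte key and 8-byte reference. A sketch of just a leaf node and its lookup (not a full B-tree; no insertion or splitting logic):

    import java.util.Arrays;

    // B-tree leaf sketch: keys and values in parallel arrays, so the
    // per-entry overhead stays close to the raw 16 bytes, unlike the
    // per-entry node objects and boxed keys of a binary search tree.
    class Leaf<V> {
        final long[] keys;      // sorted; only the first 'size' are in use
        final Object[] values;  // values[i] belongs to keys[i]
        int size;

        Leaf(int capacity) {
            keys = new long[capacity];
            values = new Object[capacity];
        }

        @SuppressWarnings("unchecked")
        V get(long key) {
            int i = Arrays.binarySearch(keys, 0, size, key);
            return i >= 0 ? (V) values[i] : null;
        }
    }

This is also why the last comment below warns against red-black trees: something like java.util.TreeMap spends several times that per entry on node objects and boxed Long keys.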

usamec
  • I've had a look, and haven't seen any in-memory B-tree implementations that would be useful for this problem. Do you have a particular one in mind, or are you suggesting this only as a general approach? – Jules Sep 01 '13 at 16:01
  • Only as a general approach. There are some implementations mentioned here: http://stackoverflow.com/questions/2574661/existing-implementation-of-btree-or-btree-in-java (do not use a red-black tree, it uses much more memory). Since you have 8-byte keys, I recommend going for something like a 16-32-way tree (so a node will fit in one line of L2 cache). – usamec Sep 01 '13 at 20:05