7

I have x (millions of) positive integers, whose values can be as large as allowed (+2,147,483,647). Assuming they are unique, what is the best way to store them for a lookup-intensive program?

So far I have thought of using a binary AVL tree or a hash table, where the integer is the key to the mapped data (a name). However, I am not sure whether I can handle keys that large, and in such large quantity, with a hash table (wouldn't that create a load factor above 0.8, in addition to being prone to collisions?).

Could I get some advice on which data structure might be suitable for my situation?

Carlos
  • 5,405
  • 21
  • 68
  • 114
  • 1
    Are you trying to keep this entire structure in memory? Databases commonly use B-tree for that kind of search. The structure is stored on disk and it takes only a small number of accesses to find the desired key even with a very large number of keys in the index. – JOTN Nov 24 '10 at 01:52
  • @JOTN: CPU cache line fills can have the same effect on performance that database page reads do, albeit at microsecond rather than millisecond scale. – Jeffrey Hantin Nov 24 '10 at 02:02
  • if you are going to use a self-balancing tree then I strongly recommend that you read this paper: http://web.stanford.edu/~blp/papers/libavl.pdf – anilbey Jul 30 '14 at 12:06

5 Answers

7

The choice of structure depends heavily on how much memory you have available. I'm assuming, based on the description, that you need lookup only, not ordered iteration, find-nearest, or similar operations.

Best is probably a bucketed hash table. By placing hash collisions into buckets and keeping separate arrays in the bucket for keys and values, you can both reduce the size of the table proper and take advantage of CPU cache speedup when searching a bucket. Linear search within a bucket may even end up faster than binary search!
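A minimal sketch of that bucketed layout in C++, assuming keys hash into a fixed array of buckets and that each bucket keeps its keys and values in separate contiguous arrays (the bucket count and multiplicative hash are illustrative choices, not tuned values):

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Illustrative bucketed hash table: a lookup is one hash plus a
// cache-friendly linear scan over the bucket's key array.
class BucketedHash {
public:
    explicit BucketedHash(std::size_t bucketCount = 1 << 20)
        : buckets_(bucketCount) {}

    void insert(std::uint32_t key, std::string value) {
        Bucket& b = buckets_[index(key)];
        b.keys.push_back(key);
        b.values.push_back(std::move(value));
    }

    const std::string* find(std::uint32_t key) const {
        const Bucket& b = buckets_[index(key)];
        for (std::size_t i = 0; i < b.keys.size(); ++i)  // linear scan within bucket
            if (b.keys[i] == key) return &b.values[i];
        return nullptr;
    }

private:
    struct Bucket {
        std::vector<std::uint32_t> keys;   // separate arrays: scanning the keys
        std::vector<std::string>  values;  // does not drag the values into cache
    };

    std::size_t index(std::uint32_t key) const {
        return (key * 2654435761u) % buckets_.size();  // Knuth multiplicative hash
    }

    std::vector<Bucket> buckets_;
};
```

With around a million buckets and a few million keys, each bucket holds only a handful of entries, which is where the short linear scan pays off.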

AVL trees are nice for data sets that are read-intensive but not read-only AND require ordered enumeration, find-nearest and similar operations, but they're an annoying amount of work to implement correctly. You may get better performance with a B-tree because of CPU cache behavior, though, especially a cache-oblivious B-tree algorithm.

Jeffrey Hantin
  • 35,734
  • 7
  • 75
  • 94
2

A bit vector, with the bit at each index set if the number is present. You can tweak it to record the number of occurrences of each number instead. There is a nice column about bit vectors in Bentley's Programming Pearls.
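A rough sketch of the idea in C++, assuming one bit per possible positive 31-bit value, which costs about 256 MB regardless of how many numbers are actually stored:

```cpp
#include <cstdint>
#include <vector>

// Bit vector over the whole positive 32-bit range: bit n is set iff n is present.
// Memory cost is fixed at 2^31 bits (~256 MB), independent of how many keys exist.
class BitSet31 {
public:
    BitSet31() : words_((1ull << 31) / 64, 0) {}

    void insert(std::uint32_t n)         { words_[n >> 6] |= (1ull << (n & 63)); }
    bool contains(std::uint32_t n) const { return words_[n >> 6] & (1ull << (n & 63)); }

private:
    std::vector<std::uint64_t> words_;
};
```

Note that this answers membership only; the name mapped to each integer in the question would still need a separate structure.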

gsb
  • 41
  • 1
  • 5
2

Have you looked into B-trees? The search depth runs between log_m(n) and log_{m/2}(n), so if you choose m to be around 8-10 or so you should be able to keep your search depth below 10.
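As a rough check (assuming, say, n = 5,000,000 keys): with m = 10 the depth falls between log_10(5,000,000) ≈ 6.7 and log_5(5,000,000) ≈ 9.6, so a lookup touches fewer than 10 nodes.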

fmt
  • 993
  • 9
  • 18
1

If memory isn't an issue, a map is probably your best bet. Maps are O(1), meaning that as you scale up the number of items to be looked up, the time it takes to find a value stays the same.

A map where the key is the int, and the value is the name.
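A minimal sketch of this in C++, assuming the constant-time claim refers to a hash map (std::unordered_map); the sample keys and names are made up:

```cpp
#include <cstdint>
#include <iostream>
#include <string>
#include <unordered_map>

int main() {
    // key: the integer, value: the mapped name
    std::unordered_map<std::uint32_t, std::string> names;
    names[2147483647u] = "alice";   // sample entries, purely illustrative
    names[42u]         = "bob";

    auto it = names.find(42u);
    if (it != names.end())
        std::cout << it->second << '\n';   // prints "bob"
    return 0;
}
```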

Michael Peddicord
  • 459
  • 1
  • 5
  • 15
  • 1
    Not to be rude or anything, but as I am assuming that his table is sparse, wouldn't that require a ridiculous amount of memory? – fmt Nov 24 '10 at 01:58
  • 1
    Oh definitely, it would take a ton of memory. But I did qualify that statement with an "If memory isn't an issue"... just an idea. – Michael Peddicord Nov 24 '10 at 02:05
  • How can I calculate the amount of memory I will need; in this case, how much memory would your implementation take? Is there any way to calculate that? – Carlos Nov 24 '10 at 02:11
  • By map do you mean some (variant on) bitvector (in this case)? I can't really think of any other guaranteed O(1) structure. Specifically, not a map as implemented by a tree. – lijie Nov 24 '10 at 02:24
  • a map just means something with a key and a record; even a linearly-searched list qualifies. you're probably talking about a hash table, or "hash map" as it is known in some libraries. – Javier Nov 24 '10 at 02:40
0

Do try hash tables first. There are some variants that can tolerate being very dense without significant slowdown (like Brent's variation).

If you only need to store the 32-bit integers and not any associated record, use a set and not a map, like hash_set in most C++ libraries. It would use only 4-byte records plus some constant overhead and a little slack to avoid being 100% full. In the worst case, to handle 'millions' of numbers you'd need a few tens of megabytes. Big, but nothing unmanageable.

If you need it to be much tighter, just store them sorted in a plain array and use binary search to fetch them. It will be O(log n) instead of O(1), but for 'millions' of records it's still just twentysomething steps to get any one of them. In C you have bsearch(), which is as fast as it can get.
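A minimal sketch of the sorted-array approach, assuming C++ (std::lower_bound plays the role of C's bsearch here; the sample keys are made up):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Keys kept sorted in one contiguous array: ~4 bytes per key,
// lookups in O(log n) via binary search.
bool contains(const std::vector<std::uint32_t>& sorted, std::uint32_t key) {
    auto it = std::lower_bound(sorted.begin(), sorted.end(), key);
    return it != sorted.end() && *it == key;
}

int main() {
    std::vector<std::uint32_t> keys = {7, 42, 1000003, 2147483647u};  // sample data
    std::sort(keys.begin(), keys.end());   // sort once up front
    return contains(keys, 42) ? 0 : 1;
}
```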

Edit: I just saw that in your question you talk about some 'mapped data (a name)'. Are those names unique? Do they also have to be in memory? If yes, they would definitely dominate the memory requirements. Even so, if the names are typical English words, most would be 10 bytes or less, keeping the total size in the 'tens of megabytes'; maybe up to a hundred megs, still very manageable.

Javier
  • 60,510
  • 8
  • 78
  • 126