
I am not at all an expert in database design, so I will put my need in plain words before I try to translate it into CS terms: I am trying to find the right way to iterate quickly over large subsets (say ~100 MB of doubles) of data within a potentially very large dataset (say several GB). I have objects that basically consist of 4 integers (the keys) and the value, a simple struct (1 double, 1 short). Since my keys can only take a small number of values (a couple hundred each), I thought it would make sense to store my data as a tree (one depth level per key, values at the leaves, much like XPath over an XML document, in my naive view at least).
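To make this concrete, here is a minimal sketch of the record and the naive nested-map "tree" I have in mind (the names are placeholders, not actual code from my project):

```cpp
#include <map>

// One data point: the payload stored at a leaf.
struct Leaf {
    double value;
    short  flag;
};

// Naive "tree" view: one map level per key, leaves at the bottom.
// Each key only takes a couple hundred distinct values.
typedef std::map<int,
        std::map<int,
        std::map<int,
        std::map<int, Leaf> > > > KeyTree;
```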

I want to be able to iterate through subsets of leaves based on the key values, or on a function of those key values. Which key combination to filter on will vary. I think this is called a transversal search?
So, to avoid comparing the same keys n times, ideally I would need the data structure to be indexed by each permutation of the keys (12 possibilities: 4!/2!). This seems to be what boost::multi_index is for but, unless I'm overlooking something, the way this would be done is by actually constructing those 12 tree structures, storing pointers to my value nodes as leaves. I guess this would be extremely space-inefficient considering the small size of my values compared to the keys.

Any suggestions regarding the design / data structure I should use, or pointers to concise educational materials regarding these topics would be very appreciated.

Sam
  • If you have several gigabytes of data, chances are you need a more complex system to handle it efficiently. Unless you have a machine with quite a bit more memory than the size of the data, you'll need to do caching and related. Boost's `multi_index` container is good, but it's not efficient in terms of space, and probably won't be very useful unless you have enough memory to support it. – Collin Dauphinee Jul 22 '11 at 15:50
  • I can add several GB of RAM if needed; however, I think there's a low limit on the memory a program can address on a 32-bit system... – Sam Jul 22 '11 at 16:01

3 Answers

4

With Boost.MultiIndex, you don't need as many as 12 indices (BTW, the number of permutations of 4 elements is 4! = 24, not 12) to cover every query on a particular subset of the 4 keys: thanks to the use of composite keys, and with a little ingenuity, 6 indices suffice.
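To give an idea of what this looks like, here is a minimal sketch of two of those six indices, assuming a Record layout along the lines of the question (the struct and key names are illustrative; the exact six key orders are derived in the article linked below):

```cpp
#include <utility>
#include <boost/multi_index_container.hpp>
#include <boost/multi_index/composite_key.hpp>
#include <boost/multi_index/member.hpp>
#include <boost/multi_index/ordered_index.hpp>
#include <boost/tuple/tuple.hpp>

// Hypothetical layout of the element described in the question.
struct Record {
    int k1, k2, k3, k4;   // the four keys
    double value;
    short flag;
};

namespace bmi = boost::multi_index;

// Two of the six indices; the remaining four use other key orders so that
// every subset of {k1,k2,k3,k4} appears as a prefix of some index.
typedef bmi::multi_index_container<
    Record,
    bmi::indexed_by<
        bmi::ordered_non_unique<   // serves {k1}, {k1,k2}, {k1,k2,k3}, {k1,k2,k3,k4}
            bmi::composite_key<
                Record,
                bmi::member<Record, int, &Record::k1>,
                bmi::member<Record, int, &Record::k2>,
                bmi::member<Record, int, &Record::k3>,
                bmi::member<Record, int, &Record::k4>
            >
        >,
        bmi::ordered_non_unique<   // serves {k4}, {k3,k4}, {k2,k3,k4}, ...
            bmi::composite_key<
                Record,
                bmi::member<Record, int, &Record::k4>,
                bmi::member<Record, int, &Record::k3>,
                bmi::member<Record, int, &Record::k2>,
                bmi::member<Record, int, &Record::k1>
            >
        >
        // ...four more indices with the remaining key orders...
    >
> RecordSet;

// Querying on a prefix of a composite key: all records with k1 == 3 and k2 == 7.
void example(const RecordSet& rs) {
    std::pair<RecordSet::nth_index<0>::type::const_iterator,
              RecordSet::nth_index<0>::type::const_iterator>
        r = rs.get<0>().equal_range(boost::make_tuple(3, 7));
    for (; r.first != r.second; ++r.first) {
        // iterate over the matching leaves here
        (void)r.first->value;
    }
}
```

equal_range called with a tuple that fills in only a prefix of a composite key returns exactly the matching range, so a query on, say, (k1, k2) performs no comparisons on the remaining keys.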

By some happy coincidence, I provided an example in my blog some years ago showing how to do this in a manner that almost exactly matches your particular scenario:

Multiattribute querying with Boost.MultiIndex

Source code is provided that you can hopefully use with little modification to suit your needs. The theoretical justification of the construction is also provided in a series of articles on the same blog.

The maths behind this is not trivial, and you can safely ignore it; if you need help understanding it, though, do not hesitate to comment on the blog articles.

How much memory does this container use? In a typical 32-bit computer, the size of your objects is 4*sizeof(int)+sizeof(double)+sizeof(short)+padding, which typically yields 32 bytes (checked with Visual Studio on Win32). To this Boost.MultiIndex adds an overhead of 3 words (12 bytes) per index, so for each element of the container you've got

32+6*12 = 104 bytes + padding.

Again, I checked with Visual Studio on Win32 and the size obtained was 128 bytes per element. If you have 1 billion (10^9) elements, then 32 bits is not enough: going to a 64-bit OS will most likely double the size of the objects, so the memory needed would amount to 256 GB, which calls for quite a powerful beast (I don't know whether you are using something as huge as this).
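As a rough cross-check of those figures, a small sketch of the arithmetic (the 3-words-per-index overhead is the number quoted above; actual results depend on compiler padding and the allocator):

```cpp
#include <cstddef>
#include <cstdio>

struct Record {
    int k1, k2, k3, k4;   // 4 * 4 bytes
    double value;         // 8 bytes
    short flag;           // 2 bytes + padding
};

int main() {
    const std::size_t indices  = 6;
    const std::size_t per_node = 3 * sizeof(void*);  // node overhead per index
    std::printf("sizeof(Record)      : %u bytes\n", (unsigned)sizeof(Record));
    std::printf("estimated node size : %u bytes\n",
                (unsigned)(sizeof(Record) + indices * per_node));
    return 0;
}
```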

Joaquín M López Muñoz
  • Joaquín, it's a great honor to have your very own answer. I've spent some time studying your documentation and I must say I really appreciate the effort you put into making this subject accessible. Thank you very much for your contribution! – Sam Aug 07 '11 at 10:31
1

B-Tree indexes and bitmap indexes are two of the major index types in use, but they aren't the only ones; you should explore them. Something to get you started:

Article evaluating when to use B-Tree and when to use Bitmap
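Since each of your keys only takes a couple hundred distinct values, the bitmap idea is easy to sketch; the following is purely illustrative (the class and its members are made up, not a library component):

```cpp
#include <cstddef>
#include <map>
#include <vector>

// One bitmap per distinct key value: bit i is set when row i has that value.
class BitmapIndex {
public:
    void insert(int key_value, std::size_t row) {
        std::vector<unsigned long long>& bits = bitmaps_[key_value];
        if (bits.size() <= row / 64) bits.resize(row / 64 + 1, 0);
        bits[row / 64] |= 1ULL << (row % 64);
    }

    // Rows carrying this key value, or a null pointer if the value was never inserted.
    const std::vector<unsigned long long>* rows(int key_value) const {
        std::map<int, std::vector<unsigned long long> >::const_iterator it =
            bitmaps_.find(key_value);
        return it == bitmaps_.end() ? 0 : &it->second;
    }

private:
    std::map<int, std::vector<unsigned long long> > bitmaps_;
};

// A conjunctive filter such as "k1 == a && k4 == b" then becomes a bitwise AND
// of two bitmaps, with no per-row key comparisons.
```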

DumbCoder
0

It depends on the algorithm accessing it, honestly. If this structure needs to be resident, and you can afford the memory consumption, then just do it. multi_index is fine, though it will destroy your compile times if it's in a header.

If you just need a one-time traversal, then building the structure would be kind of a waste. Something like next_permutation may be a good place to start.
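For illustration, a tiny sketch of what next_permutation buys you here, namely every order in which the four keys could be visited (purely a sketch):

```cpp
#include <algorithm>
#include <cstdio>

int main() {
    int key_order[4] = {0, 1, 2, 3};   // indices of the four keys
    do {
        // 4! = 24 visiting orders, generated lazily in lexicographic order
        std::printf("%d %d %d %d\n",
                    key_order[0], key_order[1], key_order[2], key_order[3]);
    } while (std::next_permutation(key_order, key_order + 4));
    return 0;
}
```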

Tom Kerr
  • Thanks for your input. To clarify: – Sam Jul 22 '11 at 16:12
  • I am writing the software that accesses the data as well, so any combination of structure and algorithm can be considered. There will be many transversal extractions of subsets. Compile time is not a problem once in production. I think next_permutation could help present the keys in whatever order. However, I'm trying to be fast, which means, I think, that I should iterate through valid data as much as possible (not retrieve (key1, key2, key3, key4=000) 100^3 times for every key1/key2/key3 value), i.e. use key4 only as the key in this case. – Sam Jul 22 '11 at 16:20