
I am not at all an expert in database design, so I will put my need in plain words before I try to translate it into CS terms: I am trying to find the right way to iterate quickly over large subsets (say ~100 MB of doubles) of data within a potentially very large dataset (say several GB). I have objects that basically consist of 4 integers (the keys) and the value, a simple struct (1 double, 1 short). Since my keys can only take a small number of values (a couple hundred each), I thought it would make sense to store my data as a tree (one depth level per key, values at the leaves, much like XPath over an XML document, in my naive view at least).
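To make this concrete, here is a minimal sketch of the record and the naive nested-map "tree" I have in mind (the names are placeholders, not actual code from my project):

```cpp
#include <map>

// One data point: the payload stored at a leaf.
struct Leaf {
    double value;
    short  flag;
};

// Naive "tree" view: one map level per key, leaves at the bottom.
// Each key only takes a couple hundred distinct values.
typedef std::map<int,
        std::map<int,
        std::map<int,
        std::map<int, Leaf> > > > KeyTree;
```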

I want to be able to iterate through subsets of leaves based on the key values, or on a function of those key values. Which key combination to filter on will vary. I think this is called a transversal search?
So, to avoid comparing the same keys n times, ideally I would need the data structure to be indexed by each permutation of the keys (12 possibilities: 4!/2!). This seems to be what boost::multi_index is for but, unless I'm overlooking something, the way this would be done is by actually constructing those 12 tree structures, storing pointers to my value nodes as leaves. I guess this would be extremely space-inefficient considering the small size of my values compared to the keys.

Any suggestions regarding the design / data structure I should use, or pointers to concise educational materials regarding these topics would be very appreciated.

Sam
  • If you have several gigabytes of data, chances are you need a more complex system to handle it efficiently. Unless you have a machine with quite a bit more memory than the size of the data, you'll need to do caching and related. Boost's `multi_index` container is good, but it's not efficient in terms of space, and probably won't be very useful unless you have enough memory to support it. – Collin Dauphinee Jul 22 '11 at 15:50
  • I can add several GB of RAM if needed; however, I think there's a low limit on the memory a program can address on a 32-bit system... – Sam Jul 22 '11 at 16:01

3 Answers

4

With Boost.MultiIndex, you don't need as many as 12 indices (BTW, the number of permutations of 4 elements is 4! = 24, not 12) to cover every query on a particular subset of the 4 keys: thanks to the use of composite keys, and with a little ingenuity, 6 indices suffice.
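To give an idea of what this looks like, here is a minimal sketch of two of those six indices, assuming a Record layout along the lines of the question (the struct and key names are illustrative; the exact six key orders are derived in the article linked below):

```cpp
#include <utility>
#include <boost/multi_index_container.hpp>
#include <boost/multi_index/composite_key.hpp>
#include <boost/multi_index/member.hpp>
#include <boost/multi_index/ordered_index.hpp>
#include <boost/tuple/tuple.hpp>

// Hypothetical layout of the element described in the question.
struct Record {
    int k1, k2, k3, k4;   // the four keys
    double value;
    short flag;
};

namespace bmi = boost::multi_index;

// Two of the six indices; the remaining four use other key orders so that
// every subset of {k1,k2,k3,k4} appears as a prefix of some index.
typedef bmi::multi_index_container<
    Record,
    bmi::indexed_by<
        bmi::ordered_non_unique<   // serves {k1}, {k1,k2}, {k1,k2,k3}, {k1,k2,k3,k4}
            bmi::composite_key<
                Record,
                bmi::member<Record, int, &Record::k1>,
                bmi::member<Record, int, &Record::k2>,
                bmi::member<Record, int, &Record::k3>,
                bmi::member<Record, int, &Record::k4>
            >
        >,
        bmi::ordered_non_unique<   // serves {k4}, {k3,k4}, {k2,k3,k4}, ...
            bmi::composite_key<
                Record,
                bmi::member<Record, int, &Record::k4>,
                bmi::member<Record, int, &Record::k3>,
                bmi::member<Record, int, &Record::k2>,
                bmi::member<Record, int, &Record::k1>
            >
        >
        // ...four more indices with the remaining key orders...
    >
> RecordSet;

// Querying on a prefix of a composite key: all records with k1 == 3 and k2 == 7.
void example(const RecordSet& rs) {
    std::pair<RecordSet::nth_index<0>::type::const_iterator,
              RecordSet::nth_index<0>::type::const_iterator>
        r = rs.get<0>().equal_range(boost::make_tuple(3, 7));
    for (; r.first != r.second; ++r.first) {
        // iterate over the matching leaves here
        (void)r.first->value;
    }
}
```

equal_range called with a tuple that fills in only a prefix of a composite key returns exactly the matching range, so a query on, say, (k1, k2) performs no comparisons on the remaining keys.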

By some happy coincidence, I provided an example in my blog some years ago showing how to do this in a manner that almost exactly matches your particular scenario:

Multiattribute querying with Boost.MultiIndex

Source code is provided that you can hopefully use with little modification to suit your needs. The theoretical justification of the construction is also provided in a series of articles on the same blog.

The maths behind this is not trivial, and you can safely ignore it; if you need help understanding it, though, do not hesitate to comment on the blog articles.

How much memory does this container use? In a typical 32-bit computer, the size of your objects is 4*sizeof(int)+sizeof(double)+sizeof(short)+padding, which typically yields 32 bytes (checked with Visual Studio on Win32). To this Boost.MultiIndex adds an overhead of 3 words (12 bytes) per index, so for each element of the container you've got

32+6*12 = 104 bytes + padding.

Again, I checked with Visual Studio on Win32 and the size obtained was 128 bytes per element. If you have 1 billion (10^9) elements, then 32 bits is not enough: going to a 64-bit OS will most likely double the size of the objects, so the memory needed would amount to 256 GB, which calls for quite a powerful beast (I don't know whether you are using something as huge as this).
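As a rough cross-check of those figures, a small sketch of the arithmetic (the 3-words-per-index overhead is the number quoted above; actual results depend on compiler padding and the allocator):

```cpp
#include <cstddef>
#include <cstdio>

struct Record {
    int k1, k2, k3, k4;   // 4 * 4 bytes
    double value;         // 8 bytes
    short flag;           // 2 bytes + padding
};

int main() {
    const std::size_t indices  = 6;
    const std::size_t per_node = 3 * sizeof(void*);  // node overhead per index
    std::printf("sizeof(Record)      : %u bytes\n", (unsigned)sizeof(Record));
    std::printf("estimated node size : %u bytes\n",
                (unsigned)(sizeof(Record) + indices * per_node));
    return 0;
}
```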

Joaquín M López Muñoz
  • Joaquín, it's a great honor to have your very own answer. I've spent some time studying your documentation and I must say I really appreciate the effort you put into making this subject accessible. Thank you very much for your contribution! – Sam Aug 07 '11 at 10:31
1

B-Tree indexes and bitmap indexes are two of the major index types in use, but they aren't the only ones; you should explore them. Something to get you started:

Article evaluating when to use B-Tree and when to use Bitmap
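Since each of your keys only takes a couple hundred distinct values, the bitmap idea is easy to sketch; the following is purely illustrative (the class and its members are made up, not a library component):

```cpp
#include <cstddef>
#include <map>
#include <vector>

// One bitmap per distinct key value: bit i is set when row i has that value.
class BitmapIndex {
public:
    void insert(int key_value, std::size_t row) {
        std::vector<unsigned long long>& bits = bitmaps_[key_value];
        if (bits.size() <= row / 64) bits.resize(row / 64 + 1, 0);
        bits[row / 64] |= 1ULL << (row % 64);
    }

    // Rows carrying this key value, or a null pointer if the value was never inserted.
    const std::vector<unsigned long long>* rows(int key_value) const {
        std::map<int, std::vector<unsigned long long> >::const_iterator it =
            bitmaps_.find(key_value);
        return it == bitmaps_.end() ? 0 : &it->second;
    }

private:
    std::map<int, std::vector<unsigned long long> > bitmaps_;
};

// A conjunctive filter such as "k1 == a && k4 == b" then becomes a bitwise AND
// of two bitmaps, with no per-row key comparisons.
```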

DumbCoder
0

It depends on the algorithm accessing it, honestly. If this structure needs to be resident, and you can afford the memory consumption, then just do it. multi_index is fine, though it will destroy your compile times if it's in a header.

If you just need a one-time traversal, then building the structure would be kind of a waste. Something like next_permutation may be a good place to start.
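For illustration, a tiny sketch of what next_permutation buys you here, namely every order in which the four keys could be visited (purely a sketch):

```cpp
#include <algorithm>
#include <cstdio>

int main() {
    int key_order[4] = {0, 1, 2, 3};   // indices of the four keys
    do {
        // 4! = 24 visiting orders, generated lazily in lexicographic order
        std::printf("%d %d %d %d\n",
                    key_order[0], key_order[1], key_order[2], key_order[3]);
    } while (std::next_permutation(key_order, key_order + 4));
    return 0;
}
```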

Tom Kerr
  • Thanks for your input. To clarify: – Sam Jul 22 '11 at 16:12
  • I am writing the software that accesses the data as well, so any combination of structure and algorithm can be considered. There will be many transversal extractions of subsets. Compile time is not a problem once in production. I think next_permutation could help present the keys in whatever order. However, I'm trying to be fast, which means, I think, that I should iterate through valid data as much as possible (not retrieve (key1, key2, key3, key4=000) 100^3 times for every key1/key2/key3 value), i.e. use key4 only as the key in this case. – Sam Jul 22 '11 at 16:20