
I'm investigating the possibility of putting together a custom storage scheme for my application. It's worth the effort of potentially reinventing the wheel, I think, because performance and storage efficiency are both primary objectives, and the data and the operations on it are far simpler than everything an RDBMS provides (no updates, no deletes, a predefined set of queries).

I'm using just a small handful of web resources I've found about B-Trees and B+-Trees: Wikipedia, http://www.bluerwhite.org/btree/, http://slady.net/java/bt/view.php, and http://www.brpreiss.com/books/opus6/html/page342.html (the last one is the most valuable).

Duplicate keys

The first problem I'm trying to solve is how to deal with duplicate keys - this tree will be acting as a DB index and for example there won't just be one 'thing' with 'color=red', so looking up 'red' in this tree should yield many results.

There are two solutions I have come up with so far. The first is simply having multiple entries in the tree, one per duplicate. But when there are 100,000 or 1,000,000 'red' things in the tree... is that very efficient for a tree structure? The second was to have just one entry for each key, where the 'payload' associated with the key points to a separate block of data: a linked list pointing to all instances of items that are 'red' (a rough sketch follows below).

Is there a common / better option?
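
To make the second option concrete, here's the rough shape I have in mind - all names here are illustrative, nothing is settled:

```csharp
using System.Collections.Generic;

// Sketch of option two: one tree entry per distinct key, whose payload is a
// list of record pointers, rather than N separate tree entries.
class LeafEntry<TKey>
{
    public TKey Key;                  // e.g. "red"
    public List<long> RecordPointers; // block numbers/offsets of every matching item
}

class Demo
{
    static void Main()
    {
        var red = new LeafEntry<string>
        {
            Key = "red",
            RecordPointers = new List<long> { 17, 42, 1093 }
        };
        // One tree probe finds the entry; 1,000,000 duplicates then cost a
        // sequential walk of the pointer list (a linked list of blocks on
        // disk), not 1,000,000 extra tree entries.
    }
}
```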

B+Tree nodes changing types

I wanted to check an assumption I'm making. Say you have a B+-Tree of height 2 - the external (leaf) nodes at level 2 hold the 'actual data'. Then an insertion necessitates a split of a leaf node, and that node no longer holds 'actual data'. Am I right in thinking that, in implementation terms, because the data might be of a substantial size, you would instead store a kind of 'pointer' as the 'actual data' - so that if a leaf node becomes a branch node, that pointer (of the same size) is simply updated to point to the new subtree?

By that I mean: internal and external nodes should really be the same size, since external nodes might become internal ones, and shuffling data around isn't a good idea?

(Added the C# tag since I'm implementing this from scratch in C#.)

Kieren Johnstone
  • Yes, and the implementations I have looked at so far have not been appropriate. There is another big advantage for the app if I can very specifically control the storage mechanism and partition data in a particular way. Thanks for the comment but I would rather discuss the data structures and associated issues in this case :) – Kieren Johnstone Aug 03 '11 at 08:28
  • Sorry if this may seem off-topic, but usually reinventing a wheel ends badly (in my experience). What I am trying to say is that you may underestimate the effort needed to implement and test the storage. Why have you chosen C# instead of C/C++ (performance)? Have you considered [other](http://en.wikipedia.org/wiki/List_of_data_structures) data structures? Do you have testing facilities that will make sure your solution is stable, reliable, efficient and doesn't lose data? To me such an effort counts in years for a dedicated team. – oleksii Aug 03 '11 at 08:39
  • I know it's usually a bad idea, which is why I qualified it so fully in the question. The language is irrelevant, as a good algorithm will make a massive difference versus a few milliseconds here or there in processing overhead from using a different language. This will be scaled out across multiple machines, which would negate that anyway. Of course I have researched other data structures, and this is part of my ongoing research. Yes, there is an extensive testing framework I have started to build. No, it will not take years or a dedicated team. Any more questions? – Kieren Johnstone Aug 03 '11 at 08:43
  • You seem to have a clear understanding of what you are about to do. Will be nice to see if anyone can answer this. :) – oleksii Aug 03 '11 at 11:18
  • Here's some insight for whoever looks at 'NoSQL' and keeps voting it up - **if you don't know** the answer to the specific questions I am asking, don't assume I don't know what I am talking about and suggest the only thing you can think of. Instead have 'good faith' and try to answer the actual question. This is not a meta-question, this is a question, and I would love some answers :) – Kieren Johnstone Aug 03 '11 at 17:25
  • I suggest you try posting your question at http://cs.stackexchange.com/ - it is more about general data structures than specific implementation details. I had good results there for a (very different) general question about algorithms. – MiMo Sep 29 '12 at 00:20

4 Answers


Kieren, I'm sure you figured out by now that B+ trees grow by splitting upwards, so that a leaf node is always a leaf node, and internal nodes are always internal nodes. Eventually, you must split the root node itself, which turns it into two siblings, and you define a new root above them. So to answer the second part of your question: you don't change node types.
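
A minimal sketch of that upward growth, under an assumed in-memory node layout (not your actual structures): splitting the root creates two siblings and a brand-new root above them, so neither half ever changes type.

```csharp
using System.Collections.Generic;

class Node
{
    public bool IsLeaf;
    public List<int> Keys = new List<int>();
    public List<Node> Children = new List<Node>(); // unused when IsLeaf
}

static class RootSplit
{
    // Splits an overfull root and returns the new root. A leaf root splits
    // into two leaves, an internal root into two internals.
    public static Node Split(Node oldRoot)
    {
        int mid = oldRoot.Keys.Count / 2;
        var right = new Node { IsLeaf = oldRoot.IsLeaf };
        int separator;

        if (oldRoot.IsLeaf)
        {
            // B+ leaf split: the separator is a *copy* of the right half's
            // first key; the key itself stays in the leaf.
            right.Keys.AddRange(oldRoot.Keys.GetRange(mid, oldRoot.Keys.Count - mid));
            oldRoot.Keys.RemoveRange(mid, oldRoot.Keys.Count - mid);
            separator = right.Keys[0];
        }
        else
        {
            // Internal split: the median key moves up into the new root.
            separator = oldRoot.Keys[mid];
            right.Keys.AddRange(oldRoot.Keys.GetRange(mid + 1, oldRoot.Keys.Count - mid - 1));
            right.Children.AddRange(oldRoot.Children.GetRange(mid + 1, oldRoot.Children.Count - mid - 1));
            oldRoot.Children.RemoveRange(mid + 1, oldRoot.Children.Count - mid - 1);
            oldRoot.Keys.RemoveRange(mid, oldRoot.Keys.Count - mid);
        }

        var newRoot = new Node { IsLeaf = false };
        newRoot.Keys.Add(separator);
        newRoot.Children.Add(oldRoot);
        newRoot.Children.Add(right);
        return newRoot;
    }
}
```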

Regarding the first part of your question, when you delete a data record from the DB, you will need to find all the keys that point to that particular record, and remove them. If you have to look through long linear lists to do that, deleting will be slow. I am assuming you are using a binary search within a node in order to quickly find the correct node element (key + pointer), so if you make that "node searching" mechanism include the ability to ask for a particular key + pointer combination, you can quickly find the correct key element to remove. In other words, make the data record pointer part of the search (only when searching for a particular data record's key). This does mean that the duplicate keys will be stored in the nodes in "data pointer" order, so as long as ordering of the duplicate keys is not important, this mechanism will work.
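
For illustration, a sketch of that mechanism with assumed types (the answer doesn't prescribe a layout): entries inside a node are kept ordered by (key, record pointer), so the exact pair to delete can be found by an ordinary binary search.

```csharp
using System;
using System.Collections.Generic;

// Entries inside a node, ordered by key and then by data-record pointer,
// so duplicates have a deterministic position.
struct Entry : IComparable<Entry>
{
    public int Key;
    public long RecordPointer;

    public int CompareTo(Entry other)
    {
        int c = Key.CompareTo(other.Key);
        return c != 0 ? c : RecordPointer.CompareTo(other.RecordPointer);
    }
}

static class NodeSearch
{
    // Locates the exact key + pointer pair to remove when deleting one
    // record among many duplicates; returns a negative value if absent.
    public static int FindExact(List<Entry> entries, int key, long pointer) =>
        entries.BinarySearch(new Entry { Key = key, RecordPointer = pointer });
}
```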

Ken Kopelson

Attempting to answer my own question... I would welcome other answers too.

Duplicate Keys

The tree will store a reference to a list (in memory) or a linked list (on disk) of items with the given key, if duplicate entries for the same key are a possibility.

B+Tree nodes, changing types

In-memory, my nodes have an object reference, which can point to another node (in itself another valid B+Tree) in the case of an internal/branch node, or indeed data directly in the case of an external/leaf node. On disk, this would work in a very similar way: a 64-bit value for each 'link slot', as I have chosen to name them - either an offset in the file if pointing at a sub-node, or a block number if pointing to data directly (or the head of a linked-list in the case mentioned in the first part of the question).
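
As a sketch of the link-slot idea - the encoding below is just one possibility, not a fixed format - the node's own leaf flag decides how each 64-bit value is interpreted, so internal and leaf nodes can share one fixed on-disk layout:

```csharp
// One possible (hypothetical) layout for a fixed-size on-disk node: the
// header's leaf flag tells us how to interpret the 64-bit link slots.
class DiskNode
{
    public bool IsLeaf;      // part of the node header on disk
    public long[] LinkSlots; // one 64-bit value per slot, fixed count per node

    // When internal: the slot is the byte offset of a sub-node in the file.
    public long ChildOffset(int i) => LinkSlots[i];

    // When leaf: the slot is a data block number - or the head block of an
    // on-disk linked list when the key has duplicates.
    public long DataBlock(int i) => LinkSlots[i];
}
```

Because both interpretations occupy the same 64 bits, a node image never has to be resized or rewritten in a different shape.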

Kieren Johnstone

The main feature of B+ trees is minimizing disk seeks. If you just "store pointers" then you defeat that benefit: your code will end up chasing file pointers, seeking the disk head all over the place. You can't read a hundred bytes from disk; all reads and writes are done in aligned blocks.

Leaf parents: data are always in a leaf, and just one key per leaf is stored in the nodes that are the immediate parents of the leaves. The node lets you "peek" at a leaf's contents by looking at a copy of the first key in that leaf, right there in the node.

Node parents: similarly, the key that comes before the first key of a child node is stored in that child's parent.

The duplication of data is not bad. For example, if there are 207 records per leaf and 268 keys per node, then you store one extra key for every 207 records. If you have more than 268 leaves, then you need one more key per 207*268 records.

You seem to be confusing B-trees and B+ trees. B+ trees always keep the data in the leaves, and there is never any data in the nodes. Only a copy of the lowest key of each child is present in a node. Data never "moves up" a B+ tree; only copies of one lowest key per child are propagated upward.

The overhead grows logarithmically. The minimal duplication saves a lot of seeks.
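
As a sketch of that shape (an assumed layout for illustration, not this answer's code): internal nodes carry only copies of each child's lowest key, so a lookup is a walk down separators followed by a scan of a single leaf block.

```csharp
// Internal nodes hold one copy of the lowest key per child; all real data
// lives in the leaves. Layout and names are assumptions for illustration.
class Internal { public int[] LowestKeys; public object[] Children; } // Internal or LeafNode
class LeafNode { public int[] Keys; public long[] DataBlocks; }

static class Descent
{
    public static LeafNode FindLeaf(object node, int key)
    {
        while (node is Internal n)
        {
            // Choose the last child whose lowest key is <= the search key.
            int i = n.LowestKeys.Length - 1;
            while (i > 0 && n.LowestKeys[i] > key) i--;
            node = n.Children[i];
        }
        return (LeafNode)node; // scan this one block for the record
    }
}
```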

(Really) Duplicate keys

To handle duplicate keys in a B+ tree, as in multiple rows that have the same value, implementations typically force each key to be unique by appending an additional hidden column to the table and assigning it an auto-incrementing value when a record is created. The hidden column is appended to the end of the indexed key. The index still starts with the indexed column(s), so the ordering is as specified, while the appended hidden value guarantees uniqueness.

If the table already has a primary key, then it could just use that to enforce uniqueness instead of adding a hidden column.
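
A sketch of that trick with assumed names: pair the indexed value with an auto-incremented hidden id, so every key in the tree is unique while the primary ordering is still the indexed column.

```csharp
using System;
using System.Threading;

// The indexed column plus a hidden tie-breaker; compares by value first,
// so range scans over the visible column still work as expected.
struct IndexKey : IComparable<IndexKey>
{
    public string Value;  // the indexed column, e.g. "red"
    public long HiddenId; // appended auto-increment, never shown to the user

    public int CompareTo(IndexKey other)
    {
        int c = string.CompareOrdinal(Value, other.Value);
        return c != 0 ? c : HiddenId.CompareTo(other.HiddenId);
    }
}

static class KeyFactory
{
    private static long _next;

    // Called once when a record is created, as described above.
    public static IndexKey Create(string value) =>
        new IndexKey { Value = value, HiddenId = Interlocked.Increment(ref _next) };
}
```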

doug65536
  • Duplicate keys violate the basic invariants of either type of tree and thus kill performance (which is built on those invariants); they just aren't worth the hassle. That's why real implementations - as opposed to academic/amateur ones - either make keys unique in artificial ways, or they store lists of record pointers (e.g. primary keys) for keys in non-unique indexes, or they use other schemes like bitmaps. How would one name (refer to) a record anyway without a key to identify it? In any case, given the OP's knowledge of the subject he is reinventing, your answer is like a drop in the desert... – DarthGizka Apr 20 '16 at 20:49
  • 1
    @DarthGizka I was not referring to duplicate keys as in non-unique indexes, I was referring to B+ tree nodes containing a copy (duplication) of the lowest key in its descendent. Funny you should mention that though, I am actually working on a B+ tree implementation right now, and it does not handle duplicate keys yet, so I will be artifically making them unique with an added autoincrementing hidden column. If I were less rational, I might think you were psychic. – doug65536 Apr 22 '16 at 05:03
  • 1
    @DarthGizka Thanks though, I made an edit to address the non-unique-index part of the question. – doug65536 Apr 22 '16 at 05:15

When you deal with duplicate keys, a search always hits some leaf containing the key you searched for.

Since a leaf groups equal keys together, you just need to walk left within the leaf (position - 1) to find the first entry with the given key. If you reach the first key of the leaf, you need to inspect the previous leaves.

Since you cannot assume which leaf you will hit for a duplicated key, you need to walk back through the previous leaves until you find a leaf whose first key is not the key you are searching for. If the last key of that leaf is also not the search key (it is smaller), the run of duplicates starts in the next leaf; otherwise it starts in this leaf.

The search across leaves is linear; inside a leaf you can use binary search (log n) to find the first matching key entry.

If you can order the entries with equal keys by the data they point to, you can easily find the leaf holding a certain entry (which is great for contains and remove operations).
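
A sketch of that leftward walk, assuming doubly linked leaves (the types here are invented for illustration): the binary search lands somewhere inside the run of duplicates, and we step left - possibly across leaf boundaries - until the run's first entry.

```csharp
// Leaves with sibling links; the scan assumes (leaf, index) already points
// at some occurrence of the key.
class Leaf
{
    public int[] Keys;
    public Leaf Prev, Next;
}

static class DuplicateScan
{
    // Walks left to the very first entry equal to 'key', crossing into
    // previous leaves while their last key still matches.
    public static (Leaf, int) FindFirst(Leaf leaf, int index, int key)
    {
        while (true)
        {
            while (index > 0 && leaf.Keys[index - 1] == key) index--;

            bool runContinuesLeft =
                index == 0 && leaf.Prev != null &&
                leaf.Prev.Keys[leaf.Prev.Keys.Length - 1] == key;
            if (!runContinuesLeft) return (leaf, index);

            leaf = leaf.Prev;
            index = leaf.Keys.Length - 1;
        }
    }
}
```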


If a high proportion of duplicates is expected, it is best to look at another storage model that stores key -> list of data - especially if the data does not change often.

[Update]

When deleting, there is a chance of a search "forgetting" a key, because the separator in the parent goes stale:

Node N [L1 |3| L2], Leaf L1 [1,2,3] -> L2 [3,4,5]

If you remove the 3 in L2, you get:

Node N [L1 |3| L2], Leaf L1 [1,2,3] -> L2 [4,5]

When you now search for 3, you find that 3 is not in L2. You might then look in the previous leaf to find the 3.

Another way is to update the separator key to the actual first key of the leaf (which can propagate updates upward), resulting in:

Node N [L1 |4| L2], Leaf L1 [1,2,3] -> L2 [4,5]

Or you borrow the element from the left leaf:

Node N [L1 |3| L2], Leaf L1 [1,2] -> L2 [3,4,5]

I tend to use the first solution, since it also works for leaves in the middle of multi-leaf runs of duplicates.
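
A sketch of that first solution, reusing the assumed Leaf type from the sketch above (illustrative only): when the parent's separator still says 3 but the leaf it routes to no longer holds a 3, fall back to the previous leaf.

```csharp
using System;

static class StaleSeparatorLookup
{
    // 'expected' is the leaf that the (possibly stale) separator key in the
    // parent routed the search to.
    public static bool Contains(Leaf expected, int key)
    {
        if (Array.IndexOf(expected.Keys, key) >= 0) return true;

        // The separator was not updated on delete, so the key may still
        // live at the tail of the left sibling.
        return expected.Prev != null &&
               Array.IndexOf(expected.Prev.Keys, key) >= 0;
    }
}
```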

Martin Kersten