
I have built a B+tree index of my own, with all the operations for insert/delete/search over the index. To speed up the insertion of a huge dataset, I would like to implement bulk loading as well, so that I can experiment with large datasets.

What I have been trying to do is sort the data and start filling the pages at the leaf level. Keys are copied or pushed to the upper levels when necessary. I always keep track of the frontier of the index at each height. For example, if my index is of height 3 (root, one level of internal nodes, and the leaf level), I only keep 3 pages in memory, and once they get full, or there is no more data, I write them to disk.
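For concreteness, here is a minimal sketch of this frontier approach (illustrative and in-memory only: real code would write a node out when it is closed instead of keeping it; `Node`, `CAPACITY`, and the helper names are mine, and the keys are assumed to arrive pre-sorted):

```python
CAPACITY = 3  # max keys per node (b - 1 for branching factor b); illustrative

class Node:
    def __init__(self, is_leaf):
        self.is_leaf = is_leaf
        self.keys = []
        self.children = []  # subtrees for internal nodes

def bulk_load(sorted_keys):
    # frontier[i] = the open (rightmost) node at height i; frontier[0] is a leaf
    frontier = [Node(is_leaf=True)]

    def add_to_parent(level, separator, closed, fresh):
        """Register `fresh` as the new rightmost child at `level`, pushing
        `separator` upward; `closed` is the node that just filled up."""
        if level == len(frontier):              # tree grows one level taller
            root = Node(is_leaf=False)
            root.children.append(closed)
            frontier.append(root)
        parent = frontier[level]
        if len(parent.keys) == CAPACITY:        # parent full too: close it,
            new_parent = Node(is_leaf=False)    # open a fresh one, and push
            new_parent.children.append(fresh)   # the separator one level up
            frontier[level] = new_parent
            add_to_parent(level + 1, separator, parent, new_parent)
        else:
            parent.keys.append(separator)
            parent.children.append(fresh)

    for key in sorted_keys:
        if len(frontier[0].keys) == CAPACITY:   # leaf full: start a new leaf
            closed, fresh = frontier[0], Node(is_leaf=True)
            frontier[0] = fresh
            add_to_parent(1, key, closed, fresh)  # key = first key of new leaf
        frontier[0].keys.append(key)

    return frontier[-1]  # the topmost frontier node is the root
```

Note that with this greedy filling, the rightmost node at each level can end up below the minimum occupancy, which is exactly the problem described next.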

The problem is deciding how much data to write to each page so that the occupancy limits of all the individual nodes are maintained. These limits can be found here. I could not find any useful resource that covers the implementation of bulk loading, or a good strategy for choosing a fill ratio that guarantees the node limits.

Any ideas?

Pirooz
  • Why don't you fill all the pages until they are full? The theoretical limits don't matter in practice as long as the tree does not degenerate into pathological cases. – usr Aug 19 '12 at 20:44
  • If I keep filling each page to its full capacity, in many cases I will end up with an unbalanced B+tree. The keys need to be approximately uniformly distributed across the index, both horizontally and vertically, and that's why the theoretical limits exist. – Pirooz Aug 19 '12 at 22:08
  • The link you provided says the max capacity of any node is b, which means it is maximally full. I cannot come up with a data set where just filling the leaves until they are full leads to an unbalanced tree. Can you give an example? – usr Aug 19 '12 at 22:31
  • That's easy to come up with. With b=4, the capacity of each leaf is 3 keys. If the B+tree has 4 keys, filling the first leaf with 3 keys leaves the last leaf with only 1 key and the tree unbalanced. Also, the minimum number of keys per node cannot be less than 2. – Pirooz Aug 21 '12 at 21:17
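A quick way to see the fix for this example: spread the n keys as evenly as possible over the minimum number of leaves. A small sketch (the helper name is hypothetical):

```python
import math

def even_leaf_counts(n, cap):
    """Spread n keys over the minimum number of leaves as evenly as
    possible; with more than one leaf, each gets at least ceil(cap / 2)."""
    leaves = math.ceil(n / cap)
    base, extra = divmod(n, leaves)
    return [base + 1] * extra + [base] * (leaves - extra)

print(even_leaf_counts(4, 3))   # [2, 2] rather than the greedy [3, 1]
print(even_leaf_counts(10, 3))  # [3, 3, 2, 2]
```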

1 Answer


From the comments under the question I can tell that your concern is that the last page (or the last pages, counting the ones higher up in the tree) might not reach the minimum fill count.

As the number of such pages is bounded by the height of the tree (at most log2(n)), I suspect that the theoretical performance guarantees are unaffected.

Anyway, the occupancy guarantees you linked to are not required for correctness. They are sufficient for guaranteed bounds on running time, but they are not necessary (example: add one page with one row to the end of the B-tree and you still get the same guaranteed running times).

If you want to know how real B-trees operate, you might want to take a look at your favorite RDBMS (as a SQL Server user, I know that SQL Server happily under-runs the 50% page-fullness guarantee without practical impact). I think you'll find that theoretical concerns are treated as not very meaningful.

usr
  • That's right! If you can manage to keep your tree balanced, the 50% fill ratio is not a big concern, especially with a large branching factor. But the question was originally about how to build a balanced B+tree in the first place, using bulk loading. Well, I have coded a B+tree from scratch and have solved this problem one way: counting the number of buckets at each level and distributing my keys uniformly across those buckets. I was still curious what the standard practice is, though. – Pirooz Sep 18 '12 at 17:35
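For reference, a sketch of the bucket-counting bookkeeping described in this last comment (illustrative: `CAPACITY` and the function names are mine, not from the original code; a plan like this can then drive the actual page writes):

```python
import math

CAPACITY = 3  # max keys per node (b - 1 for branching factor b); illustrative

def spread(n, buckets):
    """Distribute n items over `buckets` as evenly as possible."""
    base, extra = divmod(n, buckets)
    return [base + 1] * extra + [base] * (buckets - extra)

def level_plan(num_keys):
    """Per-node key counts for every level, bottom-up: count the buckets
    each level needs, then spread the entries uniformly across them."""
    leaves = max(1, math.ceil(num_keys / CAPACITY))
    plan = [spread(num_keys, leaves)]
    children = leaves
    while children > 1:
        # an internal node holds up to CAPACITY + 1 (= b) children and has
        # one fewer key than it has children
        nodes = math.ceil(children / (CAPACITY + 1))
        plan.append([c - 1 for c in spread(children, nodes)])
        children = nodes
    return plan  # plan[0] = keys per leaf, plan[-1] = keys in the root

print(level_plan(4))   # [[2, 2], [1]]  -- the b = 4 example from the comments
print(level_plan(20))  # [[3, 3, 3, 3, 3, 3, 2], [3, 2], [1]]
```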