I've been able to find details on several self-balancing BSTs from a number of sources, but I haven't found any good description of which one is best to use in which situation (or whether the choice really matters at all).

I want a BST that is optimal for storing in excess of ten million nodes. The order of insertion of the nodes is basically random, and I will never need to delete nodes, so insertion time is the only thing that would need to be optimized.

I intend to use it to store previously visited game states in a puzzle game, so that I can quickly check if a previous configuration has already been encountered.

– Drakes, Jonesinator

4 Answers

Red-black trees are better than AVL trees for insertion-heavy applications. If you foresee relatively uniform look-ups, then red-black is the way to go. If you foresee a skewed look-up pattern, where more recently viewed elements are more likely to be viewed again, you want to use splay trees.
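In C++, for instance, std::set is typically implemented as a red-black tree, so a minimal sketch of the visited-state check could look like the following; the 64-bit StateKey encoding and the recordState helper are illustrative assumptions, not anything from this answer.

    #include <cstdint>
    #include <iostream>
    #include <set>

    // Hypothetical: assume each game state can be packed into a 64-bit key.
    using StateKey = std::uint64_t;

    // std::set is a red-black tree in the major standard library
    // implementations, giving O(log n) insertion and lookup.
    std::set<StateKey> visited;

    // Returns true if the state is new, false if it was seen before.
    bool recordState(StateKey key) {
        // insert() reports whether the key was actually added.
        return visited.insert(key).second;
    }

    int main() {
        std::cout << recordState(42) << '\n';  // 1: first time seen
        std::cout << recordState(42) << '\n';  // 0: already visited
    }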

– Andrew Barber, Louis Brandy

Why use a BST at all? From your description, a dictionary will work just as well, if not better.

The only reason for using a BST would be if you wanted to list out the contents of the container in key order. It certainly doesn't sound like you want to do that, in which case go for the hash table. O(1) insertion and search, no worries about deletion, what could be better?
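As a minimal sketch of that approach in C++ (the 3x3 board and the serializeState function are made-up placeholders for whatever compact encoding the puzzle actually uses):

    #include <iostream>
    #include <string>
    #include <unordered_set>

    // Placeholder: in practice this would be a compact encoding of the
    // puzzle's board configuration.
    std::string serializeState(const int board[3][3]) {
        std::string s;
        for (int r = 0; r < 3; ++r)
            for (int c = 0; c < 3; ++c)
                s += static_cast<char>('0' + board[r][c]);
        return s;
    }

    int main() {
        // std::unordered_set is a hash table: expected O(1) insert and lookup.
        std::unordered_set<std::string> visited;

        const int start[3][3] = {{1, 2, 3}, {4, 5, 6}, {7, 8, 0}};
        const std::string key = serializeState(start);

        if (visited.insert(key).second)
            std::cout << "new state recorded\n";
        else
            std::cout << "state already seen\n";
    }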

– CloudyMarble, jmbucknall

The two self-balancing BSTs I'm most familiar with are red-black and AVL, so I can't say for certain if any other solutions are better, but as I recall, red-black has faster insertion and slower retrieval compared to AVL.

So if insertion is a higher priority than retrieval, red-black may be a better solution.
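Neither AVL nor red-black trees are exposed by name in most standard libraries, so a direct head-to-head comparison needs a third-party AVL implementation. As a rough sketch of how one might at least weigh insertion against retrieval at the question's scale, the following times both operations on std::set (typically a red-black tree); the key count, fixed seed, and random 64-bit keys are illustrative assumptions.

    #include <chrono>
    #include <cstdint>
    #include <iostream>
    #include <random>
    #include <set>
    #include <vector>

    int main() {
        // Scale chosen to match the "in excess of ten million nodes" in the question.
        const std::size_t kCount = 10'000'000;

        std::mt19937_64 rng(42);  // fixed seed for repeatability
        std::vector<std::uint64_t> keys(kCount);
        for (auto& k : keys) k = rng();

        std::set<std::uint64_t> tree;  // typically a red-black tree

        auto t0 = std::chrono::steady_clock::now();
        for (auto k : keys) tree.insert(k);
        auto t1 = std::chrono::steady_clock::now();

        std::size_t found = 0;
        for (auto k : keys) found += tree.count(k);
        auto t2 = std::chrono::steady_clock::now();

        using ms = std::chrono::milliseconds;
        std::cout << "insert: " << std::chrono::duration_cast<ms>(t1 - t0).count() << " ms\n"
                  << "lookup: " << std::chrono::duration_cast<ms>(t2 - t1).count()
                  << " ms (found " << found << ")\n";
    }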

– CloudyMarble

[hash tables have] O(1) insertion and search

I think this is wrong.

First of all, if you limit the keyspace to be finite, you could store the elements in an array and do an O(1) linear scan. Or you could shufflesort the array and then do a linear scan in O(1) expected time. When stuff is finite, stuff is easily O(1).

So let's say your hash table will store any arbitrary bit string; it doesn't much matter, as long as there's an infinite set of keys, each of which is finite. Then you have to read all the bits of any query or insertion input; otherwise I insert y0 into an empty hash table and query y1, where y0 and y1 differ at a single bit position which you don't look at.

But let's say the key lengths are not a parameter. If your insertion and search take O(1) time, then in particular hashing takes O(1) time, which means that you only look at a bounded amount of output from the hash function (of which there's likely to be only a finite amount anyway, granted).

So with finitely many buckets, there must be an infinite set of strings which all have the same hash value. Suppose I insert a lot of those, i.e. ω(1) of them, and then start querying: your hash table now has to fall back on some other O(1) insertion/search mechanism to answer my queries. Which one, and why not just use that directly?
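A small illustration of the pigeonhole step: the BoundedHash below is deliberately contrived (only 16 possible outputs, standing in for "an O(1) hash can only distinguish a bounded number of values"; it is not how std::hash behaves). Once many keys share a hash value, they pile into one bucket and queries there fall back on a linear scan of that bucket.

    #include <cstddef>
    #include <iostream>
    #include <string>
    #include <unordered_set>

    // Contrived hash with only 16 possible outputs.
    struct BoundedHash {
        std::size_t operator()(const std::string& s) const {
            return s.empty() ? 0 : static_cast<std::size_t>(s[0]) % 16;
        }
    };

    int main() {
        std::unordered_set<std::string, BoundedHash> table;

        // Every key starts with 'a', so every key gets the same hash value
        // and lands in the same bucket.
        for (int i = 0; i < 100000; ++i)
            table.insert("a" + std::to_string(i));

        // All 100000 entries share one bucket; a lookup there is a linear
        // scan, i.e. the "some other mechanism" the answer is pointing at.
        std::cout << "bucket size for \"a0\": "
                  << table.bucket_size(table.bucket("a0")) << '\n';
    }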

– Jonas Kölker
  • This one is conventional wisdom. Best case, O(1), obviously implementations will vary. There are a variety of different hash table algorithms as well. – ApplePieIsGood Apr 22 '09 at 23:22
  • "This one is conventional wisdom." -- I've heard it many times, but I still haven't seen a proof. I think it would be good to challenge this piece of folklore if you want the theoretical result "it's O(1)", or measure various lookup structures if you want "fast in practice". "Best case, O(1)" -- unbalanced search trees have that as well, yet no one argues that they have "O(1) insertion and search". – Jonas Kölker May 04 '09 at 07:38
  • A best-case unbalanced search tree will be one node away from balanced. Best-case insertion/lookup is still log(n). – µBio Nov 20 '09 at 00:55
  • In the best case, the user is searching for the value stored at the root node, which takes O(1) time to access... – Jonas Kölker Nov 21 '09 at 04:51
  • @MeNoMore Jonas *correctly* used a quote format for the first line of his answer, because it was a *quote* of someone else. Do not make edits like this in the future. – Andrew Barber Feb 07 '13 at 17:27
  • @AndrewBarber Next time you have a request you use the word please or contact a moderator. Now to the issue: while you're right about this, I was concerned by the square brackets in "[hash tables have]"; do you see any reason for that? This term isn't found anywhere else on the page. – CloudyMarble Feb 08 '13 at 06:40
  • @menomore The brackets represent paraphrased content, done for ease of understanding. It's a very common usage. The paraphrase and quote come from boyetboy's answer. – Andrew Barber Feb 08 '13 at 06:53
  • @AndrewBarber: "Next time you have a request you use the word please or contact a moderator." // next time you correct someone, would you please have the decency to be a good role model for the behavior you want from others? – Jonas Kölker Feb 08 '13 at 18:26
  • @JonasKölker My apologies; you are correct. Thank you for the reminder :) – Andrew Barber Feb 08 '13 at 19:39