What's the most efficient way to represent red-black trees?

Question

Okasaki uses (essentially)

data Color = R | B
data RB a = L | T {-# UNPACK #-}!Color !(RB a) !a !(RB a)

I know that in C, the color is typically handled in a more fiddly way to save space, doing something like making the low bit of a pointer represent color (I think usually the pointer to a node encodes its color, but it would also be possible to mimic Okasaki's structure by making the left or right pointer from a node represent its color).

Obviously, such bit-fiddling is impossible in Haskell. How, then, can the nodes be represented most efficiently in Haskell?

data RB' a = L | B !(RB' a) !a !(RB' a) | R !(RB' a) !a !(RB' a)

seems likely to be reasonably memory efficient, but it also seems likely to make pattern matching rather verbose.
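One way the verbosity might be contained is with a bidirectional pattern synonym that presents the two node constructors as a single Okasaki-style `T` pattern. The following is only a sketch, not something from the question: the names `Leaf`, `BNode`, `RNode`, `viewNode`, and `T` are made up for illustration, it assumes a GHC new enough for `PatternSynonyms` and the `COMPLETE` pragma (8.2 or later), and whether the view compiles away without extra allocation has not been measured here.

```haskell
{-# LANGUAGE PatternSynonyms, ViewPatterns #-}

-- Colour exists only as a view here; no node stores it in a field.
data Color = Red | Black deriving (Eq, Show)

-- The two-constructor layout, with hypothetical constructor names.
data RB a
  = Leaf
  | BNode !(RB a) !a !(RB a)
  | RNode !(RB a) !a !(RB a)

-- Expose either kind of node as (colour, left, value, right).
viewNode :: RB a -> Maybe (Color, RB a, a, RB a)
viewNode Leaf          = Nothing
viewNode (BNode l x r) = Just (Black, l, x, r)
viewNode (RNode l x r) = Just (Red, l, x, r)

-- Bidirectional synonym: matches via viewNode, builds the right constructor.
pattern T :: Color -> RB a -> a -> RB a -> RB a
pattern T c l x r <- (viewNode -> Just (c, l, x, r))
  where
    T c l x r = case c of
      Black -> BNode l x r
      Red   -> RNode l x r

{-# COMPLETE Leaf, T #-}

-- Okasaki's balance, written against the synonym.
balance :: Color -> RB a -> a -> RB a -> RB a
balance Black (T Red (T Red a x b) y c) z d = T Red (T Black a x b) y (T Black c z d)
balance Black (T Red a x (T Red b y c)) z d = T Red (T Black a x b) y (T Black c z d)
balance Black a x (T Red (T Red b y c) z d) = T Red (T Black a x b) y (T Black c z d)
balance Black a x (T Red b y (T Red c z d)) = T Red (T Black a x b) y (T Black c z d)
balance c l x r = T c l x r

insert :: Ord a => a -> RB a -> RB a
insert x s = blacken (ins s)
  where
    ins Leaf = T Red Leaf x Leaf
    ins t@(T c l y r)
      | x < y     = balance c (ins l) y r
      | x > y     = balance c l y (ins r)
      | otherwise = t
    blacken Leaf        = Leaf
    blacken (T _ l y r) = T Black l y r

-- In-order traversal, used only to inspect the result.
toList :: RB a -> [a]
toList Leaf        = []
toList (T _ l x r) = toList l ++ [x] ++ toList r
```

With this in place, `foldr insert Leaf xs` builds a tree using the compact representation while the balancing code reads like the single-constructor version.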

This is a bit old but I think it still applies for GHC: http://stackoverflow.com/a/3256825/482696 — Danny Navarro, Feb 27 '14 at 18:49
One very simple idea might be to get rid of the `Color` type and use different constructors for red and black nodes. Another, more advanced idea is to [use the type system](https://gist.github.com/michaelt/2660297) ([see also here](http://blog.piechotka.com.pl/2013/04/10/statically-typed-red-black-trees/)). This requires lots of GHC extensions, and I don't know how well the compiler optimizes these, but it's worth looking at—there's a chance that there's a way to optimize this type of implementation by recording the color only on *some* nodes of the tree, close to the top. — Luis Casillas, Feb 27 '14 at 23:28
Why Red-Black trees at all? They are not really one of the faster BBSTs. You might be interested in the [B-tree representation](http://en.wikipedia.org/wiki/Red%E2%80%93black_tree#Analogy_to_B-trees_of_order_4) of RB trees. You might also want to look at different layouts such as splay trees, which don't need coloring. Treaps are also cool and very fast, but they need randomization, so you'd have to run the insert operations in a state monad. — Niklas B., Feb 28 '14 at 18:28

score 2 · Answer 1 · answered Feb 28 '14 at 18:23

Only fields whose type has a single constructor can be unpacked, and there is no way to have a "generic unpack" for polymorphic fields. Your single-type construction of a tree, below, will actually be stored using pointer tagging. It has 3 constructors, one of which is the empty leaf, which has no fields and so requires no dereference. As an aside, there seems to be an opportunity for GHC to optimize here, but I don't think it does. Theoretically data Foo = A | B | C | ... Z could be represented as 26 distinct, reserved pointer values. I digress, however.

data RB' a = L | B !(RB' a) !a !(RB' a) | R !(RB' a) !a !(RB' a)

The above type will be represented as a tagged pointer, and pattern matching will be very efficient. I think this is what you were referring to when you mentioned memory. If you know the type of a, you could use associated data types (data families) to write more efficient constructors. A wonderful resource on this is Don Stewart's article "Self-optimizing data structures: using types to make lists faster".

Data families would allow you to express something akin to this:

-- Requires {-# LANGUAGE TypeFamilies #-}
class AdaptRedBlackTree a where
  data RBTree a

  empty :: RBTree a
  insert :: a -> RBTree a -> RBTree a
  ...

instance AdaptRedBlackTree Int where
  -- UNPACK only applies to strict fields, hence the bang on Int.
  data RBTree Int = RBEmptyInt
                  | LInt (RBTree Int)
                         {-# UNPACK #-} !Int
                         (RBTree Int)
                  | RInt (RBTree Int)
                         {-# UNPACK #-} !Int
                         (RBTree Int)

Unfortunately, further unpacking will be difficult, but at least you can avoid dereferences on the Int values.
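To make the idea concrete, here is a minimal sketch of what a complete `Int` instance might look like. It is an illustration rather than a tuned implementation: the constructor names `EmptyInt` and `BInt`, the helpers `balanceL` and `balanceR`, and the `toAscList` method are all assumptions added for the example, and the balancing is Okasaki's insertion scheme split into a left and a right case. Compile with optimization (`-O`), since GHC ignores `UNPACK` pragmas otherwise.

```haskell
{-# LANGUAGE TypeFamilies #-}

class AdaptRedBlackTree a where
  data RBTree a
  empty     :: RBTree a
  insert    :: a -> RBTree a -> RBTree a
  toAscList :: RBTree a -> [a]

instance AdaptRedBlackTree Int where
  -- Colour lives in the constructor tag; the Int is stored inline.
  data RBTree Int
    = EmptyInt
    | BInt !(RBTree Int) {-# UNPACK #-} !Int !(RBTree Int)
    | RInt !(RBTree Int) {-# UNPACK #-} !Int !(RBTree Int)

  empty = EmptyInt

  insert x s = blacken (ins s)
    where
      ins EmptyInt = RInt EmptyInt x EmptyInt
      -- Only black nodes need rebalancing after a child changes.
      ins t@(BInt l y r)
        | x < y     = balanceL (ins l) y r
        | x > y     = balanceR l y (ins r)
        | otherwise = t
      ins t@(RInt l y r)
        | x < y     = RInt (ins l) y r
        | x > y     = RInt l y (ins r)
        | otherwise = t
      -- The root is always repainted black.
      blacken (RInt l y r) = BInt l y r
      blacken t            = t

  toAscList EmptyInt     = []
  toAscList (BInt l x r) = toAscList l ++ [x] ++ toAscList r
  toAscList (RInt l x r) = toAscList l ++ [x] ++ toAscList r

-- Repair a red-red violation in the left subtree of a black node.
balanceL :: RBTree Int -> Int -> RBTree Int -> RBTree Int
balanceL (RInt (RInt a x b) y c) z d = RInt (BInt a x b) y (BInt c z d)
balanceL (RInt a x (RInt b y c)) z d = RInt (BInt a x b) y (BInt c z d)
balanceL l x r = BInt l x r

-- Repair a red-red violation in the right subtree of a black node.
balanceR :: RBTree Int -> Int -> RBTree Int -> RBTree Int
balanceR a x (RInt (RInt b y c) z d) = RInt (BInt a x b) y (BInt c z d)
balanceR a x (RInt b y (RInt c z d)) = RInt (BInt a x b) y (BInt c z d)
balanceR l x r = BInt l x r
```

The cost of this design is that every element type needs its own instance and its own copy of the balancing code, which is the price of specializing the representation per type.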

I wasn't trying to unpack anything but the color field. That should be okay, right? — dfeuer, Feb 28 '14 at 19:54
I am not entirely sure, but I believe that if you unpack the color field as in your first example, Okasaki's style, you won't get the advantage of the bit fiddling that you refer to. The pointer to your data type will be tagged either `L` or `T`, and the `T` constructor will contain what is essentially a boolean argument. The Stack Overflow answer to [Can GHC unpack enumerations on strict fields](http://stackoverflow.com/questions/12873894/can-ghc-unpack-enumerations-on-strict-data-fields) corroborates this. — Aaron Friel, Mar 01 '14 at 01:30
