
I'm studying the best data structures for implementing a simple open-source temporal object database, and currently I'm very fond of using persistent red-black trees to do it.

My main reason for using persistent data structures is, first of all, to minimize the use of locks, so the database can be as parallel as possible. It will also be easier to implement ACID transactions, and even to abstract the database to work in parallel on a cluster of some kind. The great thing about this approach is that it makes implementing a temporal database almost free. And that is something quite nice to have, especially for the web and for data analysis (e.g. trends).
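
To make the lock-free-reader point concrete, here is a minimal path-copying sketch of what I mean (Python; the `Node` class and names are just illustrative, not a real implementation): an insert copies only the nodes along the search path, so anyone holding an older root keeps seeing a consistent, immutable tree, and every old root doubles as a temporal snapshot.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)          # immutable nodes: old versions stay valid forever
class Node:
    key: int
    value: object
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def insert(root: Optional[Node], key: int, value: object) -> Node:
    """Return a NEW root, copying only the nodes on the search path (path copying)."""
    if root is None:
        return Node(key, value)
    if key < root.key:
        return Node(root.key, root.value, insert(root.left, key, value), root.right)
    if key > root.key:
        return Node(root.key, root.value, root.left, insert(root.right, key, value))
    return Node(key, value, root.left, root.right)   # update visible in the new version only

v1 = insert(None, 10, "a")
v2 = insert(v1, 5, "b")          # v1 is untouched: a temporal snapshot for free
assert v1.left is None and v2.left.key == 5
```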

All of this is very cool, but I'm a little suspicious about the overall performance of a persistent data structure on disk. Even though there are some very fast disks available today, and all writes can be done asynchronously so that a response is always immediate, I don't want to build the whole application on a false premise, only to realize it isn't really a good way to do it.

Here's my line of thought:

- Since all writes are done asynchronously, and a persistent data structure never invalidates the previous (still valid) version, write time isn't really a bottleneck.
- There is some literature on structures like this designed specifically for disk use, but it seems to me that those techniques add read overhead to achieve faster writes, and I think exactly the opposite trade-off is preferable. Also, many of those techniques end up with multi-versioned trees that aren't strictly immutable, and immutability is crucial to justify the overhead of persistence.
- I know there will still have to be some kind of locking when appending values to the database, and I also know there should be good garbage-collection logic if not all versions are to be maintained (otherwise the file size will surely grow dramatically). A delta-compression scheme could also be considered.
- Of all the search-tree structures, I really think red-black trees are the closest to what I need, since they require the fewest rotations (see the rebalancing sketch below).
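
On the rotation point, here is a hedged sketch of the kind of rebalancing I have in mind, following Okasaki's classic four-case `balance` for purely functional red-black trees (node and function names are mine; values omitted for brevity). Each "rotation" is just the allocation of three fresh nodes rather than an in-place mutation, which is why the number of rotations directly drives the write amplification of a persistent tree:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class RBNode:
    color: str                        # "R" (red) or "B" (black)
    key: int
    left: Optional["RBNode"] = None
    right: Optional["RBNode"] = None

def balance(color, key, l, r):
    """Okasaki's four red-red cases; each rebuilds exactly three nodes."""
    if color == "B":
        if l and l.color == "R" and l.left and l.left.color == "R":
            return RBNode("R", l.key,
                          RBNode("B", l.left.key, l.left.left, l.left.right),
                          RBNode("B", key, l.right, r))
        if l and l.color == "R" and l.right and l.right.color == "R":
            return RBNode("R", l.right.key,
                          RBNode("B", l.key, l.left, l.right.left),
                          RBNode("B", key, l.right.right, r))
        if r and r.color == "R" and r.left and r.left.color == "R":
            return RBNode("R", r.left.key,
                          RBNode("B", key, l, r.left.left),
                          RBNode("B", r.key, r.left.right, r.right))
        if r and r.color == "R" and r.right and r.right.color == "R":
            return RBNode("R", r.key,
                          RBNode("B", key, l, r.left),
                          RBNode("B", r.right.key, r.right.left, r.right.right))
    return RBNode(color, key, l, r)

def rb_insert(tree, key):
    def ins(n):
        if n is None:
            return RBNode("R", key)               # new nodes start red
        if key < n.key:
            return balance(n.color, n.key, ins(n.left), n.right)
        if key > n.key:
            return balance(n.color, n.key, n.left, ins(n.right))
        return n                                  # key already present
    root = ins(tree)
    return RBNode("B", root.key, root.left, root.right)  # root is always black
```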

But there are some possible pitfalls along the way:

- Asynchronous writes *could* affect applications that need the data in real time, but I don't think that is the case with most web applications. When real-time data is needed, other solutions could be devised, like a check-in/check-out system for the specific data that has to be worked on in a more real-time manner.
- They could also lead to commit conflicts, though I fail to think of a good example of when that could happen. Then again, commit conflicts can occur in a normal RDBMS too, if two threads are working with the same data, right? (See the commit sketch below.)
- The overhead of an immutable interface like this will grow exponentially, everything is doomed to fail soon, and this whole thing is a bad idea.
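
To sketch where such a conflict would surface (the `Database` class and its commit loop are hypothetical, just my mental model): two writers derive new versions from the same snapshot, and the loser of the root-pointer swap replays its pure transform against the fresh root, much like the optimistic retry an RDBMS performs on a write-write conflict.

```python
import threading

class Database:
    def __init__(self):
        self._root = None
        self._lock = threading.Lock()   # guards nothing but the root-pointer swap

    def commit(self, transform):
        """transform is a pure function: old tree -> new tree."""
        while True:
            snapshot = self._root            # lock-free read of a consistent snapshot
            new_root = transform(snapshot)
            with self._lock:                 # held only for the compare-and-swap
                if self._root is snapshot:
                    self._root = new_root
                    return new_root
            # another writer committed first: replay against the new root

db = Database()
db.commit(lambda old: ("new version", old))  # any pure old-version -> new-version function
```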

Any thoughts?

Thanks!

edit: There seems to be a misunderstanding of what a persistent data structure is: http://en.wikipedia.org/wiki/Persistent_data_structure

Waneck
  • can you explain why "My main reasons for using persistent data structures is first of all to minimize the use of locks"? Persistent or not, you still need locks... – Mitch Wheat May 05 '10 at 04:12
  • Well, you are right. There's still a need for locks, but it's minimized to an absolute minimum. For example, in my case, the only place we need locks is on "weak" references, like the head of the red-black tree. After appending all the tree changes to the file, we have to lock only to change the pointer (just an int) to the head of the tree. There is no possibility of an unknowing reader catching the tree in an inconsistent state, and the lock should be really fast. As for writing, the only time a lock is needed is to change the size of the file (appending data to it). – Waneck May 05 '10 at 12:21
  • I've just realized that a purely functional approach, like the one suggested in Okasaki's book, isn't a good choice: not only is the space efficiency very bad, it also makes it harder to query for a time period, since one has to check what changed from one version to the next. – Waneck May 05 '10 at 13:21
  • Maybe keeping a log of commits, like git, could solve this, though – Waneck May 05 '10 at 14:19
  • uncle brad: If you don't know what a persistent data structure is ( http://en.wikipedia.org/wiki/Persistent_data_structure ), why don't you go look it up first? – Waneck May 05 '10 at 22:56
  • You may want to check out [this](http://rethinkdb.com/) (a functional, append-only database) – BlueRaja - Danny Pflughoeft May 10 '10 at 21:37
  • @Waneck, I've written [this](http://sourceforge.net/projects/aodbm/) that you may be interested in. – dan_waterworth Feb 22 '11 at 20:23

4 Answers


If you find you are getting bottlenecked on write time, or that your durability guarantee is meaningless without synchronous writes (hmm...), you should do what most other databases do: implement a write-ahead log (WAL), also known as a redo log.

Disks are actually pretty darn good at writing sequentially; in fact, that's what they're best at. It's random writes (such as those scattered through a tree) that are terribly slow. Even flash drives, which beat the hell out of disks for random writes, are still significantly better at sequential writes. Even most RAM is better at sequential writes, because fewer control signals are involved.

By using a write-ahead log, you don't have to worry about:

  • Torn writes (you wrote half a tree image before the cat ate your power supply)
  • Loss of information (you didn't actually get to persisting the tree, but Joe thinks you did)
  • Huge performance hits from random, synchronous disk I/O.
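
For concreteness, a hedged sketch of such a log (the record framing, class, and file name are illustrative, not from any particular database): each record carries its length and a checksum, records are only ever appended and fsync'ed before the write is acknowledged, and recovery simply stops at the first torn or corrupt record.

```python
import os
import struct
import zlib

class WriteAheadLog:
    def __init__(self, path="db.wal"):      # file name is illustrative
        self._f = open(path, "ab")

    def append(self, payload: bytes) -> None:
        header = struct.pack("<II", len(payload), zlib.crc32(payload))
        self._f.write(header + payload)      # sequential append: the fast case
        self._f.flush()
        os.fsync(self._f.fileno())           # durable before we acknowledge

    @staticmethod
    def replay(path="db.wal"):
        """Yield intact records; a torn tail is silently discarded."""
        with open(path, "rb") as f:
            while True:
                header = f.read(8)
                if len(header) < 8:
                    return                    # clean end of log (or torn header)
                length, crc = struct.unpack("<II", header)
                payload = f.read(length)
                if len(payload) < length or zlib.crc32(payload) != crc:
                    return                    # torn/corrupt write: stop here
                yield payload
```
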
Andres Jaan Tack
  • Hey! Thanks for the tip! This is really a must-have, since memory can easily fill up. But in a case where the database is fully temporal (all changed data is recorded), it could actually become just one file! Garbage collection is, I think, one of the greatest (but necessary) slowdowns in this sense. – Waneck May 06 '10 at 13:00
  • Would you explain why using a WAL means that you don't have to worry about "Huge performance hits from random, synchronous disk I/O"? – dan_waterworth Jan 21 '11 at 09:08
  • A write-ahead log very infrequently reads/writes data randomly; it's always appended-to, which is fast on a traditional hard disk. Yes, there will be other random writes in the system, but for something that's used on basically every single record update, efficiency will matter. – Andres Jaan Tack Jan 21 '11 at 19:12
  • yes, you won't have to worry about performance hits writing to the log, but your answer could be read as not having to worry about performance hits in reading/writing at all. – dan_waterworth Jan 22 '11 at 09:01

Interesting to meet someone like-minded :-) I have actually implemented a database that uses a persistent data structure as its data model: a type of persistent B2-tree, I suppose one could call it. Append-only storage to disk, with garbage collection; not all history needs to be kept forever. One can set a finite retention period to let the database forget about early history.

See http://bergdb.com/

Frans Lundberg

My thought is that you have a great idea. Now go build the darn thing. From everything you've written, it sounds like you're suffering from an acute case of analysis paralysis.

rtperson
  • Hey! I'm really glad that you think so! I'm already coding it, but since this is the first time I'm coding a DBMS, I thought that maybe I could be taking the wrong direction somewhere! Thanks! – Waneck May 05 '10 at 17:25

I know this question is a little old, but I've been implementing almost the same thing, and what I've found is that being a binary tree means the performance is terrible, due to the sheer number of seeks. It is probably a much better idea to build a much broader persistent tree, despite the extra space overhead; see the sketch below.
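
To put rough numbers on that (illustrative only, assuming one seek per node visited and no caching): lookups cost about one seek per level, and the height shrinks with the logarithm of the fanout, so even a modest fanout buys back most of the seeks.

```python
import math

n = 100_000_000                      # records in the tree (illustrative)
for fanout in (2, 32, 256):
    height = math.ceil(math.log(n, fanout))
    print(f"fanout {fanout:3d} -> ~{height} seeks per lookup")
# fanout   2 -> ~27 seeks per lookup
# fanout  32 -> ~6 seeks per lookup
# fanout 256 -> ~4 seeks per lookup
```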

dan_waterworth
  • You are completely right. There is actually a good implementation to look at: CouchDB's immutable B-tree! But by now I've changed the direction of this project, and I've dropped the requirement of purely functional data structures on disk, as they aren't really a tight fit in this case. For lock-free structures, it's best to implement a CAS operation on a memory-mapped file. – Waneck Jan 21 '11 at 11:45
  • @Waneck, yes I had seen couchdb's b-tree (though I haven't delved into the implementation). Would you mind explaining your second comment about lock-free structures? I'm not sure I understand. – dan_waterworth Jan 22 '11 at 09:35
  • Please see http://stackoverflow.com/questions/2846190/cross-platform-and-cross-process-atomic-int-writes-on-file ! After I found out that you can do compare-and-swap operations on memory-mapped files, it seemed to me that persistent data structures aren't a very good solution for databases. Append-only means no locality, and poor disk usage, performance-wise, after all. – Waneck Jan 23 '11 at 13:58