10

It appears to me that one way of storing data in a B-tree as a file can be done efficiently with C using binary file with a sequence (array) of structs, with each struct representing a node. One can thus connect the individual nodes with approach that will be similar to creating linked lists using arrays. But then the problem that props up would be deletion of a node, as erasing only a few bytes in the middle in a huge file is not possible.

One way of deleting could be to keep track of 'empty' nodes until a threshold cutoff is reached and then make another file that will discard the empty nodes. But this is tedious.

Is there a better approach from the simplicity/efficiency point of view for deleting, or even representing a B-tree in a file?

TIA, -Sviiya

user203405
  • 201
  • 1
  • 3
  • 9

3 Answers3

5

For implementing B-Trees in a file, you can use the file offset instead of pointers. Also, you can implement a "file memory manager", so that you can re-use deleted items in the file.

In order to fully recover the deleted blocks in a B-Tree file, you will have to recreate the B-Tree in a new file. Also remember the most OSes have no methods for truncating files. A portable method for truncating a file is to write a new file and destroy the old.

Another suggestion is to partition the file into B-Tree partition and data (item) partition. A B-Tree partition will contain the pages. The leaf pages will contain offsets to the data items. The data partition will be a section in the file containing data items. You may end up creating more than one of each partition and the partitions may be interleaved.

I spent much time playing with a file based B-Tree, until I gave up and decided to let a database program (or server) handle the data for me.

Thomas Matthews
  • 56,849
  • 17
  • 98
  • 154
  • Sounds interesting. This exercise of mine is to get some exposure to low level coding. I am mainly interested in Linux based systems and it supports file truncation. :) – user203405 Feb 04 '10 at 06:06
  • 2
    Most OS's *do* have functions to truncate files. In Linux, BSDs, Windows you can set the file length to whatever you like. – Zan Lynx Oct 28 '11 at 17:42
3

I did a very quick search and dug up this: http://people.csail.mit.edu/jaffer/WB C source: http://cvs.savannah.gnu.org/viewvc/wb/wb/c/ - it seems to offer disk-based B-tree style databases - although taking a look at "delete.c" it seemed to imply if you delete a node everything down from it would be taken out - if that's the correct behaviour then it sounds like something that might help?

Also - B-trees are often used in filesystems - could you not take a look at some filesystem code?

My own inclination is that of a file-system - if you have a B-tree of fixed-size, whenever you "delete" a node rather than attempting to remove the reference, just set the value to whatever means nothing in your code. Then, have a clean-up thread running that checks if anyone has the file open for reading and if all's quiet blocks the file and tidies up.

  • Thanks for the reference, Ninefingers. :) Will certainly have to read it. Because deletion can be frequent, accounting for them should be efficient. I would expect some of these operations could perhaps be delayed, but I would need to read the code to see if there is a better option. I also intend to use it for a filesystem later, but then the implementation would be different as the size would be constant. So the design will need to take that into account. – user203405 Feb 02 '10 at 19:38
  • Hmm I agree. That code claims to do what you need and a cursory glance at viewcvs suggests it might - without sitting down and rebuilding your problem though it's hard to tell... I think filesystems do just "zero" elements they wish to delete and assign to any zero element but I could have that wrong. Either way, if this doesn't answer it please do open up the question again! –  Feb 02 '10 at 19:58
  • The questions does answer what I was looking for, and I already found out about file truncating and hence the problem of deleting data from the middle is circumvented. Thanks. :) – user203405 Feb 04 '10 at 05:46
1

You can use Berkley DB as well. It works well with C programs and implements B+ tree.

Jack
  • 1,398
  • 3
  • 16
  • 26
  • Yep, but I want to write my own code to get the real feel. :) – user203405 Feb 04 '10 at 06:05
  • Agree. Writing on your own is fine to get the real feel. BBD is very sophisticated database and provides many features that normal code wouldn't. In case of actual product deployment, I would choose BDB. Reinventing wheel would be difficult here. – Jack Feb 04 '10 at 21:28