Why B-Tree for file systems?

Question

I know this is a common question, and I have read a few threads on Stack Overflow, but I still don't get it.

Here is an accepted answer from Stack Overflow:

" Disk seeks are expensive. B-Tree structure is designed specifically to avoid disk seeks as much as possible. Therefore B-Tree packs much more keys/pointers into a single node than a binary tree. This property makes the tree very flat. Usually most B-Trees are only 3 or 4 levels deep and the root node can be easily cached. This requires only 2-3 seeks to find anything in the tree. Leaves are also "packed" this way, so iterating a tree (e.g. full scan or range scan) is very efficient, because you read hundreds/thousands data-rows per single block (seek).

In binary tree of the same capacity, you'd have several tens of levels and sequential visiting every single value would require at least one seek. "

I understand that a B-Tree holds more keys per node (a higher order) than a BST, so it's definitely flatter and shallower than a BST.

But these nodes are again stored as linked lists right?

I don't understand what they mean when they say that the keys are read as a block, thereby minimising the number of I/Os.

Doesn't the same argument hold for BSTs too? Except that the links will be downwards?

Could someone please explain it to me?

XFS's on-disk structure is documented with diagrams and stuff, not just code: http://oss.sgi.com/projects/xfs/papers/xfs_filesystem_structure.pdf. It uses B+ tree for extent maps, to keep track of all the extents for an inode with many. B+ trees are used for several other things, too. — Peter Cordes, Sep 11 '15 at 01:13

score 8 · Answer 1 · answered Jul 26 '17 at 10:38

I understand that a B-Tree holds more keys per node (a higher order) than a BST, so it's definitely flatter and shallower than a BST. I don't understand what they mean when they say that the keys are read as a block, thereby minimising the number of I/Os. Doesn't the same argument hold for BSTs too? Except that the links will be downwards?

Basically, the idea behind using a B+tree in file systems is to reduce the number of disk reads. Imagine that all the blocks in a drive are stored as a sequentially allocated array. In order to search for a specific block you would have to do a linear scan and it would take O(n) every time to find a block. Right?

Now, imagine that you got smart and decided to use a BST, great! You would store all your blocks in a BST, and finding a block would take roughly O(log(n)) steps. Remember that following a branch is a disk access, which is highly expensive!

But, we can do better! The problem now is that a BST is really "tall". Because every node only has a fanout (number of children) of 2, if we had to store N objects, our tree would be on the order of log2(N) levels tall. So we would have to perform up to about log2(N) disk accesses to reach a leaf.

The idea behind the B+tree structure is to increase the fanout (number of children), which reduces the height of the tree and, thus, the number of disk accesses we have to make in order to find a leaf. Remember that every branch is a disk access. For instance, if you pack X keys in a node of a B+tree, every node will point to at most X+1 children.

Also, remember that a B+tree is structured in a way that only the leaves store the actual data. That way, you can pack more keys into each internal node, sizing the node so that it fills up exactly one disk block. The more keys you pack in a node, the more children it will point to and the shorter your tree will be, thus reducing the number of disk accesses needed to find one leaf.
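To put rough numbers on the height argument, here is a small sketch. It assumes an idealized tree in which every node is completely full, and the fanout of 256 is only an illustrative value, not one taken from any particular file system. The function name `levels_needed` is made up for this example.

```python
def levels_needed(num_keys: int, fanout: int) -> int:
    """Smallest number of levels h such that fanout**h >= num_keys.

    Idealized model: every node is full, so h branch decisions are
    enough to tell num_keys items apart.
    """
    levels = 0
    capacity = 1
    while capacity < num_keys:
        capacity *= fanout
        levels += 1
    return levels


n = 1_000_000_000
print(levels_needed(n, 2))    # binary tree: 30 levels
print(levels_needed(n, 256))  # fanout 256 (assumed): 4 levels
```

If each level costs one disk access, that is about 30 accesses against 4 for the same billion keys, and fewer still if the top levels are cached.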

But these nodes are again stored as linked lists right?

Also, in a B+tree structure, sometimes the leaves are stored in a linked list fashion. Remember that only the leaves store the actual data. That way, with the linked list idea, when you have to perform a sequential access after finding one block you would do it faster than having to traverse the tree again in order to find the next block, right? The problem is that you still have to find the first block! And for that, the B+tree is way better than the linked list.
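As a minimal in-memory sketch of the linked-leaves idea: once the tree descent has found the first leaf, a range scan can just follow the sibling links. The `Leaf` class and `range_scan` function are hypothetical names for this illustration; a real implementation would store a disk address for the next leaf rather than a Python reference.

```python
from bisect import bisect_left
from dataclasses import dataclass
from typing import Iterator, List, Optional


@dataclass
class Leaf:
    keys: List[int]                 # sorted keys stored in this leaf
    next: Optional["Leaf"] = None   # sibling link to the next leaf


def range_scan(first_leaf: Optional[Leaf], lo: int, hi: int) -> Iterator[int]:
    """Yield keys in [lo, hi], walking sibling links from first_leaf.

    first_leaf is assumed to be the leaf found by descending the tree
    for lo; no further tree traversal is needed after that.
    """
    leaf = first_leaf
    while leaf is not None:
        start = bisect_left(leaf.keys, lo)
        for key in leaf.keys[start:]:
            if key > hi:
                return
            yield key
        leaf = leaf.next


c = Leaf([7, 8, 9])
b = Leaf([4, 5, 6], c)
a = Leaf([1, 2, 3], b)
print(list(range_scan(a, 2, 7)))  # [2, 3, 4, 5, 6, 7]
```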

If all the accesses were sequential and started at the first block of the disk, an array would be better than the linked list, because in a linked list you still have to deal with the pointers. But the majority of disk accesses, according to Tanenbaum, are not sequential and are accesses to small files (4KB or less). Imagine the time it would take if you had to traverse a linked list every time to access one block of 4KB...

This article explains it way better than me and uses pictures as well: https://loveforprogramming.quora.com/Memory-locality-the-magic-of-B-Trees

_"Also, remember that a B+tree is structured in a way that only the leaves store the actual data."_ In case someone reads this (the Q itself asks for B-trees). That's not a requirement for a B-tree, *only* for B+-trees. — Amelio Vazquez-Reina, Mar 21 '20 at 22:01

user207421 · Answer 2 · 2015-09-10T22:53:46.290

6

A B-tree node is essentially an array, of pairs {key, link}, of a fixed size which is read in one chunk, typically some number of disk blocks. The links are all downwards. At the bottom layer the links point to the associated records (assuming a B+-tree, as in any practical implementation).
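A rough sketch of what "an array of {key, link} pairs of a fixed size" could look like as bytes. The layout here is an assumption made for illustration (4096-byte node, a 2-byte entry count, 64-bit keys and links), as is the convention that each entry's key is the smallest key in the subtree its link points to. Real file systems each define their own format.

```python
import struct
from bisect import bisect_right

BLOCK_SIZE = 4096                 # assumed node size in bytes
HEADER = struct.Struct("<H")      # number of entries in use
ENTRY = struct.Struct("<QQ")      # (key, link) as two unsigned 64-bit ints
MAX_ENTRIES = (BLOCK_SIZE - HEADER.size) // ENTRY.size


def pack_node(entries):
    """Serialize sorted (key, link) pairs into one fixed-size block."""
    if len(entries) > MAX_ENTRIES:
        raise ValueError("too many entries for one block")
    body = b"".join(ENTRY.pack(key, link) for key, link in entries)
    return (HEADER.pack(len(entries)) + body).ljust(BLOCK_SIZE, b"\x00")


def unpack_node(block):
    """Decode a block back into its list of (key, link) pairs."""
    (count,) = HEADER.unpack_from(block, 0)
    return [
        ENTRY.unpack_from(block, HEADER.size + i * ENTRY.size)
        for i in range(count)
    ]


def child_link(entries, key):
    """Binary-search the node: link of the last entry with entry key <= key."""
    keys = [k for k, _ in entries]
    i = bisect_right(keys, key) - 1
    if i < 0:
        raise KeyError(key)
    return entries[i][1]


node = unpack_node(pack_node([(10, 100), (20, 200), (30, 300)]))
print(MAX_ENTRIES)           # 255 entries fit in one 4096-byte node
print(child_link(node, 25))  # 200
```

One read brings in the whole array, and the search inside the node happens in memory, which is cheap compared with the disk access.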

I don't know where you got the linked list idea from.

edited Sep 10 '15 at 22:53

answered Sep 10 '15 at 22:43

user207421


I think the statement in the OP _"these **nodes** are again stored as linked lists right?"_ suggests that the confusion may come from the fact that yes, e.g. to search / fetch data in a B-Tree, one needs to follow a **sequence** of "links" (lookups) as you traverse the tree node by node. That's not a linked list per se though, and as you also correctly mentioned, individual node keys are usually stored as an array. – Amelio Vazquez-Reina Mar 21 '20 at 21:55

Gene · Answer 3 · 2015-09-10T23:34:39.043

4

Each node in a B-tree implemented in disk storage consists of a disk block (normally a handful of kilobytes) full of keys and "pointers" that are accessed as an array and not - as you said - a linked list. The block size is normally file-system dependent and chosen to use the file system's read and write operations efficiently. The pointers are not normal memory pointers, but rather disk addresses, again chosen to be easily used by the supporting file system.
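To illustrate the point about "pointers" being disk addresses, here is a minimal sketch in which a pointer is just a block number that gets turned into a byte offset. The block size of 4096 and the helper names are assumptions, and an in-memory `io.BytesIO` stands in for the partition or file that would hold the tree.

```python
import io

BLOCK_SIZE = 4096  # assumed block size in bytes


def block_offset(block_number: int) -> int:
    """Translate a block number (the on-disk 'pointer') to a byte offset."""
    return block_number * BLOCK_SIZE


def write_block(device, block_number: int, data: bytes) -> None:
    """Write exactly one block at the given block number."""
    if len(data) != BLOCK_SIZE:
        raise ValueError("data must be exactly one block")
    device.seek(block_offset(block_number))
    device.write(data)


def read_block(device, block_number: int) -> bytes:
    """Read one whole block: one seek plus one read per node visited."""
    device.seek(block_offset(block_number))
    data = device.read(BLOCK_SIZE)
    if len(data) != BLOCK_SIZE:
        raise EOFError(f"block {block_number} is past the end of the device")
    return data


# Stand-in for a disk: three blocks, each filled with its own number.
device = io.BytesIO()
for number in range(3):
    write_block(device, number, bytes([number]) * BLOCK_SIZE)

child = read_block(device, 2)  # "follow the pointer" to block 2
```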

edited Sep 10 '15 at 23:34

answered Sep 10 '15 at 22:40

Gene


You've left no room for the filenames. – user207421 Sep 10 '15 at 22:57
@EJP What filenames? Most B-tree implementations for database purposes either use a bare partition with no file structure or place the entire tree inside a single file. – Gene Sep 10 '15 at 23:12
I'm referring to your now-deleted remarks about NTFS, which is a directory system, which needs filenames as keys and disk addresses as pointers. You left out the filename when computing the number of elements in a block. – user207421 Oct 05 '16 at 22:39

score 2 · Answer 4 · answered Sep 10 '15 at 22:51


The main reason for using a B-tree is how it behaves under changes. If the structure is static, a BST is fine, although in that case a hash table is even better. For a file system, you want a structure that changes as little as possible as a whole on inserts and deletes, and in which a lookup needs as few reads as possible. B-trees have these properties.


Robert Goldwein


I wouldn't say it's the main reason. It's an inherent consequence of the higher order. – user207421 Sep 10 '15 at 22:56
You're right, I meant main reason for choosing high order B-trees for indexes in file systems (on disk, and Splay trees in memory). – Robert Goldwein Sep 10 '15 at 23:07
