10

According to this paper on Facebook's Haystack:

"Because of how the NAS appliances manage directory metadata, placing thousands of files in a directory was extremely inefficient as the directory’s blockmap was too large to be cached effectively by the appliance. Consequently it was common to incur more than 10 disk operations to retrieve a single image. After reducing directory sizes to hundreds of images per directory, the resulting system would still generally incur 3 disk operations to fetch an image: one to read the directory metadata into memory, a second to load the inode into memory, and a third to read the file contents."

I had assumed the filesystem directory metadata & inode would always be cached in RAM by the OS and a file read would usually require just 1 disk IO.

Is this "multiple disk IO's to read a single file" problem outlined in that paper unique to NAS appliances, or does Linux have the same problem too?

I'm planning to run a Linux server for serving images. Any way I can minimize the number of disk IO - ideally making sure the OS caches all the directory & inode data in RAM and each file reads would only require no more than 1 disk IO?

user9517
  • Not an answer to the question, but you can always use Varnish (Facebook uses it), which keeps the files in memory. That way, if one image becomes hot (lots of requests for the same file), no disk IO will be used to serve it at all. – Darhazer Jan 26 '12 at 16:21
  • @Darhazer: Varnish wouldn't help here, as the Linux file cache (which Varnish relies on) already caches hot files in memory. Putting Varnish in front of Nginx for static file serving doesn't really add anything. My question is about when the files are too big or too numerous to be cached in memory. I'd still want to make sure at least the directory data & inodes are cached, to reduce the disk IO to just 1 per read. –  Jan 26 '12 at 16:34
  • Many filesystems store the inode inside the directory, reducing the number of requests by one, and significantly increasing the chance of a cache hit. But this isn't a programming question. – Ben Voigt Jan 26 '12 at 18:57
  • You could change the block size of the file system when creating it, for instance with `mke2fs -b 32768` to make it 32k. However, this is useful only if you don't have small files on that file system. –  Jan 26 '12 at 20:03

3 Answers

5

Linux has the same "problem". Here is a paper a student of mine published two years ago, where the effect is shown on Linux. The multiple IOs can come from several sources:

  • Directory lookup at each level of the file path. It may be necessary to read the directory's inode and one or more directory entry blocks.
  • The inode of the file itself.

Under normal IO patterns, caching is really effective, and inodes, directories, and data blocks are allocated in ways that reduce seeks. However, the normal lookup method, which all file systems share, is bad for highly randomized traffic.
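
If you want to observe this yourself, one rough way is to compare the block device's completed-read counter before and after a cold-cache file access. A sketch (assumes root, a filesystem on `sda`, and a made-up image path):

```
# flush dirty data, then drop the page cache, dentries, and inodes
sync
echo 3 > /proc/sys/vm/drop_caches

# field 1 of /sys/block/<dev>/stat is "reads completed"
before=$(awk '{print $1}' /sys/block/sda/stat)
cat /srv/images/a/b/c/img.jpg > /dev/null
after=$(awk '{print $1}' /sys/block/sda/stat)
echo "read IOs for one cold file access: $((after - before))"
```

With a deep path and cold caches you should see several reads, matching the directory-plus-inode-plus-data breakdown above.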

Here are a few ideas:

1) The filesystem-related caches help. A large cache will absorb most of the reads. However, if you want to put several disks in a machine, the Disk-to-RAM ratio limits how much is cached.
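
One trick here is to keep the metadata caches warm with a periodic scan that stats every file but never reads data blocks. A sketch (`/srv/images` is a hypothetical image root):

```
# stat() every entry under the image root; this pulls dentries and
# inodes into the kernel caches without touching any file data
find /srv/images -printf '%s\n' > /dev/null
```

Run from cron, this biases the caches toward metadata; whether it pays off depends on how much metadata you have relative to RAM.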

2) Don't use millions of small files. Aggregate them into larger files and store the filename and the offset within the file.
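
A minimal sketch of that approach, packing images into one big volume file plus a name/offset/size index (essentially what the Haystack paper describes; all paths here are made up):

```
volume=/srv/volume.bin
index=/srv/volume.idx
: > "$volume"
: > "$index"

for f in /srv/images/*.jpg; do
    offset=$(stat -c %s "$volume")   # current end of the volume file
    size=$(stat -c %s "$f")
    cat "$f" >> "$volume"
    printf '%s %s %s\n' "$(basename "$f")" "$offset" "$size" >> "$index"
done

# Serving an image is then one sequential read at a known offset, e.g.
# with GNU dd: dd if=/srv/volume.bin iflag=skip_bytes,count_bytes \
#                 bs=1M skip=OFFSET count=SIZE
```

In production you would keep the index in RAM, which is exactly the point: one cached lookup, one disk read.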

3) Place or cache the metadata on an SSD.
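
On Linux one way to approximate this is an SSD-backed block cache such as lvmcache, which tends to keep hot (metadata) blocks on the fast device. A sketch following lvmcache(7); the volume group `vg0` and the devices are assumptions, and the exact syntax varies across LVM versions:

```
# origin LV on the slow disk, cache LV on the SSD
lvcreate -n images -L 1T  vg0 /dev/sdb
lvcreate -n fast   -L 50G vg0 /dev/sdc

# attach the SSD LV as a cache for the origin LV
lvconvert --type cache --cachevol fast vg0/images
```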

4) And of course, use a filesystem that does not have a totally anarchic on-disk directory format. A readdir should take no more than linear time, and direct file access ideally only logarithmic time.
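
On ext3/ext4, for example, hashed b-tree directory lookups come from the `dir_index` feature (on by default in modern ext4; shown here with a hypothetical device):

```
# enable hashed b-tree directories, then rebuild existing ones
tune2fs -O dir_index /dev/sdb1
e2fsck -fD /dev/sdb1    # -D reindexes directories; run on an unmounted fs
```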

Keeping directories small (fewer than 1,000 entries or so) should not help much, because you would then need more directories, which also need to be cached.

dmeister
1

This depends on the filesystem you plan to use. Before reading a file's data, the system must:

  • Read the directory file.
  • Read the file's inode.
  • Read the file's data sectors.

If a folder contains a huge number of files, this puts big pressure on the cache.
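
On ext filesystems you can look at these on-disk structures directly with `debugfs` (read-only here; the device and path are made up):

```
debugfs -R 'ls -l /images'          /dev/sdb1   # the directory entries
debugfs -R 'stat /images/foo.jpg'   /dev/sdb1   # the file's inode
debugfs -R 'blocks /images/foo.jpg' /dev/sdb1   # the data block numbers
```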

  • If you are listing the I/O accesses, it might be more interesting to separate those performed by `open()` from those performed by `read()`. The page http://www.win.tue.nl/~aeb/linux/vfs/trail.html shows a nice walkthrough of the different kernel concepts involved. (Maybe it's outdated? I wouldn't be able to tell.) – adl Jan 27 '12 at 11:47
0

You probably won't be able to keep all of the directory and inode data in RAM, since you probably have more directory and inode data than RAM. You also might not want to, as that RAM might be better used for other purposes; in your image example, wouldn't you prefer to have the data of a frequently accessed image cached in RAM than the directory entry for an infrequently accessed image?

That said, I think the `vfs_cache_pressure` knob is what controls this. From the kernel documentation: "When vfs_cache_pressure=0, the kernel will never reclaim dentries and inodes due to memory pressure and this can easily lead to out-of-memory conditions."
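
For example (50 is just an illustrative value; as the quote warns, 0 risks out-of-memory):

```
# values below the default of 100 make the kernel prefer keeping
# dentry/inode caches over page cache when reclaiming memory
sysctl -w vm.vfs_cache_pressure=50

# persist across reboots
echo 'vm.vfs_cache_pressure = 50' >> /etc/sysctl.conf
```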

Samuel Edwin Ward