
I am trying to write a C program on Linux that checks whether the current directory contains sparse files. For each file I would also like to print the number of disk blocks that are holes (gaps in the file) and the number of disk blocks that are 0-filled but still take up disk space.

So far I can open the current directory and print just the file names, using

DIR *dirp;
struct dirent *dp;
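
A minimal directory-listing loop built on those two declarations might look like the following (just a sketch; error handling is mostly omitted):

    #include <dirent.h>
    #include <stdio.h>

    int main(void)
    {
        DIR *dirp = opendir(".");
        struct dirent *dp;

        if (dirp == NULL)
            return 1;

        /* walk the current directory and print each entry's name */
        while ((dp = readdir(dirp)) != NULL)
            printf("%s\n", dp->d_name);

        closedir(dirp);
        return 0;
    }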

For the second part, detecting the sparse files, I tried to use stat(), but it doesn't seem to be working: I don't get the results I expected.

So, could anyone show me how to do the part with the sparse file?


3 Answers


If you want to look for holes in sparse files, see the manpage for lseek, specifically the bit concerning SEEK_HOLE and SEEK_DATA.
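
As a rough illustration of that approach, here is a sketch of counting hole bytes with lseek() (it assumes a kernel and filesystem that actually support SEEK_HOLE/SEEK_DATA; the helper name hole_bytes is made up for this example):

    #define _GNU_SOURCE
    #include <errno.h>
    #include <unistd.h>

    /* Return the number of bytes covered by holes in fd, or -1 on error. */
    static off_t hole_bytes(int fd)
    {
        off_t hole_total = 0, pos = 0;

        for (;;) {
            /* find the start of the next hole at or after pos
               (there is an implicit hole at the end of every file) */
            off_t hole = lseek(fd, pos, SEEK_HOLE);
            if (hole == (off_t)-1) {
                if (errno == ENXIO)
                    break;              /* pos is already at or past EOF */
                return -1;
            }

            /* find where data resumes after that hole; ENXIO means
               any remaining hole runs to the end of the file */
            off_t data = lseek(fd, hole, SEEK_DATA);
            if (data == (off_t)-1) {
                if (errno == ENXIO) {
                    off_t end = lseek(fd, 0, SEEK_END);
                    if (end != (off_t)-1)
                        hole_total += end - hole;
                    break;
                }
                return -1;
            }

            hole_total += data - hole;
            pos = data;
        }
        return hole_total;
    }

You would call it on a descriptor from open(path, O_RDONLY) and divide by 512 if you want the answer in blocks; on filesystems without real SEEK_HOLE support, lseek() typically just reports the whole file as data, so this returns 0 there.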

If you want to just know the allocated size on disk, then look at the manpage for stat(2):

       struct stat {
           ...
           off_t     st_size;    /* total size, in bytes */
           ...
           blksize_t st_blksize; /* blocksize for file system I/O */
           blkcnt_t  st_blocks;  /* number of 512B blocks allocated */
       };

st_size tells you the apparent size in bytes, and st_blocks * 512 gives you the size actually allocated on disk (note that st_blocks is counted in 512-byte units, independent of st_blksize). If you round st_size up to the next multiple of st_blksize and subtract the allocated size, that's roughly the size of the holes.
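
For example, the arithmetic might look like this (just a sketch; "name" stands for whatever path you got from readdir(), and, as discussed in the comments below, st_blocks can also count indirect blocks, so treat the result as an estimate):

    #include <sys/stat.h>

    struct stat st;

    if (stat(name, &st) == 0) {
        /* st_blocks is counted in 512-byte units, regardless of st_blksize */
        off_t allocated = (off_t)st.st_blocks * 512;

        /* round the apparent size up to a whole number of st_blksize blocks */
        off_t rounded = (st.st_size + (off_t)st.st_blksize - 1)
                        / st.st_blksize * st.st_blksize;

        /* estimated number of bytes taken up by holes */
        off_t holes = rounded > allocated ? rounded - allocated : 0;
    }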

abligh
  • Hi, I didn't understand the part with "If you round st_size up to the next multiple of st_blksize and subtract the allocated size, that's roughly the size of the holes". Could you explain that to me? –  Feb 01 '14 at 15:58
  • I mean, how do you round up the st_size to the next multiple of st_blksize? –  Feb 01 '14 at 16:08
  • Something like: `off_t sz = (st.st_size + st.st_blksize - 1) & ~((off_t)st.st_blksize - 1)`. That exploits the fact that `st_blksize` is always a power of two: a binary `&` with the one's complement of (the power of two minus one) clears the low-order bits, i.e. rounds down to a multiple of that power of two, so adding the power of two less one first makes it round up instead. – abligh Feb 01 '14 at 16:25
  • The calculation isn't that simple. You have to account for indirect blocks, which make `st_blocks` higher than expected from `st_size`, while sparseness makes it lower. –  Feb 01 '14 at 16:27
  • @WumpusQ.Wumbley I'm not sure that's right, as I am rounding up the apparent file size (i.e. including holes). What `du` does is here: http://git.savannah.gnu.org/cgit/coreutils.git/tree/src/du.c#n532 ; I can't see it doing an indirect-block calculation in that macro. However, it would be trivial to copy line 535. – abligh Feb 01 '14 at 16:53
  • `du` doesn't need to do it because it just reports `st_blocks` as is, without making any promise about its relationship to the file size. (unless you use `-b` in which case it does the opposite.) –  Feb 01 '14 at 16:56
  • @WumpusQ.Wumbley oh I see what you mean. `stat`'s `st_blocks` gives you the number of blocks including indirect blocks, so if you subtract from the file size's block count you are out by the number of indirect blocks. OK, I suggest the `SEEK_HOLE` and `SEEK_DATA` method is the easiest then. – abligh Feb 01 '14 at 16:57
  • @abligh: I tried, but I couldn't figure out how to do it with **SEEK_HOLE** and **SEEK_DATA**. Could you show me that? –  Feb 01 '14 at 17:01
  • @abligh: Also, it seems that `(st.st_blocks * st.st_blksize < st.st_size)` is not working. Any idea why? –  Feb 01 '14 at 17:03
  • @user2965601 you will have to repeatedly seek, in turn, for `SEEK_HOLE` and `SEEK_DATA`. When you use `SEEK_HOLE` you are looking for a hole (so skipping over data), so add to a count the amount you've skipped over. There is an implicit 'hole' at the end of the file, so you can stop counting when your first `lseek()` fails. – abligh Feb 01 '14 at 17:05
  • @user2965601 as @WumpusQ.Wumbley pointed out, `st.st_blocks * st.st_blksize` may (slightly) exceed `st.st_size` because `st.st_blocks` may include indirect blocks. Also, as the last block may be incomplete, you would have to round up `st.st_size` for that comparison to be true even if there were no indirect blocks, e.g. a one-character file will use one whole block, so `st.st_blocks * st.st_blksize` will be 512 with an `st_blksize` of 512, which will not be smaller than 1 (`st_size`). – abligh Feb 01 '14 at 17:08
  • @abligh: OK, I see. By rounding up `st.st_size` you mean `off_t sz = (st.st_size + st.st_blksize - 1) & ~((off_t)st.st_blksize - 1)`? Is this the rounded-up `st.st_size`? –  Feb 01 '14 at 17:20
  • @user2965601 yes, that's what I meant. – abligh Feb 01 '14 at 22:37

Check the size returned by the du utility and compare it with the "apparent size" (for example, compare the output of `du -B1 file` with `du -B1 --apparent-size file`). If you wish, you may take a look at the block-counting algorithm from du.

user3159253
  • I just looked at the du man page. That is a shell command, but I don't want to call du; in fact, I don't want to use shell commands at all. Isn't there another way to get this done? –  Feb 01 '14 at 14:58
  • It should be possible to get it done with stat(), shouldn't it? –  Feb 01 '14 at 15:05
  • Note that this is file-system dependent -- for example, on a ZFS with compression enabled, a non-sparse file could show up as using less space than its apparent size. – Alex Jun 26 '14 at 18:15

You can try the following trick with the stat() result:

if (st.st_blocks * st.st_blksize < st.st_size) {
    /* sparse file */
} else {
    /* probably not sparse */
}

I am not sure whether it identifies all sparse files, however.

Marian
  • Are you sure that this finds all sparse files? –  Feb 01 '14 at 16:06
  • @user2965601 I've edited the original code a bit. Anyway, as indicated in the answer, I am not sure if it finds all sparse files. – Marian Feb 01 '14 at 16:12
  • This answer is definitely wrong: `st_blksize` is just the preferred block size for I/O and is not the unit that `st_blocks` is counted in (`st_blocks` is in 512-byte units). – jimis Jul 13 '16 at 12:52