
So I have the following problem: I need to sum up the byte size of all files in a specific directory, and this includes the size of the sub-directories themselves, as in my case they can actually increase in size.

But if we run code like the following on a directory that contains both files and sub-directories:

#include <filesystem>
#include <iostream>
#include <cstdint>

int main(void)
{
    std::uintmax_t result = 0; 
    for (const auto& path : std::filesystem::directory_iterator("."))
    {
        result += std::filesystem::file_size(path); // throws if the entry is a directory
    }

    std::cout << "Total size is: " << result << std::endl; 
    return 0;
}

Then you will get an error that you are trying to get the file size of a directory -- at least on macOS and Linux when compiling with Clang++ 10 or 11. Now, according to cppreference, whether std::filesystem::file_size works on a directory is up to the implementation. However, in my opinion this is weird, as file_size basically just "wraps" stat and therefore should work perfectly on a directory, at least on Linux, *BSD, and macOS.
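
To illustrate the point, a minimal POSIX-only snippet like this happily reports a size for a directory (note that st_size for a directory is the on-disk size of the directory entry itself, not the total size of its contents):

#include <sys/stat.h>
#include <cstdio>
#include <iostream>

int main()
{
    struct stat sb {};
    if (stat(".", &sb) == 0)
        std::cout << "st_size of '.': " << sb.st_size << '\n'; // no error for a directory
    else
        std::perror("stat");
    return 0;
}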

So can anyone enlighten me as to why this has been left to the implementation? I have access to the C++ standard and cannot find a good reason.

Lars Nielsen
  • Does `stat` on a directory actually do what you want? I think a lot of people would expect the size of a directory to be the recursive size of all files contained within that directory. I think it's fair to ask why they decided to make it fail on POSIX systems, but leaving it implementation-defined seems pretty reasonable to me. – jamesdlin May 23 '20 at 18:31
  • `this is weird as file_size basically just "wraps" stat` What if it doesn't? What if there comes a SuperNewOS that has different semantics? You design an API to be OS-agnostic, not to be tied to a specific `stat` implementation. – KamilCuk May 23 '20 at 18:32
  • There can be who-knows-what in a directory (file-mapped devices, special files such as /dev/random, etc.), and if you want to do it recursively, you can get into a loop through symlinks. It can also take a lot of time (just try `du -s /` and you will see; the result will probably be garbage anyway due to access restrictions and weird files in /dev, etc.). – n314159 May 23 '20 at 18:33
  • @n314159 That I know. I have a `du` run that's at hour 12 now XD – Lars Nielsen May 23 '20 at 19:39

1 Answer


The size of a directory can mean different things on different platforms and even different filesystems on the same platform: maybe the size of the disk allocation that holds the file list, or the number of files contained in the directory, or something else. On some platforms/filesystems there may not be a readily-accessible size that makes sense, so an error could be thrown instead.

There is no universal definition of "size of a directory" that applies everywhere, so the specification leaves it implementation-defined.
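
Because of that, portable code has to be prepared for file_size to fail on a directory. One way to handle it is the non-throwing overload, sketched here:

#include <cstdint>
#include <filesystem>
#include <iostream>
#include <system_error>

int main()
{
    std::error_code ec;
    const std::uintmax_t size = std::filesystem::file_size(".", ec);
    if (ec)
        std::cout << "file_size failed: " << ec.message() << '\n'; // e.g. "Is a directory" on Linux
    else
        std::cout << "Size: " << size << '\n';
    return 0;
}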

The proper way to determine how much disk space is used by a directory is to recursively look for files in that directory and sum their sizes (see the sketch after this list) -- but beware of:

  • Multiple hard links to the same file; you should only count one or you will over-report the used space.
  • Apparent size vs actual size; a sparse file might have an apparent size in the terabytes but only actually have a few KBs-worth of allocated extents.
  • Symlinks; will you count them for their own usage only, or the usage of the target?
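
Here is a rough, POSIX-flavoured sketch of that recursive approach. The directory_usage function is illustrative rather than canonical; it uses lstat so hard links can be de-duplicated via (device, inode) pairs, and it does not follow symlinks:

#include <cstdint>
#include <filesystem>
#include <iostream>
#include <set>
#include <system_error>
#include <utility>
#include <sys/stat.h>
#include <sys/types.h>

namespace fs = std::filesystem;

std::uintmax_t directory_usage(const fs::path& root)
{
    std::uintmax_t total = 0;
    std::set<std::pair<dev_t, ino_t>> seen; // inodes already counted

    std::error_code ec;
    fs::recursive_directory_iterator it(root, fs::directory_options::skip_permission_denied, ec);
    for (fs::recursive_directory_iterator end; it != end; it.increment(ec))
    {
        // By default the iterator does not follow directory symlinks,
        // so we cannot loop through a symlink cycle.
        const fs::file_status st = it->symlink_status(ec);
        if (ec || !fs::is_regular_file(st))
            continue; // skip directories, symlinks, devices, sockets, ...

        struct stat sb {};
        if (lstat(it->path().c_str(), &sb) != 0)
            continue; // unreadable entry; just skip it

        if (!seen.insert({sb.st_dev, sb.st_ino}).second)
            continue; // hard link to a file we already counted

        // st_size is the apparent size; use sb.st_blocks * 512 instead
        // if you want the actually-allocated (sparse-aware) size.
        total += static_cast<std::uintmax_t>(sb.st_size);
    }
    return total;
}

int main()
{
    std::cout << "Total size is: " << directory_usage(".") << '\n';
    return 0;
}
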
cdhowie