
I'm using the command below, via an alias, to print the sum of all file sizes by owner in a directory:

ls -l $dir | awk 'NF>3 { file[$3]+=$5 } \
END { for (i in file) { ss=file[i]; \
if (ss>=1024*1024*1024) {size=ss/1024/1024/1024; unit="G"} else \
if (ss>=1024*1024) {size=ss/1024/1024; unit="M"} else {size=ss/1024; unit="K"}; \
format="%.2f%s"; res=sprintf(format,size,unit); \
printf "%-8s %12d\t%s\n",res,file[i],i }}' | sort -k2 -nr

but it doesn't always seem to be fast.

Is it possible to get the same output in some other way, but faster?

stack0114106

6 Answers


Another Perl one, which displays total sizes sorted by user:

#!/usr/bin/perl
use warnings;
use strict;
use autodie;
use feature qw/say/;
use File::Spec;
use Fcntl qw/:mode/;

my $dir = shift;
my %users;

opendir(my $d, $dir);
while (my $file = readdir $d) {
  my $filename = File::Spec->catfile($dir, $file);
  my ($mode, $uid, $size) = (stat $filename)[2, 4, 7];
  # only count regular files (S_ISREG checks the mode bits)
  $users{$uid} += $size if S_ISREG($mode);
}
closedir $d;

# map each uid to a username (fall back to the uid) and sort by name
my @sizes = sort { $a->[0] cmp $b->[0] }
  map { [ getpwuid($_) // $_, $users{$_} ] } keys %users;
local $, = "\t";
say @$_ for @sizes;
Shawn
  • @stack0114106 It limits the size tracking to regular files - skips directories, fifos, sockets, devices, etc. Same idea as the `-f $file` in another answer, just a different way of checking. – Shawn Mar 21 '19 at 21:09

Parsing output from ls - bad idea.

How about using find instead?

  • start in directory ${dir}
    • limit to that directory level (-maxdepth 1)
    • limit to files (-type f)
    • print a line with user name and file size in bytes (-printf "%u %s\n")
  • run the results through a perl filter
    • split each line (-a)
    • add to a hash under key (field 0) the size (field 1)
    • at the end (END {...}) print out the hash contents, sorted by key, i.e. user name
$ find ${dir} -maxdepth 1 -type f -printf "%u %s\n" | \
     perl -ane '$s{$F[0]} += $F[1]; END { print "$_ $s{$_}\n" foreach (sort keys %s); }'
stefanb 263305714
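
If you want the totals ordered by size like the original ls/awk pipeline rather than by user name, one option (my addition, not part of the answer) is to drop the sort in the END block and pipe through sort instead:

$ find ${dir} -maxdepth 1 -type f -printf "%u %s\n" | \
     perl -ane '$s{$F[0]} += $F[1]; END { print "$_ $s{$_}\n" foreach (keys %s); }' | \
     sort -k2 -nr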

A solution using Perl:

#!/usr/bin/perl
use strict;
use warnings;
use autodie;

use File::Spec;

my %users;
foreach my $dir (@ARGV) {
    opendir(my $dh, $dir);

    # files in this directory
    while (my $entry = readdir($dh)) {
        my $file = File::Spec->catfile($dir, $entry);

        # only files
        if (-f $file) {
            my($uid, $size) = (stat($file))[4, 7];
            $users{$uid} += $size
        }
    }

    closedir($dh);
}

print "$_ $users{$_}\n" foreach (sort keys %users);

exit 0;

Test run:

$ perl dummy.pl .
1000 263618544

Interesting difference. The Perl solution discovers 3 more files in my test directory than the find solution. I have to ponder why that is...
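
One way to track down such a discrepancy (my suggestion, assuming a shell with process substitution such as bash) is to diff the two file lists; symlinks that resolve to regular files, for instance, pass Perl's -f test but are not matched by find's -type f:

$ cd "$dir"
$ diff <(find . -maxdepth 1 -type f -printf "%f\n" | sort) \
       <(perl -e 'print "$_\n" for grep { -f } glob ".* *"' | sort)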

Stefan Becker

Did I see some awk in the OP? Here is one in GNU awk using the filefuncs extension:

$ cat bar.awk
@load "filefuncs"
BEGIN {
    FS=":"                                     # passwd field sep
    passwd="/etc/passwd"                       # get usernames from passwd
    while ((getline < passwd)>0)
        users[$3]=$1
    close(passwd)                              # close passwd

    if(path=="")                               # set path with -v path=...
        path="."                               # default path is cwd
    pathlist[1]=path                           # path from the command line
                                               # you could have several paths
    fts(pathlist,FTS_PHYSICAL,filedata)        # don't follow links (vs. FTS_LOGICAL)
    for(p in filedata)                         # p for paths
        for(f in filedata[p])                  # f for files
            if(filedata[p][f]["stat"]["type"]=="file")      # mind files only
                size[filedata[p][f]["stat"]["uid"]]+=filedata[p][f]["stat"]["size"]
    for(i in size)
        print (users[i]?users[i]:i),size[i]    # print username if found else uid
    exit
}

Sample outputs:

$ ls -l
total 3623
drwxr-xr-x 2 james james  3690496 Mar 21 21:32 100kfiles/
-rw-r--r-- 1 root  root         4 Mar 21 18:52 bar
-rw-r--r-- 1 james james      424 Mar 21 21:33 bar.awk
-rw-r--r-- 1 james james      546 Mar 21 21:19 bar.awk~
-rw-r--r-- 1 james james      315 Mar 21 19:14 foo.awk
-rw-r--r-- 1 james james      125 Mar 21 18:53 foo.awk~
$ awk -v path=. -f bar.awk
root 4
james 1410

Another:

$ time awk -v path=100kfiles -f bar.awk
root 4
james 342439926

real    0m1.289s
user    0m0.852s
sys     0m0.440s

Yet another test with a million empty files:

$ time awk -v path=../million_files -f bar.awk

real    0m5.057s
user    0m4.000s
sys     0m1.056s
James Brown

Get a listing, add up sizes, and sort it by owner (with Perl)

perl -wE'
    chdir (shift // "."); 
    for (glob ".* *") { 
        next if not -f;
        ($owner_id, $size) = (stat)[4,7]
            or do { warn "Trouble stat for: $_"; next };
        $rept{$owner_id} += $size 
    } 
    say (getpwuid($_)//$_, " => $rept{$_} bytes") for sort keys %rept
'

I didn't get to benchmark it, and it'd be worth trying it out against an approach where the directory is iterated over, as opposed to glob-ed (while I found glob much faster in a related problem).
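
For reference, here is a rough sketch (mine, not benchmarked) of that readdir-based alternative; it keeps the same logic of counting only regular files and accumulating sizes per owner:

#!/usr/bin/perl
use warnings;
use strict;
use feature 'say';

my $dir = shift // ".";
chdir $dir or die "Can't chdir to $dir: $!";

my %rept;
opendir my $dh, "." or die "Can't open $dir: $!";
while (defined(my $entry = readdir $dh)) {
    next if not -f $entry;                     # regular files only (follows symlinks)
    my ($owner_id, $size) = (stat _)[4, 7];    # reuse the stat cached by -f
    $rept{$owner_id} += $size;
}
closedir $dh;

say getpwuid($_) // $_, " => $rept{$_} bytes" for sort keys %rept;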

I expect good runtimes in comparison with ls, which slows down dramatically as the file list in a single directory gets long. That is due to the underlying system, so Perl is affected as well, but as far as I recall it handles it far better. However, I've seen a dramatic slowdown only once entries get to half a million or so, not a few thousand, so I am not sure why it runs slow on your system.

If this needs to recurse into the directories it finds, use File::Find. For example

perl -MFile::Find -wE'
    $dir = shift // "."; 
    find( sub { 
        return if not -f;
        ($owner_id, $size) = (stat)[4,7] 
            or do { warn "Trouble stat for: $_"; return }; 
        $rept{$owner_id} += $size 
    }, $dir ); 
    say (getpwuid($_)//$_, " => $rept{$_} bytes") for keys %rept
'

This scans a directory holding 2.4 GB of mostly small files, spread over a hierarchy of subdirectories, in a little over 2 seconds. du -sh took around 5 seconds (the first time round).


It is reasonable to bring these two into one script

use warnings;
use strict;
use feature 'say';    
use File::Find;
use Getopt::Long;

my %rept;    
sub get_sizes {
    return if not -f; 
    my ($owner_id, $size) = (stat)[4,7] 
        or do { warn "Trouble stat for: $_"; return };
    $rept{$owner_id} += $size 
}

my ($dir, $recurse) = ('.', '');
GetOptions('recursive|r!' => \$recurse, 'directory|d=s' => \$dir)
    or die "Usage: $0 [--recursive] [--directory dirname]\n";

($recurse) 
    ? find( { wanted => \&get_sizes }, $dir )
    : find( { wanted => \&get_sizes, 
              preprocess => sub { return grep { -f } @_ } }, $dir );

say (getpwuid($_)//$_, " => $rept{$_} bytes") for keys %rept;

I find this to perform about the same as the one-dir-only code above, when run non-recursively (default as it stands).

Note that the File::Find::Rule interface has many conveniences but is slower in some important use cases, which clearly matters here. (That analysis should be redone since it's a few years old.)

zdim
  • as well as getpwuid possibly returning nothing (and thus merging distinct uids), if you invoke it in the find sub, you call it once per file, compared to once per uid if you do it during the say. – jhnc Mar 21 '19 at 19:01
  • @jhnc Yes, on both: (1) just added error handling, (2) I'm not concerned with an extra syscall in processing (getting list out of the system is slow) and wanted to gather names, but yes, that would be faster (and probably generally better to keep it as returned by `stat`) – zdim Mar 21 '19 at 19:15
  • @stack0114106 Ah! So that must've been about users that have been removed (or such) so `getwpuid` returned nothing (`undef`) --- a reminder to _always, indeed_ include all requisite tests!!! (Still don't see why the debugging prints failed with the warning "_uninitialized value `$owner_id`_") – zdim Mar 21 '19 at 19:25
  • @zdim I created a folder with 200k files and ran your code with getpwuid in `for` and then moved to `say`. First took 2.456s/1.063s/1.369s, second took 0.862s/0.347s/0.515s. Those extra calls add up! (on an SSD at least...) :-) – jhnc Mar 21 '19 at 20:43
  • btw, I think your find version has typo in the regexp - should be `/^\.\.?$/` or similar – jhnc Mar 21 '19 at 20:56
  • @jhnc ouch, absolutely need be `\.` in regex -- thank you! Funnily, it worked in my tests because it excludes only entries with one or two chars in name, and the only such ones I have are indeed `.` and `..` :). Will revisit this as soon as I can, and will now try to optimize. Thanks for timing it -- I suspect that it will always be (more or less) the same; so that when processing drops to minutes (for hundreds of thousands of files in a dir) that we still have on the order of a second of the delay because of those syscalls. I'll time it :) – zdim Mar 21 '19 at 21:04
  • @stack0114106 Changed the code to call `getpwuid` only when printing the report. It is generally better, and it does appear to have a large effect, increasing with the number of files (?!). I guess I misgauged that. More to come... – zdim Mar 22 '19 at 09:10
  • @stack0114106 By "_current directory_" do you mean without recursion? Like with a command-line flag, either do current dir only or do it all with recursion? – zdim Mar 28 '19 at 20:07
  • @zdim.. yes.. right.. like one solution is the extension of other with some change of flags – stack0114106 Mar 28 '19 at 20:08
  • @zdim.. cleaned up mine.. :-) – stack0114106 Mar 28 '19 at 20:16
  • @stack0114106 Edited. Apart from adding a single script for both, I also changed the recursive one a little: it was including the sizes of directories, what I thought you probably don't want here. If you actually do please let me know and I'll restore the previous behavior. – zdim Mar 29 '19 at 07:39
  • @zdim.. thank you so much for consolidating..yes, the directory sizes can be omitted.. – stack0114106 Mar 29 '19 at 09:00

Not sure why the question is tagged perl when awk is being used.

Here's a simple perl version:

#!/usr/bin/perl

chdir($ARGV[0]) or die("Usage: $0 dir\n");

map {
    if ( ! m/^[.][.]?$/o ) {
        ($s,$u) = (stat)[7,4];
        $h{$u} += $s;
    }
} glob ".* *";

map {
    $s = $h{$_};
    $u = !( $s      >>10) ? ""
       : !(($s>>=10)>>10) ? "k"
       : !(($s>>=10)>>10) ? "M"
       : !(($s>>=10)>>10) ? "G"
       :   ($s>>=10)      ? "T"
       :                    undef
       ;
    printf "%-8s %12d\t%s\n", $s.$u, $h{$_}, getpwuid($_)//$_;
} keys %h;

  • glob gets our file list
  • m// discards . and ..
  • stat the size and uid
  • accumulate sizes in %h
  • compute the unit by bitshifting (>>10 is integer divide by 1024)
  • map uid to username (// provides fallback)
  • print results (unsorted)
  • NOTE: unlike some other answers, this code doesn't recurse into subdirectories

To exclude symlinks, subdirectories, etc., change the if to appropriate -X tests (e.g. (-f $_), (!-d $_ and !-l $_), etc.). See the perl docs on the _ filehandle optimisation for caching stat results.
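
As a concrete (hypothetical) example of that _ optimisation: stat the entry once and let the -f test reuse the cached result, so no second syscall is made per file. Note that stat follows symlinks, so symlinks that point at regular files are still counted:

    if ( ! m/^[.][.]?$/o ) {
        ($s,$u) = (stat)[7,4];     # stat($_) also fills the "_" cache
        $h{$u} += $s if -f _;      # -f _ reads the cached stat, no extra syscall
    }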

jhnc
  • I don't see `m///` in the script. My guess is you're referring to `!/^[.][.]?$/o`? – Aaron Digulla Mar 21 '19 at 16:27
  • yes. `//` is shortcut for `m//`. `m` is only needed if you want to use different delimiter (eg `m[]`, `m<>`, etc). Three slashes was typo. – jhnc Mar 21 '19 at 16:29
  • Please either use `m//` in the script or use the code from the script in the explanation. As it is, it's very confusing for people who don't know a lot about Perl. – Aaron Digulla Mar 21 '19 at 16:32

Using datamash (and Stefan Becker's find code):

find ${dir} -maxdepth 1 -type f -printf "%u\t%s\n" | datamash -sg 1 sum 2
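
A possible follow-up (my addition, assuming GNU coreutils numfmt is available): convert the byte totals into human-readable units, closer to the original alias output:

find ${dir} -maxdepth 1 -type f -printf "%u\t%s\n" | datamash -sg 1 sum 2 | numfmt --field=2 --to=iec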
agc
  • @agc..the answer seems to be simple.. is datamash available in RHEL 6.1? – stack0114106 Mar 22 '19 at 11:38
  • @stack0114106, Not sure -- [RPM files exist](https://rpmfind.net/linux/rpm2html/search.php?query=datamash&submit=Search+...), but whether those work in RHEL 6.1 is unclear without a 6.1 box to test on. – agc Mar 22 '19 at 11:58