51

So, in many situations I've wanted a way to know how much of my disk space is used by what, so I know what to get rid of, convert to another format, store elsewhere (such as on data DVDs), move to another partition, etc. In this case, I'm looking at a Windows partition from SliTaz Linux bootable media.

In most cases, what I want is the size of files and folders, and for that I use the ncurses-based ncdu:

                ncdu

But in this case, I want a way to get the size of all files matching a regex. An example regex for .bak files:

.*\.bak$

How do I get that information, considering a standard Linux with core GNU utilities or BusyBox?

Edit: The output is intended to be parseable by a script.

Camilo Martin

6 Answers

58

I suggest something like:

find . -regex '.*\.bak' -print0 | du --files0-from=- -ch | tail -1

Some notes:

  • The -print0 option for find and --files0-from for du are there to avoid issues with whitespace in file names
  • The regular expression is matched against the whole path, e.g. ./dir1/subdir2/file.bak, not just file.bak, so if you modify it, take that into account
  • I used the -h flag for du to produce a "human-readable" format, but if you want to parse the output, you may be better off with -k (always use kilobytes)
  • If you remove the tail command, you will additionally see the sizes of particular files and directories
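
For example, a script-friendly variant (a sketch, assuming GNU find and du; the added -type f avoids double-counting directories, a point raised in a comment below):

find . -type f -regex '.*\.bak' -print0 | du --files0-from=- -ck | tail -1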

Sidenote: a nice GUI tool for finding out who ate your disk space is FileLight. It doesn't do regexes, but is very handy for finding big directories or files clogging your disk.

Michał Kosmulski
  • 3
    +1, looks cool! What about `-s` for `du`? Can't check right now, but I believe `du` can display the grand total without the need for `tail`. That FileLight tool reminds of Gnome's Disk Usage Analyzer. Still, I find the "details view-like" interface of the ncdu app I mentioned in the OP more straightforward, but the diversity is good :) (I've already opened Disk Usage Analyzer to make someone think from that slick UI that what I was doing to fix his PC was more complex than what it actually was... It works! Hehe). – Camilo Martin Feb 28 '12 at 23:32
  • 1
    `-s` displays the grand total for each argument separately - here we have multiple arguments, therefore `-c` is the option we need. – Michał Kosmulski Feb 29 '12 at 08:06
  • Thanks, checked and it works (but not with BusyBox's `du`, since it doesn't support `--files0-from`, so I installed coreutils), so I'll accept this one as it seems immune to terrorist filenames. – Camilo Martin Feb 29 '12 at 15:05
  • I get a "filename too long" error (I filter for 100k or more files) – basZero May 18 '16 at 14:56
  • 2
    There is a flaw here. The find as shown will include directories. du will then total both for the directory and the files in the directory. Nested directories will be counted multple times. I suggest using "-type f" in the find selection. – Steven the Easily Amused Mar 24 '17 at 18:32
  • This is great. On macOS, after doing a `brew install coreutils` to get the GNU variants for `find` and `du`, the command looks like this: `gfind $HOME/ons -type f -regex '.*pre-commit-config.yaml' -print0 | gdu --files0-from=- -cb | sort -n` . This adds a sort to the original command and outputs file sizes in bytes to allow the sort to work. – Ashutosh Jindal Feb 10 '22 at 10:28
30

du is my favorite answer. If you have a fixed filesystem structure, you can use:

du -hc *.bak

If you need to add subdirs, just add:

du -hc *.bak **/*.bak **/**/*.bak

and so on.

However, this isn't a very practical command, so using your find instead:

TOTAL=0; for I in $(find . -name '*.bak'); do TOTAL=$((TOTAL + $(du -k "$I" | awk '{print $1}'))); done; echo $TOTAL

That will echo the total size in kilobytes of all of the files you find (note that du reports 1 KiB blocks, not bytes).
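
Since that loop word-splits find's output, file names containing spaces will break it. A safer sketch of the same idea (assuming bash and GNU find's -print0):

TOTAL=0
# Read NUL-delimited paths so spaces in file names are handled safely
while IFS= read -r -d '' I; do
    TOTAL=$((TOTAL + $(du -k "$I" | awk '{print $1}')))
done < <(find . -type f -name '*.bak' -print0)
echo "$TOTAL"   # grand total in KiB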

Hope that helps.

MaddHacker
3

The previous solutions didn't work properly for me (I had trouble piping du), but the following worked great:

find path/to/directory -iregex ".*\.bak$" -exec du -csh '{}' + | tail -1

The -iregex option matches a case-insensitive regular expression. Use -regex if you want the match to be case-sensitive.

If you aren't comfortable with regular expressions, you can use the -iname or -name flags (the former being case-insensitive):

find path/to/directory -iname "*.bak" -exec du -csh '{}' + | tail -1

In case you want the size of every match (rather than just the combined total), simply leave out the piped tail command:

find path/to/directory -iname "*.bak" -exec du -csh '{}' +

These approaches avoid the subdirectory problem in @MaddHackers' answer.
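
If you also want the individual matches ranked by size, a variant could pipe the per-file sizes through sort (a sketch: -k makes du print kilobytes, so the numeric sort works):

find path/to/directory -iname "*.bak" -type f -exec du -k '{}' + | sort -n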

Hope this helps others in the same situation (in my case, finding the size of all DLL's in a .NET solution).

ben.snape
  • 1
    One should note that `+` means `find` will try to call the `du` command as little as possible by appending as many hits as possible to a single `du` call, however due to system limitations (e.g. max. no. of allowed arguments), it may not be possible to append all hits to a single `du` call, then it will split them across multiple calls and this will cause an incorrect result. – Mecki Apr 18 '17 at 12:08
  • 1
    Oh, and you forgot to quote `*.bak`. In your sample the shell would expand it but you want `find` to expand it, so you must use `"*.bak"`. I'll fix that for you. – Mecki Apr 18 '17 at 12:22
3

Run this in Bash (the $'\n' syntax below is not plain Bourne shell) to declare a function that calculates the sum of the sizes of all files matching a regex pattern in the current directory:

sizeofregex() {
    IFS=$'\n'
    for x in $(find . -regex "$1" 2> /dev/null); do
        du -sk "$x" | cut -f1
    done | awk '{s+=$1} END {print s}' | sed 's/^$/0/'
    unset IFS
}

(Alternatively, you can put it in a script.)

Usage:

cd /where/to/look
sizeofregex 'myregex'

The result will be a number (in KiB), or 0 if there are no files matching your regex.

If you do not want it to look in other filesystems (say you want to look for all .so files under /, which is a mount of /dev/sda1, but not under /home, which is a mount of /dev/sdb1), add a -xdev parameter to find in the function above.
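
For example, a single-filesystem variant of the function (a sketch; the name sizeofregex_xdev is only for illustration):

sizeofregex_xdev() {
    IFS=$'\n'
    # -xdev keeps find from crossing into other mounted filesystems
    for x in $(find . -xdev -regex "$1" 2> /dev/null); do
        du -sk "$x" | cut -f1
    done | awk '{s+=$1} END {print s}' | sed 's/^$/0/'
    unset IFS
}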

Camilo Martin
  • You shouldn't iterate over find's output using a for loop. This will break if a file has spaces. Use find -exec. Also, cut and sed wouldn't be needed to format the output. awk can do it all. – jordanm Feb 28 '12 at 16:47
  • Still pretty hackish even with IFS set. What is wrong with using find -exec? – jordanm Feb 28 '12 at 16:57
  • @jordanm I've always used `IFS=$'\n'` for reading lists, so I'm just used to it :P But you say that `awk` can do it all - I just scratch awk's surface, so if you could post a way of doing it with awk, and it's less hacky, I'll accept it :) I just wanted something that worked, and took me some time to make up that function, so I thought I should share it. It works acceptably fast enough for me actually, but if there's a better way I'm all for it. If It wasn't for a script, ~1 min. per HDD could be indeed too slow. – Camilo Martin Feb 28 '12 at 17:10
  • 1
    What you're doing here is a bad thing, because you're forgetting that file names on UNIX may contain newlines. The only disallowed character is `'\0'`. Recommended reading : http://mywiki.wooledge.org/ParsingLs (it's about `ls`, but don't be fooled by it : you're in the same trap) – Daniel Kamil Kozar Feb 28 '12 at 17:39
  • 1
    `du -sk build/ bin/ | awk '{s+=$1} END { if (s ~ /[0-9]+/) { print s; } else print "0"; }'`. awk can normally do the job of cut, but it your case cut is not needed anyways. – jordanm Feb 28 '12 at 17:44
  • Well, I did know about Unix's disgraceful support of newlines in filenames (which is really unfortunate), but only terrorists put newlines in their filenames (not to mention that I myself am inspecting a Windows partition, and while NTFS would allow such a thing by itself, Windows won't). Otherwise, +1 for the heads-up and a nice snippet, but it's just getting the sum of a couple of folders' sizes. If you know a way that takes the regex idea into account, and post it as an answer, I'll accept it :) – Camilo Martin Feb 28 '12 at 18:30
  • By the way, I think the busybox version of `sh`'s read does not accept NUL delimiters. `man read` [gives me this](http://i.stack.imgur.com/XZruM.png). :( Still, I could get the standard GNU packages. – Camilo Martin Feb 28 '12 at 18:32
1

The accepted reply suggests using

find . -regex '.*\.bak' -print0 | du --files0-from=- -ch | tail -1

but that doesn't work on my system, as du there doesn't know the --files0-from option. Only GNU du has that option: it's not part of the POSIX standard (so you won't find it on FreeBSD or macOS), nor is it available on BusyBox-based Linux systems (e.g. most embedded Linux systems) or on any other Linux system that doesn't use the GNU version of du.

Then there's a reply that suggests using:

find path/to/directory -iregex '.*\.bak$' -exec du -csh '{}' + | tail -1

This solution will work as long as there aren't too many files found: + means that find will try to call du with as many hits as possible in a single call. However, there is a maximum number of arguments (N) a system supports, and if there are more hits than that, find will call du multiple times, splitting the hits into groups of at most N items each. In that case the result will be wrong, showing only the size of the last du call.

Finally, there is an answer using stat and awk, which is a nice way to do it, but it relies on shell globbing in a way that only Bash 4.x or later supports. It will not work with older versions, and whether it works with other shells is unpredictable.

A solution that doesn't suffer from any of these limitations and will work with every shell combines find, stat, and awk. Note that stat itself is not standardized by POSIX and its flags differ between implementations; the BSD/macOS form is shown here:

find . -regex '.*\.bak' -exec stat -f "%z" {} \; | awk '{s += $1} END {print s}'
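
On GNU/Linux, where coreutils stat spells the size format -c "%s" instead, the equivalent sketch would be:

find . -regex '.*\.bak' -exec stat -c "%s" {} \; | awk '{s += $1} END {print s}'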
Mecki
  • This is an excellent write-up, +1 - the finding about the argument count limitation is particularly important because it can give wrong results and drive someone mad until he figures it out. – Camilo Martin May 19 '17 at 04:57
1

If you're OK with glob-patterns and you're only interested in the current directory:

stat -c "%s" *.bak | awk '{sum += $1} END {print sum}'

or

sum=0
while read size; do (( sum += size )); done < <(stat -c "%s" *.bak)
echo $sum

The %s directive to stat gives bytes, not kilobytes.

If you want to descend into subdirectories, then with bash version 4 you can shopt -s globstar and use the pattern **/*.bak:
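
For instance (a sketch assuming bash 4+ and GNU stat):

shopt -s globstar
stat -c "%s" **/*.bak | awk '{sum += $1} END {print sum}'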

glenn jackman
  • So with Bash 4, `**/*.bak` means .bak files on *any subdirectory*? i.e., not just one directory below? – Camilo Martin Feb 28 '12 at 23:15
  • @glennjackman Too bad, it's not working in Bash 4.2 for me. See [this screenshot](http://i.stack.imgur.com/eRWaH.png). It only goes one folder below, as if `**/*.ext` was `*/*.ext`. – Camilo Martin Feb 29 '12 at 14:42
  • @CamiloMartin, did you `shopt -s globstar`? Try `echo $BASH_VERSION` to see what version your current shell is. This works for me: `mkdir -p a/b/c/d; touch a/b/c/d/file.txt; ls **/*txt` – glenn jackman Feb 29 '12 at 16:58
  • @glennjackman `echo $BASH_VERSION` gives me `4.2.0(2)-release`. After doing `shopt -s globstar` as you mentioned, it works in small folder structures, but if I try it on `/`, the CPU usage goes 100% and after a couple of minutes bash is killed. I don't know why, maybe it's because it's a VM on 256MB RAM (well, on this light distro it can browse the web and all with that), but still, seems too unreliable. – Camilo Martin Feb 29 '12 at 20:37
  • @CamiloMartin, it's probably not as efficiently implemented as `find`, but are you really crawling your entire filesystem for files? – glenn jackman Feb 29 '12 at 22:06
  • @glennjackman Not this filesystem's `/`, but another's, so yes, I need it not to die or leak memory if used in the root of a partition. As a side note, I've always found `find` SO MUCH superior to Windows' (unindexed) search feature... I don't understand why Linux scans an NTFS drive faster than Windows. And with regexes, no less! – Camilo Martin Mar 02 '12 at 00:28