Say I have the following structure of files and directories:

$ tree
.
├── a
├── b
└── dir
    └── c

1 directory, 3 files

That is, two files, a and b, plus a directory dir that contains another file, c.

I want to process all the files with awk (GNU Awk 4.1.1, to be exact), so I do something like this:

$ gawk '{print FILENAME; nextfile}' * */*
a
b
awk: cmd. line:1: warning: command line argument `dir' is a directory: skipped
dir/c

This works, but the * also expands to the directory dir, and awk tries to process it.

So I wonder: is there a native way for awk to check whether a given argument is a regular file and, if it is not, skip it? That is, without using system() for it.

I made it work by calling an external command through system() in BEGINFILE:

$ gawk 'BEGINFILE{print FILENAME; if (system(" [ ! -d " FILENAME " ]")) {print FILENAME, "is a dir, skipping"; nextfile}} ENDFILE{print FILENAME, FNR}' * */*
a
a 10
a.wk
a.wk 3
b
b 10
dir
dir is a dir, skipping
dir/c
dir/c 10

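Note also that if (system(" [ ! -d " FILENAME " ]")) {print FILENAME, "is a dir, skipping"; nextfile} reads counterintuitively: system() does not return 1 when the test is true, it returns the command's exit status. An exit status of 0 (false in awk) means the test succeeded, and a nonzero status (true in awk) means it failed, so the condition fires when [ ! -d ... ] fails, that is, when FILENAME is a directory.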
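The exit-status convention can be seen in isolation with a small sketch. The temporary directory and the helper name is_dir_status are made up for illustration, and the path is wrapped in single quotes (\047) so that names containing spaces survive the trip through the shell; names containing single quotes would still need extra escaping.

```shell
# Scratch area with one directory and one regular file (illustrative names).
tmp=$(mktemp -d)
mkdir "$tmp/dir"
: > "$tmp/file"

# Hypothetical helper: print what system() returns for "[ -d PATH ]".
# \047 is the octal escape for a single quote inside the awk string.
is_dir_status() {
    awk -v p="$1" 'BEGIN { print system("[ -d \047" p "\047 ]") }'
}

rc_dir=$(is_dir_status "$tmp/dir")    # test succeeds, exit status 0
rc_file=$(is_dir_status "$tmp/file")  # test fails, exit status 1

echo "dir:  $rc_dir"
echo "file: $rc_file"

rm -rf "$tmp"
```

Comparing the result explicitly, as in system(...) == 0, tends to read more naturally than relying on the truth value of the exit status.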

I read in A.5 Extensions in gawk Not in POSIX awk:

And then the linked page says:

4.11 Directories on the Command Line

According to the POSIX standard, files named on the awk command line must be text files; it is a fatal error if they are not. Most versions of awk treat a directory on the command line as a fatal error.

By default, gawk produces a warning for a directory on the command line, but otherwise ignores it. This makes it easier to use shell wildcards with your awk program:

$ gawk -f whizprog.awk *    # Directories could kill this program

If either of the --posix or --traditional options is given, then gawk reverts to treating a directory on the command line as a fatal error.

See Extension Sample Readdir, for a way to treat directories as usable data from an awk program.

And in fact it is the case: the same command as before with --posix fails:

$ gawk --posix 'BEGINFILE{print FILENAME; if (system(" [ ! -d " FILENAME " ]")) {print FILENAME, "is a dir, skipping"; nextfile}} ENDFILE{print FILENAME, NR}' * */*
gawk: cmd. line:1: fatal: cannot open file `dir' for reading (Is a directory)

I checked the 16.7.6 Reading Directories section that is linked above and they talk about readdir:

The readdir extension adds an input parser for directories. The usage is as follows:

@load "readdir"

But I am not sure either how to call it or how to use it from the command line.

tripleee
fedorqui

2 Answers

I would simply avoid passing directories to awk, since even POSIX says that all filename arguments must be text files.

You can use find for traversing the directory:

find PATH -type f -exec awk 'program' {} +
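As a concrete sketch of that command, the following recreates the layout from the question in a temporary directory (the directory and the variable name found are illustrative) and prints each file name once. It uses FNR == 1 instead of nextfile so that it does not depend on a particular awk, and sorts the result because find does not promise any ordering.

```shell
# Recreate the question's layout: files a and b, plus dir/c.
tmp=$(mktemp -d)
mkdir "$tmp/dir"
printf 'x\n' > "$tmp/a"
printf 'x\n' > "$tmp/b"
printf 'x\n' > "$tmp/dir/c"

# -type f hands only regular files to awk, so the directory never reaches it.
found=$(cd "$tmp" && find . -type f -exec awk 'FNR == 1 { print FILENAME }' {} + | sort)
printf '%s\n' "$found"

rm -rf "$tmp"
```

Under these assumptions the output lists ./a, ./b and ./dir/c, with no warning about dir.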
hek2mgl
  • Yes! I think this is the cleanest way to do it. I nevertheless wonder if `awk` can do it in any way. I edited my question because I had mistakenly used `system()`, so now it works like that, but still I don't like the fact of calling an external command for this. – fedorqui Dec 01 '15 at 10:50
  • @fedorqui I also played around a bit with `@load readdir` (Nice to know, thanks).. I came to the same result, meaning using `system()` to check whether filename is a directory. I don't see a different way. – hek2mgl Dec 01 '15 at 10:51
  • Thanks again hek! I finally accepted Ed Morton's answer since it does it in an awk way. Even though the recommendation is not to do it in general. – fedorqui Dec 03 '15 at 10:03
  • @fedorqui Good decision! His answer is nice! – hek2mgl Dec 03 '15 at 10:08

If you wanted to safeguard your script from other people mistakenly passing a directory (or anything else that's not a readable text file) to it, you could do this:

$ ls -F tmp
bar  dir/  foo

$ cat tmp/foo
line 1

$ cat tmp/bar
line 1
line 2

$ cat tmp/dir
cat: tmp/dir: Is a directory

$ cat tst.awk
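BEGIN {
    # Probe every command-line argument before the main input loop starts.
    for (i=1;i<ARGC;i++) {
        # getline returns 1 on success, 0 at end of file (so empty files
        # are skipped too), and -1 on error.
        if ( (getline line < ARGV[i]) <= 0 ) {
            print "Skipping:", ARGV[i], ERRNO
            # Deleting the argument stops awk from trying to open it later.
            delete ARGV[i]
        }
        # Release the file opened by the probe.
        close(ARGV[i])
    }
}
# Main rule: runs only for the arguments that survived the check.
{ print FILENAME, $0 }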

$ awk -f tst.awk tmp/*
Skipping: tmp/dir Is a directory
tmp/bar line 1
tmp/bar line 2
tmp/foo line 1

$ awk --posix -f tst.awk tmp/*
Skipping: tmp/dir
tmp/bar line 1
tmp/bar line 2
tmp/foo line 1

Per POSIX, getline returns -1 when it fails to retrieve a record from a file (e.g. the file is unreadable, does not exist, or is a directory). You only need GNU awk if you care which of those failures it was, which it reports in ERRNO; as the --posix run above shows, ERRNO is empty in that mode.
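The three return values that the <= 0 test relies on can be checked with a small sketch. The helper name getline_status and the file names are made up for illustration; the directory case is deliberately left out because how a read from a directory fails may differ between awk implementations.

```shell
# Scratch files: one with content, one empty, and one that is never created.
tmp=$(mktemp -d)
printf 'line 1\n' > "$tmp/foo"
: > "$tmp/empty"

# Hypothetical helper: print the return value of one getline from a file.
getline_status() {
    awk -v f="$1" 'BEGIN { rc = (getline line < f); close(f); print rc }'
}

status_ok=$(getline_status "$tmp/foo")          # a record was read
status_empty=$(getline_status "$tmp/empty")     # end of file straight away
status_missing=$(getline_status "$tmp/missing") # file cannot be opened

echo "readable: $status_ok"
echo "empty:    $status_empty"
echo "missing:  $status_missing"

rm -rf "$tmp"
```

A file with at least one record gives 1, an empty file gives 0, and a file that cannot be opened gives -1, which is why the script's check skips empty files as well as unreadable ones.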

Ed Morton
  • Niiiice! So `getline` on a directory does not directly fail but can be handled. – fedorqui Dec 03 '15 at 10:02
  • Right. When I first read your question I thought you were trying to use awk to search for files/dirs (sorry - short attention span!) but on re-reading it looks like you just want to safeguard against someone calling the script with non-file args - there's nothing wrong with doing that and above is how you do it. I've updated my answer to be a bit more supportive of that! – Ed Morton Dec 03 '15 at 13:17
  • Yes, exactly. It is just to prevent warnings, or even exit codes, due to the fact that a dir is expanded in a supposedly just-files-list. Very interesting answer from which I learnt quite a lot, thanks : ) – fedorqui Dec 03 '15 at 15:28