
I would like to process multiple .gz files with gawk. I was thinking of decompressing them and piping the output to gawk on the fly, but I have an additional requirement to also store/print the original file name in the output.

The thing is, there are hundreds of .gz files of rather large size to process. I'm looking for anomalies (~0.001% of rows) and want to print out the list of found inconsistencies ALONG with the file name and row number that contained each one.

If I could have all the files decompressed, I would simply use the FILENAME variable to get this. Because of the large quantity and size of those files, I can't decompress them upfront.

Any ideas on how to pass the filename (in addition to the gzip stdout) to gawk to produce the required output?

msciwoj

2 Answers


Assuming you are looping over all the files and piping their decompression directly into awk, something like the following will work.

for file in *.gz; do
    gunzip -c "$file" | awk -v origname="$file" '.... {print origname " whatever"}'
done
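As a self-contained sketch of that loop, with a made-up anomaly test (a row whose field count isn't 3 stands in for whatever check identifies your bad rows) and a generated sample file:

```shell
#!/bin/sh
# Sketch of the loop above. The anomaly condition (NF != 3) is
# hypothetical; substitute your real check. FNR gives the row number
# within the current input, and origname carries the file name in.
set -e
tmp=$(mktemp -d)
printf 'a b c\nbad row\nd e f\n' | gzip > "$tmp/sample.gz"
cd "$tmp"
result=$(for file in *.gz; do
    gunzip -c "$file" |
        awk -v origname="$file" 'NF != 3 {print origname ":" FNR ": " $0}'
done)
echo "$result"   # -> sample.gz:2: bad row
```

Because awk reads from the pipe rather than opening the file itself, its built-in FILENAME is useless here; passing the name in with `-v` is what restores it.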

Edit: To use a list of filenames from some source other than a direct glob, something like the following can be used.

$ ls *.awk
a.awk  e.awk
$ while IFS= read -r -d '' filename; do
echo "$filename";
done < <(find . -name \*.awk -printf '%P\0')
e.awk
a.awk
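Combining that NUL-delimited read loop with the decompression pipeline might look like the following (a sketch; requires bash for `read -d ''`, and the NF != 3 test is again a stand-in for the real anomaly check):

```shell
#!/bin/bash
# NUL-safe loop over find output, feeding each file through
# gunzip | awk while passing the file name in via -v.
set -e
tmp=$(mktemp -d)
printf 'a b c\nbad row\n' | gzip > "$tmp/one.gz"
cd "$tmp"
find . -name '*.gz' -print0 |
while IFS= read -r -d '' file; do
    gunzip -c "$file" |
        awk -v origname="$file" 'NF != 3 {print origname ":" FNR ": " $0}'
done > found.txt
cat found.txt   # -> ./one.gz:2: bad row
```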

Using xargs instead of the above loop will, I believe, require the body of the command to be in a pre-written script file, which can then be called with xargs and the filename.
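That xargs variant might be sketched like this (the helper script name `process_one.sh` and the NF != 3 anomaly test are both made up for the example; `xargs -0` needs GNU or BSD xargs):

```shell
#!/bin/sh
# Per-file pipeline lives in a small helper script so xargs can call it.
set -e
tmp=$(mktemp -d)
printf 'a b c\nbad row\n' | gzip > "$tmp/one.gz"
cd "$tmp"
cat > process_one.sh <<'EOF'
#!/bin/sh
# Loop over the file names xargs hands us; pipe each through gunzip | awk.
for file in "$@"; do
    gunzip -c "$file" |
        awk -v origname="$file" 'NF != 3 {print origname ":" FNR ": " $0}'
done
EOF
chmod +x process_one.sh
find . -name '*.gz' -print0 | xargs -0 ./process_one.sh > found.txt
cat found.txt   # -> ./one.gz:2: bad row
```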

Etan Reisner
  • That looks promising! Instead of a 'for' loop, could the list be taken from some (linu/uni)X regular command that simply takes a wildcard pattern as an argument and returns the list of files? I'm just looking to do everything from the command line… – msciwoj Aug 02 '14 at 00:18
  • I don't think I understand the question. You have some command that is going to spit out a list of filenames to operate on? A standard command or something custom? – Etan Reisner Aug 03 '14 at 03:23
  • I was thinking of something like a combination of `ls` and `xargs` – msciwoj Aug 05 '14 at 13:00
  • @msciwoj Edited the answer with an additional option and comment. See if that helps. – Etan Reisner Aug 05 '14 at 13:07
  • @martin You can but you shouldn't. See http://mywiki.wooledge.org/DontReadLinesWithFor for why. Also the linked (and slightly more appropriate for your suggestion) http://mywiki.wooledge.org/ParsingLs. – Etan Reisner Aug 05 '14 at 15:12
  • @EtanReisner Ah, yes, thank you for the reminder, I never use spaces in names, so when I have the control, that's ok. I should change my habits, though. – martin Aug 05 '14 at 15:20
  • @EtanReisner Not sure how that `while` loop relates to my question… Is that to address filenames with spaces in them? Is the combination of find|xargs|gzip|awk impossible? – msciwoj Aug 06 '14 at 22:25
  • @msciwoj Yes, the `while` loop is to handle arbitrary file names correctly. You could certainly pipe the `find` output to an `xargs` which has support for the `-0` argument (i.e. `GNU xargs`). `xargs` is essentially just a glorified `while` loop with the added detail of being able to operate on multiple entries at once when that is possible (which it isn't in this case). – Etan Reisner Aug 06 '14 at 22:38
  • `zcat` (or `gzcat` on some systems) is equivalent to `gunzip -c` – arekolek Jun 30 '16 at 08:36

This uses a combination of xargs and sh (to be able to pipe between two commands: gzip and awk):

find *.gz -print0 | xargs -0 -I fname sh -c 'gzip -dc fname | gawk -v origfile="fname" -f printbadrowsonly.awk >> baddata.txt'
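A self-contained sketch of the command above, with a hypothetical printbadrowsonly.awk (a wrong-field-count check stands in for the real anomaly test; plain awk is used here where the answer uses gawk):

```shell
#!/bin/sh
# xargs -I replaces every "fname" in the sh -c string with the
# NUL-delimited file name from find.
set -e
tmp=$(mktemp -d)
printf 'a b c\nbad row\n' | gzip > "$tmp/one.gz"
cd "$tmp"
cat > printbadrowsonly.awk <<'EOF'
# origfile is passed in with -v; FNR is the row number in the current input.
NF != 3 { print origfile ":" FNR ": " $0 }
EOF
find . -name '*.gz' -print0 |
    xargs -0 -I fname sh -c 'gzip -dc fname | awk -v origfile="fname" -f printbadrowsonly.awk >> baddata.txt'
cat baddata.txt   # -> ./one.gz:2: bad row
```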

I'm wondering if there's any bad practice with the above approach…

msciwoj
  • That find command will not recurse into directories (unless they happen to be named to match that pattern). Just FYI. – Etan Reisner Aug 06 '14 at 22:41
  • You are going to need to wrap that entire pipeline in `{ ...; }` and do the redirection on the outside of that or use `>>` in the `sh` script to avoid clobbering the contents of that file each time `xargs` loops. – Etan Reisner Aug 06 '14 at 22:44
  • good spot, corrected. Files are huge so the latter makes more sense here – msciwoj Aug 06 '14 at 23:03
  • I don't know that I think that those two options are materially (or at all) different with respect to how they function/perform related to file size. Redirection inside means `N` redirections. Redirection outside means `1` redirection. Beyond that I don't know that they differ in any real way. – Etan Reisner Aug 06 '14 at 23:21
  • I thought the `{...;}` would accumulate the output first (hence my concern) and only then redirect to the file in one go. Now that you point it out, you may actually be right that there's no difference and the single redirection happens throughout multiple gzip|awk calls… – msciwoj Aug 06 '14 at 23:36
  • It will buffer output (but so will the internal redirections). For this case I don't think there is any difference. There might be cases (where the input is coming from a file directly or something) where it will make a difference (in held open file handles or something) but I don't believe this is one of them. But I'm speculating a bit out of my concrete knowledge space at this point. – Etan Reisner Aug 07 '14 at 00:06
  • actually interested in that syntax - could you give an example of how would that look like ("need to wrap that entire pipeline in `{ ...; }`")? – msciwoj Aug 08 '14 at 20:01
  • `{ find *.gz -print0 | xargs -0 -I fname sh -c 'gzip -dc fname | gawk -v origfile="fname" -f printbadrowsonly.awk; } > baddata.txt` – Etan Reisner Aug 08 '14 at 22:36
  • Thanks, I'll accept your answer as it's really `awk -v` that does the trick in all variants. Cheers! – msciwoj Aug 10 '14 at 12:29
  • single quote missing, should be: `{ find *.gz -print0 | xargs -0 -I fname sh -c 'gzip -dc fname | gawk -v origfile="fname" -f printbadrowsonly.awk'; } > baddata.txt`, thanks – msciwoj Aug 10 '14 at 17:39