Can I split a CSV file in multiple output files, adding a header to each file, in one go with awk?

Question

I have a CSV file like this

>cat test.in
1|aaa|bbb
1|ccc|ddd
2|xxx|yyy
2|www|zzz
2|www|ttt

and I want to split it in separate files:

the naming convention should be prefix.FIELD1.FIELD2.out
FIELD1 should not be in the output file
every output file should have a header

Is there a neat way to do it in one go with awk?

So far I've managed to have awk create the output files but I can't make it add the header, so I just loop over the output files and add it afterwards

>cat script.sh
#!/bin/bash

FIELD_SEPARATOR="|"
OUTPUT_HEADER="Key|Value"

awk '{FS=OFS="'${FIELD_SEPARATOR}'"; print $2,$3> "prefix." $1 "." $2 ".out"}' test.in

# add the header to all the output files
echo $OUTPUT_HEADER > header
for filename in $(ls prefix.*.out 2>/dev/null); do
  cat header $filename > $filename.tmp && mv $filename.tmp $filename
done
rm header

which gives the expected output

>ls prefix.*.out
prefix.1.aaa.out  prefix.1.ccc.out  prefix.2.www.out  prefix.2.xxx.out

>cat prefix.1.aaa.out
Key|Value
aaa|bbb

>cat prefix.1.ccc.out
Key|Value
ccc|ddd

>cat prefix.2.www.out
Key|Value
www|zzz
www|ttt

>cat prefix.2.xxx.out
Key|Value
xxx|yyy

David C. Rankin · Accepted Answer · 2022-12-01T14:41:31.053

3

A simple way to do this in awk is keep an array of the filenames created. If the filename isn't already in the array, output the header and then append your field-2 and field-3 as contents. A check that the number of fields is 3 helps ignore blank lines, etc.

You can write your script as:

awk -F"|" '
  BEGIN { hdr="Key|Value"; OFS=FS } 
  NF==3 { 
    ofn="prefix." $1 "." $2 ".out"
    if (! (ofn in arr)) { 
      print hdr > ofn 
    }
    arr[ofn] = 1
    print $2,$3 >> ofn 
  }
' test.in

Or if you like long 1-liners:

awk -F"|" 'BEGIN {hdr="Key|Value"; OFS=FS} NF==3 { ofn="prefix." $1 "." $2 ".out"; if (! (ofn in arr)) { print hdr > ofn } arr[ofn] = 1; print $2,$3 >> ofn }' test.in

Example Use/Output

$ awk -F"|" 'BEGIN {hdr="Key|Value"; OFS=FS} NF==3 { ofn="prefix." $1 "." $2 ".out"; if (! (ofn in arr)) { print hdr > ofn } arr[ofn] = 1; print $2,$3 >> ofn }' test.in

Result:

$ l
total 28
drwxr-xr-x 2 david david 4096 Nov 29 14:07 .
drwxr-xr-x 7 david david 4096 Nov 29 13:57 ..
-rw-r--r-- 1 david david   18 Nov 29 14:07 prefix.1.aaa.out
-rw-r--r-- 1 david david   18 Nov 29 14:07 prefix.1.ccc.out
-rw-r--r-- 1 david david   36 Nov 29 14:07 prefix.2.www.out
-rw-r--r-- 1 david david   18 Nov 29 14:07 prefix.2.xxx.out
-rw-r--r-- 1 david david   50 Nov 29 13:58 test.in

with, e.g.

$ for i in prefix*; do printf "\nfile: %s\n" "$i"; cat "$i"; done

file: prefix.1.aaa.out
Key|Value
aaa|bbb

file: prefix.1.ccc.out
Key|Value
ccc|ddd

file: prefix.2.www.out
Key|Value
www|zzz
www|ttt

file: prefix.2.xxx.out
Key|Value
xxx|yyy

A single awk command is all you need. Let me know if you have questions.

Alternative for Ancient Sun SparcStation without indx in arr Running SunOS (5.10)

Per your comment about the Old Sun SparcStation not implementing indx in array in the awk that is on the machine, an alternative is to build an indexed array and then scan the array to determine whether that output filename already exists and set a flag to indicate the status. This will be slightly less efficient than the indx in array approach, but shouldn't be off by much.

Since the version of awk on SunOS (5.10) will not even response to awk --version, there is no telling how old it is. As mentioned, the last time I worked on a Sun SparkStation was somewhere in 1991-1993, great machines for the time before the internet :).

Give the following a try:

awk -F"|" '
  BEGIN { hdr="Key|Value"; OFS=FS } 
  NF==3 { 
    ofn = "prefix." $1 "." $2 ".out"
    exists = 0
    for (i=0; i<n; i++) {     # loop index arr checking for ofn
      if (arr[i] == ofn) {    # if already in arr
        exists = 1            # set exists flag, break
        break
      }
    }
    if (exists == 0) {        # if not in arr, output header, add to arr
      print hdr > ofn
      arr[n++] = ofn
    }
    print $2,$3 >> ofn        # output data to file
  }
' test.in

(operation and output is the same)

edited Dec 01 '22 at 14:41

answered Nov 29 '22 at 20:14

David C. Rankin

81,885
6
58
85

Yes no need to loop in shell, awk is best – anubhava Nov 30 '22 at 06:19
I get this error, due to the `if` `awk: syntax error near line 5 awk: illegal statement near line 5 awk: syntax error near line 9 awk: bailing out near line 9` if I remove the `if` and print the header every time (which is not what I want, just for the sake of debugging), then it works – Giuseppe Marco Boscardin Nov 30 '22 at 10:55
@GiuseppeMarcoBoscardin so it doesn't like inverting the result of `ofn in arr`. What OS and what `awk` are you using. Post the first line of `awk --version` below. E.g. something like `"GNU Awk 5.2.1, API 3.2, PMA Avon 8-g1, (GNU MPFR 4.1.1, GNU MP 6.2.1)"`, as well as telling us what Operating System (and Linux distro, or Mac) you are using. – David C. Rankin Nov 30 '22 at 16:15
@GiuseppeMarcoBoscardin - instead of inverting the result, try `if ((ofn in arr) == 0) {`. That should work in any of the `awk` flavors. – David C. Rankin Nov 30 '22 at 16:26
even `if ((ofn in arr) == 0)` gives the same error. This is an old SPARC machine with an ancient version of SunOS (5.10) and updating is not an option. That might be the cause. By the way, `awk --version` just hangs and doesn't return anything – Giuseppe Marco Boscardin Dec 01 '22 at 11:13
@GiuseppeMarcoBoscardin -- oh, yes, that is the issue. I haven't used a Sun SparcStation since 91 or 92. Before SCO sued the world over owing Unix... If it is that old, it is likely you don't even have `indx in array` to begin with. Let me think a bit on another way to do the same thing. `indx in array` was created for just this circumstance, and it highly efficient, but we can probably find something else that will come close. How big is your actual file so I have some type of understanding of how much information will be stored. – David C. Rankin Dec 01 '22 at 13:53
SunOS 5.10 isn't as old as I thought. It's only 17 years old -- and unbelievably -- still maintained. It's been closed source since 2010 when Oracle acquired Sun. Would be interesting to find out what `awk` it has. Maybe `man awk` or give `awk --help` or `awk -h` a go and see what options you have. – David C. Rankin Dec 01 '22 at 14:53
the only one of those 3 commands that works is `man awk` but there's no mention of the actual `awk` version, just a generic `SunOS 5.10 Last change: 22 Jun 2005` – Giuseppe Marco Boscardin Dec 02 '22 at 11:21
As per the file size, it's not huge (11 fields and <100k lines ) but I wouldn't mind finding out an efficient solution so that I could eventually use it for files with millions or tens of millions of lines – Giuseppe Marco Boscardin Dec 02 '22 at 11:24
Sorry, I just noticed you edited with the alternative and I confirm that it is working, thanks! – Giuseppe Marco Boscardin Dec 02 '22 at 15:06
1

@GiuseppeMarcoBoscardin - efficiency should be about the same. The only real difference between the two is in the first, the filename is stored as an index, in the second, the filename is stored as the element value. For 100K files, there may be no drop in efficiency at all. It all boils down to whether `indx in array` scan of indexes is faster than the linear scan of values. I'm sure `indx in array` does some type of *divide-and-conqueror* as opposed to a linear search -- but we are getting down in the weeds now. – David C. Rankin Dec 02 '22 at 15:40
@GiuseppeMarcoBoscardin - also, if either one of the answers has answered your question, please take a look at: [What should I do when someone answers my question?](http://stackoverflow.com/help/someone-answers) – David C. Rankin Dec 02 '22 at 15:42
Thanks for the thorough explanation, I really appreciate your help! – Giuseppe Marco Boscardin Dec 05 '22 at 11:17
@GiuseppeMarcoBoscardin you are very welcome. Good luck with your scripting (even on that old box `:)` – David C. Rankin Dec 06 '22 at 03:33

Paul Hodges · Answer 2 · 2022-12-01T15:33:51.420

NO - nononono.... Don't so this -

for filename in $(ls prefix.*.out 2>/dev/null)

Do this -

for filename in prefix.*.out

c.f. https://mywiki.wooledge.org/ParsingLs

But your question, as I read it, is basically whether you can skip that whole structure and just have awk handle that for you as well, the the answer is yes, you certainly can, and all in one pass.

$:  awk -v sep='|' 'BEGIN{FS=OFS=sep} { f="prefix."$1"."$2".out"; print "Key","Value" > f; print $2,$3 > f; }' test.in

$: grep . prefix*
prefix.1.aaa.out:Key|Value
prefix.1.aaa.out:aaa|bbb
prefix.1.ccc.out:Key|Value
prefix.1.ccc.out:ccc|ddd
prefix.2.www.out:Key|Value
prefix.2.www.out:www|zzz
prefix.2.www.out:Key|Value
prefix.2.www.out:www|ttt
prefix.2.xxx.out:Key|Value
prefix.2.xxx.out:xxx|yyy

edit

I feel stupid for putting the header in again and again on every line...

$: awk -v sep='|' 'BEGIN{FS=OFS=sep} { f="prefix."$1"."$2".out"; if (!header[f]) { header[f]=1; print "Key","Value" > f; } print $2,$3 > f; }' test.in

$: grep . prefix*
prefix.1.aaa.out:Key|Value
prefix.1.aaa.out:aaa|bbb
prefix.1.aaa.out:aaa|bbb
prefix.1.aaa.out:aaa|bbb
prefix.1.ccc.out:Key|Value
prefix.1.ccc.out:ccc|ddd
prefix.1.ccc.out:ccc|ddd
prefix.1.ccc.out:ccc|ddd
prefix.2.www.out:Key|Value
prefix.2.www.out:www|zzz
prefix.2.www.out:www|ttt
prefix.2.www.out:www|zzz
prefix.2.www.out:www|ttt
prefix.2.www.out:www|zzz
prefix.2.www.out:www|ttt
prefix.2.xxx.out:Key|Value
prefix.2.xxx.out:xxx|yyy
prefix.2.xxx.out:xxx|yyy
prefix.2.xxx.out:xxx|yyy

edit 2

An alternate way to track filenames by scanning a string-list might work -

$: awk -v sep='|' 'BEGIN{FS=OFS=sep; lst="";}
   { f="prefix."$1"."$2".out";
     if (lst !~ f) { lst=f"\n"lst; print "Key","Value" > f; }
     print $2,$3 > f; }' test.in

Thanks for the link about the `for filename in ls`. Regarding the solution, there is a problem because the header is repeated for every line in a given file, `prefix.2.www.out:Key|Value prefix.2.www.out:www|zzz prefix.2.www.out:Key|Value prefix.2.www.out:www|ttt` whereas it should be only at the very beginning of the file. — Giuseppe Marco Boscardin, Nov 30 '22 at 10:33
This now gives `awk: syntax error near line 1 awk: bailing out near line 1` but, as I mentioned in another comment, it might be due to the server being an old SPARC machine with an ancient version of SunOS (5.10), where updating is not an option — Giuseppe Marco Boscardin, Dec 01 '22 at 11:15
Edit 2 is still giving `awk: syntax error near line 1 awk: bailing out near line 1` — Giuseppe Marco Boscardin, Dec 02 '22 at 15:12

Can I split a CSV file in multiple output files, adding a header to each file, in one go with awk?

2 Answers2

edit

edit 2