BASH Split CSV Into Multiple Files Based on Column Value

Question

I have a file named fulldata.tmp which contains pipe delimited data (I can change it to comma if needed but generally like using pipe). With a BASH Shell script I would like to split lines out to new files based on the value in column 1 and retain the header. I'm pulling this data via SQL so I can pre-sort if needed but I don't have direct access to the terminal running this script so development and debugging is difficult. I've searched dozens of examples mostly recommending awk but I'm not connecting the dots. This is my core need and below are a couple quality of life options I'd like if it's easy along with example data.

Nice if possible: I would like to specify which columns print to the new files (my example desired output shows I want columns 1-4 out of the initial 5 columns).

Nice if possible: I would like the new files named with a prefix then the data that is being split on followed by extension: final_$col1.csv

GROUPID|LABEL|DATE|ACTIVE|COMMENT
ABC|001|2022-09-15|True|None
DEF|001|2022-09-16|False|None
GHI|002|2022-10-17|True|Future

final_ABC.csv

ABC|001|2022-09-15|True

final_DEF.csv

DEF|001|2022-09-16|False

final_GHI.csv

GHI|002|2022-10-17|True

Jetchisel · Accepted Answer · 2022-09-16T19:04:00.343

2

Maybe awk

awk -F'|' -v OFS='|' 'NR>1{print $1, $2, $3, $4 > "final_"$1".csv"}' fulldata.tmp

Check the created csv files and it's content.

tail -n+1 final*.csv

Output

==> final_ABC.csv <==
ABC|001|2022-09-15|True

==> final_DEF.csv <==
DEF|001|2022-09-16|False

==> final_GHI.csv <==
GHI|002|2022-10-17|True

Here is how I would do the header.

IFS= read -r head < fulldata.tmp

Then use the variable to awk.

 awk -F'|' -v header="${head%|*}" 'NR>1{printf "%s\n%s|%s|%s|%s\n", header, $1, $2, $3, $4 > "final_"$1".csv"}' fulldata.tmp

Run tail again to check.

tail -n+1 final*.csv

Output

  ==> final_ABC.csv <==
GROUPID|LABEL|DATE|ACTIVE
ABC|001|2022-09-15|True

==> final_DEF.csv <==
GROUPID|LABEL|DATE|ACTIVE
DEF|001|2022-09-16|False

==> final_GHI.csv <==
GROUPID|LABEL|DATE|ACTIVE
GHI|002|2022-10-17|True

You did find a solution with pure awk.

edited Sep 16 '22 at 19:04

answered Sep 16 '22 at 00:50

Jetchisel

7,493
2
19
18

1

Thank you @Jetchisel This does the splitting and column selection perfect. I'll mark as accepted answer. Do you know of a way to retain the header? – sandsawks Sep 16 '22 at 01:01
1

I took what you provided it and tweaked it bit with reviewing some other threads. I don't have to have the header in the original file so i removed it from my program and used this code which was modded from yours to manually add the header in. ```awk -F'|' -v OFS='|' 'prev!=$1{close(prev".csv"); print "GROUPID|LABEL|DATE|ACTIVE" > ("final_"$1".csv")}{print $1, $2, $3, $4 > ("final_"$1".csv");prev=$1}' fulldata.tmp ``` – sandsawks Sep 16 '22 at 02:14

score 0 · Answer 2 · answered Sep 16 '22 at 01:10

This works and preserves the header which I believe was a requirement.

cut -d '|' -f 1 fulldata.tmp | grep -v GROUPID | sort -u | while read -r id; do grep -E "^${id}|^GROUPID" fulldata.tmp > final_${id}.csv; done

I think a pure awk solution is better though.

BASH Split CSV Into Multiple Files Based on Column Value

2 Answers2