Remove duplicate records from a csv file considering single column

Question

I have a file with records like this:

,laac_repo,cntrylist,idlist,domlist,typelist
1,22DE17,BA,S6CD6728,24JA13,6A
2,12FE18,AA,S6FD7688,25DA15,7D
3,22DE17,BA,S6CD6728,24JA13,6A
4,12FE18,AA,S6FD7688,25DA15,7D

I want to remove duplicate records based on the 4th column (the one holding values such as "S6CD6728"), while skipping the first row, which is:

",laac_repo,cntrylist,idlist,domlist,typelist"

I have tried

awk '{a[$4]++}!(a[$4]-1)' filename

And also tried

awk 'FNR > 1 {a[$4]++}!(a[$4]-1)' filename

The expected output is:

,laac_repo,cntrylist,idlist,domlist,typelist
1,22DE17,BA,S6CD6728,24JA13,6A
2,12FE18,AA,S6FD7688,25DA15,7D

P.S. The file has more than 10 million records, so please suggest a solution with that in mind. (A script rather than a single command would be much appreciated.)

please update the question to show the (correct) expected output — markp-fuso, Feb 15 '22 at 16:08

Darkman · Answer 1 · 2022-02-15T18:02:40.103

1

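What about this? Without -F, awk splits on whitespace, so $4 is empty on every line of this file; setting the field separator to a comma makes $4 the idlist column. This version skips the header line:

awk -F, 'FNR>1 && !seen[$4]++' filename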

1,22DE17,BA,S6CD6728,24JA13,6A
2,12FE18,AA,S6FD7688,25DA15,7D

To keep the header line in the output:

awk -F, '!seen[$4]++' filename

,laac_repo,cntrylist,idlist,domlist,typelist
1,22DE17,BA,S6CD6728,24JA13,6A
2,12FE18,AA,S6FD7688,25DA15,7D
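
Since the question asks for a script rather than a one-liner, the same idea could be wrapped in a small shell function. This is only a sketch: the function name dedupe_by_column and its optional column argument are made up for illustration and are not part of the commands above. It assumes plain comma-separated fields with no quoted commas, and it always prints the first line as the header.

```shell
#!/bin/sh
# dedupe_by_column FILE [COLUMN]
# Hypothetical helper (name and arguments are assumptions, not from the answer).
# Prints the header line unchanged, then only the first row seen for each
# distinct value in COLUMN (defaults to 4, the idlist column).
dedupe_by_column() {
    awk -F, -v col="${2:-4}" '
        FNR == 1 { print; next }   # pass the header through untouched
        !seen[$col]++              # print a row only the first time its key appears
    ' "$1"
}
```

It could be called as dedupe_by_column filename > deduped.csv (the output name is arbitrary). The file is read once and row order is preserved. The seen array holds one entry per distinct key, so memory use grows with the number of unique idlist values rather than with the total number of lines.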

edited Feb 15 '22 at 18:02

answered Feb 15 '22 at 15:35

Darkman
