How to delete a duplicate column using an awk script

Question

I have the following

444, 1234, (1234), 3453534, 43534543

I want the output to be

444, (1234), 3453534, 43534543

I think awk would be the best tool for this, but I have not been able to come up with a solution.

In the output line the unique column can be placed anywhere, but only the value in brackets must remain.

In short, we need to get rid of the unbracketed value whenever it duplicates a bracketed one.

For example, if we have the columns:

(1234) 1234 ----> we want it to be (1234)

Thanks a lot

I mean, not exactly. But I don't want the values without the braces to be there if they exist. — nirvanastack, Nov 07 '14 at 03:57
Do you really care if it's `1234` or `(1234)` that's printed? It's a lot easier to print the first time a string is seen and ignore subsequent occurrences than to print only the last time it's seen. Will duplicate entries always occur contiguously or could they be spread across the line, e.g. `a, b, (a), c, a`? In the example I just gave, should `a` be printed because it's the last occurrence or should `(a)` be printed because it exists? — Ed Morton, Nov 07 '14 at 04:02
And should they be printed in the slot where the bracketed entry existed, in the slot where the first non-bracketed entry appeared, or somewhere else? Honestly, with the tiniest bit of effort you could do a lot better with your sample input and expected output to show the intricacies of the problem. — Ed Morton, Nov 07 '14 at 04:06
[This](http://stackoverflow.com/questions/2978361/uniq-in-awk-removing-duplicate-values-in-a-column-using-awk) may help, it seems. — Arnab Nandy, Nov 07 '14 at 04:07
Ed Morton: I am sorry if that's not very clear. The entry can be placed anywhere. Irrespective of its previous place — nirvanastack, Nov 07 '14 at 04:08
nirvanastack please update your question to show sample input that covers all the cases I've asked about plus the other cases you know about that I haven't thought to ask about, and the expected output to match. @Skynet - no the answer given in that question is WAY too complicated for this question (and for that one too, best I can tell at a glance). — Ed Morton, Nov 07 '14 at 04:10
I am dying to get an answer from you people. :P. This thing has been bugging me for a while — nirvanastack, Nov 07 '14 at 04:34
Try using a Perl hash to remove duplicates and then match only the bracketed value; it may help. My question is: will the duplicate values be on the same line or anywhere in the file? — Praveen, Nov 07 '14 at 05:34
I do see what you want, but cannot see any simple solution to test all fields and preserve only `(...)` if there are duplicates. — Jotne, Nov 07 '14 at 08:13
What result do you want for the line `1234, (1234), 1234, (1234)`? And are the spaces after the commas really there in the file? — Borodin, Nov 07 '14 at 10:48

score 1 · Answer 1 · answered Nov 09 '14 at 06:32

If I make the following assumptions:

There's only one unique (bracketed) column per line
The delimiter is the same everywhere in the line, except that the last field is followed by the end of the line ($) instead

Then here's an executable awk file for removing duplicates as stated in the question:

#!/usr/bin/awk -f

BEGIN {FS=", "}

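match($0, /\([[:alnum:]]*\)/) {
  p=substr($0, RSTART, RLENGTH)   # bracketed value, e.g. "(1234)"
  # Used as a regex, the parentheses in p only group, so p matches the bare
  # 1234 (caveat: also the tail of a longer field such as 51234).
  gsub(p "(" FS "|$)", "")        # remove bare duplicates from $0
  sub(FS "$", "")                 # clean up trailing delimiter
}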

47

Or, when removing the assumption of only one unique column per line:

#!/usr/bin/awk -f

BEGIN {FS=", "}

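{
  for(i=1;i<=NF;i++) {
    # Look for the value of this field wrapped in brackets anywhere on the line
    if(match($0, "\\(" $i "\\)")) {
      p=substr($0, RSTART, RLENGTH)   # e.g. "(1234)"; as a regex it matches bare 1234
      gsub(p "(" FS "|$)", "")        # remove bare duplicates from $0 (fields are re-split)
    }
  }
  sub(FS "$", "")                     # clean up trailing delimiter
}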

47

In each case, gsub updates $0 directly to remove duplicates instead of operating on the individual fields, and the bare 47 is a pattern that always evaluates to true (any nonzero constant would do), so $0 is printed whether it was altered or not.
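
To see those two ideas in isolation, here is a minimal sketch. The function name `strip_trailing_sep` and the sample line are assumptions made up for illustration; the awk program itself only uses `sub` on `$0` and a bare `47` pattern.

```sh
# Hypothetical wrapper (the name is not from the answer): edits $0 in place,
# then relies on a bare nonzero constant as an always-true pattern to print it.
strip_trailing_sep() {
  awk '
    BEGIN { FS = ", " }
    { sub(FS "$", "") }   # edit the whole record, not individual fields
    47                    # any nonzero constant works; 1 is the usual choice
  '
}

printf '444, (1234), \n' | strip_trailing_sep   # prints: 444, (1234)
```

Lines that have no trailing delimiter pass through unchanged, because the `47` rule prints every record regardless of whether `sub` found anything to replace.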

score 0 · Answer 2 · answered Nov 07 '14 at 12:48

If I understood correctly, for each input line all the (value) fields have to be collected first, and then every bare value field that matches one of them has to be skipped. I assume that every field ends with a comma except the last one.

Here is my suggestion:

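awk '{
  delete a; s = ""   # reset per-line state (delete of a whole array: gawk and most modern awks)
  # Pass 1: collect every (...) field, stored as "value," to match how bare fields look
  for (i = 1; i <= NF; ++i) {
    if (match($i, /^\((.*)\),?$/)) {
      num = $i; gsub(/(^\(|\),?$)/, "", num)
      a[num ","] = 1
    }
  }
  # Pass 2: skip every field whose value was collected (the last field has no comma, so add one)
  for (i = 1; i <= NF; ++i) if (!(($i)(i < NF ? "" : ",") in a)) s = s FS $i
  # Trim the leading field separator and the trailing comma (if any)
  gsub("(^" FS "|,$)", "", s)
  print s
}' inputfile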

Input file:

444, 1234, (1234), 3453534, 43534543
444, (1235), 1235, 1235, 1234, 3453534, 43534543
444, (1235), 1235, 1235, 1234, 3453534, 43534543, (1234)
444, 1235, 1235, 1235, 1234, 3453534, 43534543
444, 1234, (1234)
444, (1235), 1235

Output:

444, (1234), 3453534, 43534543
444, (1235), 1234, 3453534, 43534543
444, (1235), 3453534, 43534543, (1234)
444, 1235, 1235, 1235, 1234, 3453534, 43534543
444, (1234)
444, (1235)
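
As a variation on the same two-pass idea, the following is a minimal sketch that splits on the literal ", " delimiter instead of on whitespace, so no comma bookkeeping is needed. It assumes every field is separated by exactly a comma and one space; `dedupe_bracketed` is a hypothetical name chosen for this example, not something from the question.

```sh
# Hypothetical helper: reads lines on stdin, writes the filtered lines to stdout.
dedupe_bracketed() {
  awk '
    BEGIN { FS = OFS = ", " }
    {
      split("", seen)                      # portable way to clear the array
      # Pass 1: record every value that appears in brackets on this line.
      for (i = 1; i <= NF; i++)
        if ($i ~ /^\(.*\)$/)
          seen[substr($i, 2, length($i) - 2)] = 1
      # Pass 2: keep a field unless the same value also appears in brackets.
      out = ""; n = 0
      for (i = 1; i <= NF; i++)
        if (!($i in seen))
          out = (n++ ? out OFS : "") $i
      print out
    }
  '
}

printf '444, 1234, (1234), 3453534, 43534543\n' | dedupe_bracketed
# prints: 444, (1234), 3453534, 43534543
```

Because whole fields are compared rather than substrings, a value such as 51234 is not affected by a bracketed (1234) on the same line.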

I hope this helps a bit!

Thanks a lot TrueY for your efforts. — nirvanastack, Nov 08 '14 at 06:58
@nirvanastack: Is it what you need? — TrueY, Nov 08 '14 at 13:08
