
I have the following

444, 1234, (1234), 3453534, 43534543

I want the output to be

444, (1234), 3453534, 43534543

I think awk is the best tool for this, but I have not been able to come up with a solution.

In the output line the unique column can be placed anywhere; what matters is that only the bracketed value remains.

In short, we need to get rid of a value without brackets if the same value also appears in brackets.

For example, if we have the columns as:

(1234) 1234 ----> we want it to be (1234)

Thanks a lot

mpapec
nirvanastack
  • Are the parentheses part of the data? – Ray Toal Nov 07 '14 at 03:56
  • How is `1234,` a duplicate of `(1234),`? – Avinash Raj Nov 07 '14 at 03:56
  • I mean, not exactly. But I don't want the values without the brackets to be there if the bracketed value exists. – nirvanastack Nov 07 '14 at 03:57
  • Do you really care if it's `1234` or `(1234)` that's printed? It's a lot easier to print the first time a string is seen and ignore subsequent occurrences than to print only the last time it's seen. Will duplicate entries always occur contiguously or could they be spread across the line, e.g. `a, b, (a), c, a`? In the example I just gave, should `a` be printed because it's the last occurrence or should `(a)` be printed because it exists? – Ed Morton Nov 07 '14 at 04:02
  • Yes. I need only the bracketed values to be printed. – nirvanastack Nov 07 '14 at 04:04
  • They will be spread across the line. – nirvanastack Nov 07 '14 at 04:04
  • And should they be printed in the slot the bracketed entry existed in, or the slot the first non-bracketed entry appeared in, or somewhere else? Honestly, with the tiniest bit of effort you could do a lot better with your sample input and expected output to show the intricacies of the problem. – Ed Morton Nov 07 '14 at 04:06
  • [This](http://stackoverflow.com/questions/2978361/uniq-in-awk-removing-duplicate-values-in-a-column-using-awk) may help, it seems. – Arnab Nandy Nov 07 '14 at 04:07
  • Ed Morton: I am sorry if that's not very clear. The entry can be placed anywhere. Irrespective of its previous place – nirvanastack Nov 07 '14 at 04:08
  • nirvanastack please update your question to show sample input that covers all the cases I've asked about plus the other cases you know about that I haven't thought to ask about, and the expected output to match. @Skynet - no, the answer given in that question is WAY too complicated for this question (and for that one too, best I can tell at a glance). – Ed Morton Nov 07 '14 at 04:10
  • I am dying to get an answer from you people. :P. This thing has been bugging me for a while. – nirvanastack Nov 07 '14 at 04:34
  • Try using a Perl hash to remove duplicates and then match only the bracketed value; it may help you. My question is: will the duplicate values be on the same line or anywhere in the file? – Praveen Nov 07 '14 at 05:34
  • I do see what you want, but I cannot see any simple solution to test all fields and preserve only `(...)` if there are duplicates. – Jotne Nov 07 '14 at 08:13
  • What result do you want for the line `1234, (1234), 1234, (1234)`? And are the spaces after the commas really there in the file? – Borodin Nov 07 '14 at 10:48

2 Answers

1

If I make the following assumptions:

  • There's only one unique (bracketed) column per line
  • The delimiter is the same everywhere in the line except at the end: $

Then here's an executable awk file that removes duplicates as stated in the question:

#!/usr/bin/awk -f

BEGIN {FS=", "}

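match($0, /\([[:alnum:]]*\)/) {
  p=substr($0, RSTART, RLENGTH)   # bracketed value, e.g. "(1234)"
  # Used as a dynamic regex, the parentheses in p act as a group, so this
  # matches the bare value followed by the delimiter or the end of the line.
  gsub(p "(" FS "|$)", "")        # remove unbracketed duplicates from $0
  sub(FS "$", "")                 # clean up trailing delimiters
}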

47

Or, when removing the assumption of only one unique column per line:

#!/usr/bin/awk -f

BEGIN {FS=", "}

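{
  for(i=1;i<=NF;i++) {
    # Look for a bracketed copy of field i anywhere in the record
    if(match($0, "\\(" $i "\\)")) {
      p=substr($0, RSTART, RLENGTH)   # bracketed value, e.g. "(1234)"
      # As a dynamic regex the parentheses in p group, so the bare value matches
      gsub(p "(" FS "|$)", "")        # remove unbracketed duplicates from $0
    }
  }
  sub(FS "$", "")                     # clean up trailing delimiters
}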

47

In each case, $0 is updated using gsub to remove duplicates instead of operating on the individual fields, and the 47 is a constant pattern that evaluates to true, so $0 is printed whether it was altered or not.
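One caveat of editing $0 as a whole: the dynamic regex is not anchored to field boundaries, so a longer value that merely ends in the bracketed digits (say a hypothetical `51234` next to `(1234)`) would be trimmed as well. A minimal field-wise sketch that avoids this, assuming the delimiter is always exactly ", " and using a made-up function name `dedupe_bracketed`, could look like this:

```shell
# Hypothetical helper (the name is not from the thread): reads lines on stdin
# and drops every bare field whose value also appears in brackets on that line.
dedupe_bracketed() {
  awk -F', ' '
  {
    split("", seen)                         # portable way to empty the array
    for (i = 1; i <= NF; i++)
      if ($i ~ /^\(.*\)$/)                  # bracketed field such as (1234)
        seen[substr($i, 2, length($i) - 2)] = 1
    out = ""; n = 0
    for (i = 1; i <= NF; i++) {
      if (($i) in seen) continue            # bare copy of a bracketed value
      out = out (n++ ? FS : "") $i          # rejoin the kept fields with ", "
    }
    print out
  }'
}

printf '%s\n' '444, 1234, (1234), 3453534, 43534543' | dedupe_bracketed
# prints: 444, (1234), 3453534, 43534543
```

Because whole fields are compared, `51234, (1234), 7` passes through unchanged. Bracketed entries keep their original position, which is only one of the placements discussed in the comments.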

n0741337
0

If I understood correctly, for each input line all the bracketed (value) fields have to be collected first, and then every bare value field that matches one of them has to be skipped. I assume that every field ends with a comma except the last one.

Here is my suggestion:

awk ' { delete a; s="" # Reset tmp values
  #Search for all (...) fields
  for(i=1;i<=NF;++i) {
    if (match($i,/^\((.*)\),?$/)) {
        num=$i; gsub(/(^\(|\),?$)/,"",num);
        a[num","]=1;
    }
  }
  #Skip all fields contained by a hash
  for(i=1;i<=NF;++i) if(!(($i)(i<NF?"":",") in a)) s=s FS $i;
  # Trim leading field separator and trailing comma (if exists)
  gsub("(^"FS"|,$)","",s);
  print s;
}' inputfile

Input file:

444, 1234, (1234), 3453534, 43534543
444, (1235), 1235, 1235, 1234, 3453534, 43534543
444, (1235), 1235, 1235, 1234, 3453534, 43534543, (1234)
444, 1235, 1235, 1235, 1234, 3453534, 43534543
444, 1234, (1234)
444, (1235), 1235

Output:

444, (1234), 3453534, 43534543
444, (1235), 1234, 3453534, 43534543
444, (1235), 3453534, 43534543, (1234)
444, 1235, 1235, 1235, 1234, 3453534, 43534543
444, (1234)
444, (1235)
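The script relies on awk's default whitespace splitting, which leaves the trailing comma attached to every field except the last; that is why the array keys are stored with a comma and why one is appended when the last field is tested. A small sketch that prints the fields as awk sees them (the function name `show_fields` is only for illustration):

```shell
# Hypothetical helper: wrap each whitespace-separated field in [ ] so the
# commas that stay attached to the fields become visible.
show_fields() {
  awk '{ for (i = 1; i <= NF; i++) printf "[%s]", $i; print "" }'
}

printf '%s\n' '444, 1234, (1234)' | show_fields
# prints: [444,][1234,][(1234)]
```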

I hope this helps a bit!

TrueY