
I would like to delete duplicated chunks of strings in a file

One chunk consists of four lines, such as:

path name

starting point

ending point

voltage number

I would like to delete duplicated chunks within the same row of chunks whenever the ending point is duplicated.
For example, in the first row the ending points of the first and the second chunk are the same, so I would like to keep only the first chunk; the second chunk is removed from the first row.

In the second row, the ending points of the first and the third chunk are the same, so the first chunk is kept and the third is removed.

input.txt:

path_sparc_ffu_dp_out_1885  path_sparc_ffu_dp_out_2759  path_sparc_ffu_dp_out_3115
R_1545/Q    R_1541/Q    R_1545/Q
dp_ctl_synd_out_low[6]  dp_ctl_synd_out_low[6]  dp_ctl_synd_out_low[2]
0.926208    0.910592    0.905082
path_sparc_ffu_dp_out_699   path_sparc_ffu_dp_out_712   path_sparc_ffu_dp_out_819
R_1053/Q    R_1053/Q    R_1053/Q
dp_ctl_synd_out_low[2]  dp_ctl_synd_out_low[6]  dp_ctl_synd_out_low[2]
0.945436    0.945436    0.9435

output.txt:

path_sparc_ffu_dp_out_1885  path_sparc_ffu_dp_out_3115
R_1545/Q        R_1545/Q
dp_ctl_synd_out_low[6]      dp_ctl_synd_out_low[2]
0.926208        0.905082
path_sparc_ffu_dp_out_699   path_sparc_ffu_dp_out_712   
R_1053/Q    R_1053/Q    
dp_ctl_synd_out_low[2]  dp_ctl_synd_out_low[6]  
0.945436    0.945436    

I think awk/sed can do this work. Any help is appreciated.

Best,

Jaeyoung

  • I tried 'uniq', but it only shows unique lines and I don't know how to get unique chunks. I also tried awk, but I am new to awk, so any suggestion is appreciated. – Jaeyoung Park May 12 '16 at 19:05
    I knew this sounded familiar. IMHO you'll do better trying to fix your previous Q (http://stackoverflow.com/questions/37141953/relocation-strings-using-awk-sed-from-a-index-file) rather than this approach. This new layout makes it more difficult to understand your problem. Good luck. – shellter May 12 '16 at 22:37
  • Hi @jaeyoung-park, do all of your chunks contain duplicates, or only some of them? – ej_f May 13 '16 at 15:09

1 Answer


This solution works assuming input data shaped like your sample:

$ sed -r 's/(dp_ctl_synd_out_low\[[0-9]\])(.+)(\1)/\1 \2 -/g' input.txt | paste - - - - | awk '{ dup=($8=="-")?2:3; for(i=1;i<=NF;i++){if(dup!=((i-1)%3+1)){print $i}} }' | paste - -
path_sparc_ffu_dp_out_1885      path_sparc_ffu_dp_out_3115
R_1545/Q        R_1545/Q
dp_ctl_synd_out_low[6]  dp_ctl_synd_out_low[2]
0.926208        0.905082
path_sparc_ffu_dp_out_699       path_sparc_ffu_dp_out_712
R_1053/Q        R_1053/Q
dp_ctl_synd_out_low[2]  dp_ctl_synd_out_low[6]
0.945436        0.945436

I will explain the solution step by step as follows:

Substitute the duplicated ending point with a minus sign (-r enables extended regular expressions in GNU sed):

sed -r 's/(dp_ctl_synd_out_low\[[0-9]\])(.+)(\1)/\1 \2 -/g' input.txt
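As a quick sanity check (assuming GNU sed for the `-r` flag), you can feed a single ending-point line through the substitution and see the repeated occurrence get marked with `-`:

```shell
# Hedged illustration: the second occurrence of the same bracketed
# ending point is replaced by "-", leaving three whitespace-separated
# fields on the line (requires GNU sed's -r extended-regex flag)
echo 'dp_ctl_synd_out_low[6]  dp_ctl_synd_out_low[6]  dp_ctl_synd_out_low[2]' |
  sed -r 's/(dp_ctl_synd_out_low\[[0-9]\])(.+)(\1)/\1 \2 -/g'
```

The line still has three fields afterwards, which is what lets the later awk step locate the duplicate by field position.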

Join each four-line chunk into a single row:

paste - - - -
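With four `-` arguments, `paste` reads standard input four times in round-robin fashion, so every four consecutive lines become one tab-separated row. A minimal demonstration:

```shell
# paste - - - - merges every 4 input lines into one tab-separated row,
# turning 8 lines into 2 rows here
printf 'a\nb\nc\nd\ne\nf\ng\nh\n' | paste - - - -
```

Each four-line chunk of the input file therefore becomes a single 12-field record for awk to process.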

Using awk, exclude the duplicated column (the second or the third):

# find whether the duplicate is in the second or the third column
# (the parenthesized ternary is required by POSIX awk and gawk;
#  assigning inside the branches is a syntax error there)
dup = ($8=="-") ? 2 : 3;
# exclude every field in the duplicate column computed above
for(i=1;i<=NF;i++){
    if(dup!=((i-1)%3+1)){
        print $i
    }
}

Finally, paste the remaining fields back into the original four-line layout:

paste - -
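For completeness, here is a pure-awk sketch (not part of the original answer; the filename `sample.txt` is just for illustration) that does not hard-code the `dp_ctl_synd_out_low` pattern: it buffers each four-line chunk, keeps only the first column for each distinct ending point on the chunk's third line, and reprints the chunk with those columns.

```shell
# A hedged pure-awk alternative: no fixed ending-point pattern needed.
# Build a one-chunk sample (the first chunk from the question):
cat > sample.txt <<'EOF'
path_sparc_ffu_dp_out_1885  path_sparc_ffu_dp_out_2759  path_sparc_ffu_dp_out_3115
R_1545/Q    R_1541/Q    R_1545/Q
dp_ctl_synd_out_low[6]  dp_ctl_synd_out_low[6]  dp_ctl_synd_out_low[2]
0.926208    0.910592    0.905082
EOF

awk '
{
    line[NR % 4] = $0                 # buffer the chunk; NR%4 cycles 1,2,3,0
    if (NR % 4 == 0) {                # 4th line read: chunk is complete
        n = split(line[3], end)       # ending points live on the 3rd line
        split("", seen); m = 0        # reset the per-chunk duplicate set
        for (i = 1; i <= n; i++)      # keep first occurrence of each point
            if (!(end[i] in seen)) { seen[end[i]] = 1; cols[++m] = i }
        for (r = 1; r <= 4; r++) {    # reprint the chunk, kept columns only
            split(line[r % 4], f)
            out = f[cols[1]]
            for (j = 2; j <= m; j++) out = out "\t" f[cols[j]]
            print out
        }
    }
}' sample.txt > result.txt
cat result.txt
```

This also generalizes to rows with more than three chunks, since it deduplicates by value rather than by a fixed second-or-third position.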

I hope this can help you.

ej_f