
I have a file where I'd like to check the content of 4 columns. The order of the column pairs can be reversed: if the columns are a,b,c,d, they can also appear as c,d,a,b. So columns a,b and c,d are locked pairs, but the two pairs can be swapped with each other.

I found a similar post here, remove redundancy in a file based on two fields using awk, however the solution does not work at all.

Even with just two columns:

a;b
d;a
b;a
r;f
r;y
a;b
a;d

If I apply the solution provided and marked as correct, I still end up with duplicates:

$ awk '!seen[$1,$2]++ && !seen[$2,$1]++' file
a;b
d;a
b;a
r;f
r;y
a;d

As you can see, there are still both a;b and b;a.

Any suggestion to make this work, considering there would also be four columns? For example:

Dallas;Texas;Berlin;Germany
Paris;France;Tokyo;Japan
Berlin;Germany;Dallas;Texas
Florence;Italy;Dublin;Ireland
Berlin;Germany;Texas;Dallas

should give:

Dallas;Texas;Berlin;Germany
Paris;France;Tokyo;Japan
Florence;Italy;Dublin;Ireland
Berlin;Germany;Texas;Dallas

Note that the last line should not be deleted, because that's a different record. Columns a,b and c,d should be considered as locked pairs: a,b,c,d and c,d,a,b should be considered duplicates, but no other permutation should.

  • Is the leading space part of the file? Remove it from the description if it does not exist – Inian Sep 19 '19 at 10:02
  • I mark this as a duplicate as the OP knows how to solve the problem, but forgot about defining the field separator. The duplicate uses `:` while here it is `;`. This is, however, not a big difference. At the same time, this post should also be marked as a duplicate for the post mentioned in the OP. – kvantour Sep 19 '19 at 10:28
  • @kvantour: Agree with you for the first part. For the part about the row with multiple fields just setting the `;` alone won't work. Going with the earlier logic, having 2^4 variations of the fields won't look good though. If there is a dupe for having to group lines by multiple words, that would apply here though – Inian Sep 19 '19 at 10:31
  • @Inian The OP states in the beginning that _if the columns are a,b,c,d then they can appear also as c,d,a,b._ So essentially the problem is identical. `awk -F ';' '!seen[$1,$2,$3,$4]++ && !seen[$3,$4,$1,$2]++' file`. In this case there are no 2^4 variations. – kvantour Sep 19 '19 at 10:34
  • Apologies, the leading space was not part of the file, I edited it. – James Biffi Sep 19 '19 at 10:34

1 Answer


Well, for the original example with two fields, you missed defining ; as the input field separator. The same logic would have worked had you run it as

awk -F';' '!seen[$1,$2]++ && !seen[$2,$1]++' file
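For instance, a quick check on the two-column sample from the question (using printf to feed the data, so no file is needed):

```shell
# With -F';' the fields are split correctly, so a;b / b;a and
# d;a / a;d are each recognized as the same unordered pair.
printf '%s\n' 'a;b' 'd;a' 'b;a' 'r;f' 'r;y' 'a;b' 'a;d' |
awk -F';' '!seen[$1,$2]++ && !seen[$2,$1]++'
# prints:
# a;b
# d;a
# r;f
# r;y
```

Here a;d is dropped because d;a was already seen, which is the deduplication the question asks for.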

For multiple fields in a row split on a delimiter, it is better to sort the fields alphabetically and build the key from the sorted order. The logic below works irrespective of the order of the elements in a line.

This needs GNU awk because of the asort() function.

Explicit input and output delimiters are not needed in this case, because each line is split on ; only to construct the unique key, and the whole line is printed when the key is new.

awk '{
       # split the line on ";" into arr[] and sort the fields alphabetically
       split($0, arr, ";"); key=""
       asort(arr)
       # build an order-independent key from the sorted fields
       for (i=1; i<=length(arr); i++) {
         key = ( key FS arr[i] )
       }
    }!unique[key]++' file

As a so-called one-liner (admittedly unreadable):

awk '{ split($0, arr, ";"); asort(arr); key=""; for (i=1; i<=length(arr); i++) { key = ( key FS arr[i])  }; }!unique[key]++' file
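If GNU awk is not available, the same sorted-key idea can be sketched in POSIX awk, with a small insertion sort standing in for asort(). Note that, as the comments below point out, sorting all the fields also collapses Berlin;Germany;Texas;Dallas into the first line, since any permutation of the fields is treated as a duplicate:

```shell
# Portable sketch of the sorted-key approach (POSIX awk only).
# printf feeds the four-column sample from the question; no file needed.
printf '%s\n' \
  'Dallas;Texas;Berlin;Germany' \
  'Paris;France;Tokyo;Japan' \
  'Berlin;Germany;Dallas;Texas' \
  'Florence;Italy;Dublin;Ireland' \
  'Berlin;Germany;Texas;Dallas' |
awk '{
  n = split($0, arr, ";")
  # insertion sort, since asort() is GNU-only
  for (i = 2; i <= n; i++) {
    v = arr[i]
    for (j = i - 1; j >= 1 && arr[j] > v; j--) arr[j+1] = arr[j]
    arr[j+1] = v
  }
  # order-independent key from the sorted fields
  key = ""
  for (i = 1; i <= n; i++) key = key FS arr[i]
}
!unique[key]++'
# prints:
# Dallas;Texas;Berlin;Germany
# Paris;France;Tokyo;Japan
# Florence;Italy;Dublin;Ireland
```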

As noted in the comments, if the only possible alternate for a,b,c,d is c,d,a,b, then the following suffices:

awk -F';' '!seen[$1,$2,$3,$4]++ && !seen[$3,$4,$1,$2]++' file 
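A quick run on the four-column sample from the question confirms the expected output; the last line survives because only the exact c,d,a,b rotation is treated as a duplicate:

```shell
# Pairs ($1,$2) and ($3,$4) are locked: a record is a duplicate only
# if the same two pairs appeared before, in either order.
printf '%s\n' \
  'Dallas;Texas;Berlin;Germany' \
  'Paris;France;Tokyo;Japan' \
  'Berlin;Germany;Dallas;Texas' \
  'Florence;Italy;Dublin;Ireland' \
  'Berlin;Germany;Texas;Dallas' |
awk -F';' '!seen[$1,$2,$3,$4]++ && !seen[$3,$4,$1,$2]++'
# prints:
# Dallas;Texas;Berlin;Germany
# Paris;France;Tokyo;Japan
# Florence;Italy;Dublin;Ireland
# Berlin;Germany;Texas;Dallas
```

Only Berlin;Germany;Dallas;Texas is dropped, as the question requires.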
Inian
  • Sorry, but it does not consider a,b and c,d as couples. For example, take Dallas;Texas;Berlin;Germany / Paris;France;Tokyo;Japan / Berlin;Germany;Dallas;Texas / Florence;Italy;Dublin;Ireland / Berlin;Germany;Texas;Dallas. With your script the last line will also be deleted, but that is a separate record. – James Biffi Sep 19 '19 at 10:38
  • It seems that `awk -F';' '!seen[$1,$2,$3,$4]++ && !seen[$3,$4,$1,$2]++' file` does the job. – James Biffi Sep 19 '19 at 10:45