I'm trying to remove duplicate lines in a very large file (~100,000 records) according to the values of the first two columns without taking into account their order, and then print those fields + the other columns.

So, from this input:

A B XX XX
A C XX XX
B A XX XX
B D XX XX
B E XX XX
C A XX XX

I'd like to have:

A B XX XX
A C XX XX
B D XX XX
B E XX XX

(That is, I want to remove 'B A' and 'C A' because they already appear in the opposite order; I don't care about what's in the next columns but I want to print it too)

I have the impression that this should be easy to do with awk + arrays, but I can't come up with a solution.

So far, I'm tinkering with this:

awk '
NR == FNR {
h[$1] = $2   
next
}
$1 in h {
print h[$1],$2}' input.txt

I'm storing the second column in an array indexed by the first (h), and then check if there are occurrences of the first field in the stored array. Then, print the line. But something's wrong and I have no output.

I'm sorry because my code is not helpful at all but I'm kind of stuck with this.

Do you have any ideas?

Thanks a lot!

xgrau

2 Answers

5

Just keep track of the pairs that have appeared, in both orders:

$ awk '!seen[$1,$2]++ && !seen[$2,$1]++' file
A B XX XX
A C XX XX
B D XX XX
B E XX XX

Which is equivalent to awk '!(seen[$1,$2]++ || seen[$2,$1]++)' file.

Note that it also gives the same result without the ++ in the second expression (see comments):

awk '!seen[$1,$2]++ && !seen[$2,$1]' file
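A quick way to convince yourself that the three variants behave the same, at least on the sample from the question, is to run them side by side. This is only a sketch: the variable names (input, v1, v2, v3) are made up for the illustration, and agreeing on one sample is not a proof for every input.

```shell
# Sample input from the question.
input='A B XX XX
A C XX XX
B A XX XX
B D XX XX
B E XX XX
C A XX XX'

# The three variants discussed above, each fed the same input.
v1=$(printf '%s\n' "$input" | awk '!seen[$1,$2]++ && !seen[$2,$1]++')
v2=$(printf '%s\n' "$input" | awk '!(seen[$1,$2]++ || seen[$2,$1]++)')
v3=$(printf '%s\n' "$input" | awk '!seen[$1,$2]++ && !seen[$2,$1]')

# Report whether the outputs are identical.
if [ "$v1" = "$v2" ] && [ "$v1" = "$v3" ]; then
  echo "all three agree"
else
  echo "outputs differ"
fi
```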

Explanation

The typical approach to print unique lines is:

awk '!seen[$0]++' file

This creates an array seen[] whose indexes are the lines that have appeared so far. For a new line, seen[$0] is 0. The post-increment returns that old value, so !seen[$0]++ evaluates to true and the line is printed (in awk, a true pattern with no action prints the current line); only afterwards does the counter become 1. When the line has been seen already, seen[$0] has a positive value, so !seen[$0] is false and nothing is printed.
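To watch this happen, here is a small trace on a three-line toy input. It is only an illustration: the variables before and keep, and the shell variable trace, are names invented for this sketch.

```shell
trace=$(printf 'a\nb\na\n' | awk '{
  before = seen[$0] + 0        # counter value before this line is processed
  keep   = !seen[$0]++         # same test as the one-liner, captured in a variable
  print $0, before, seen[$0], (keep ? "printed" : "skipped")
}')
printf '%s\n' "$trace"
```

Each output row shows the line, its counter before and after the test, and whether the plain `!seen[$0]++` would have printed it. The second `a` finds a counter of 1, so it is skipped.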

In your case you want to keep track of which pairs have appeared, regardless of order, so I check and record the key in both possible orders: $1,$2 and $2,$1.
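One edge case worth knowing about: when both columns hold the same value (say `C C`), the two subscripts are the same array element, so the first test increments it and the second test then sees 1. Such a line is dropped even on its first appearance. If your data can contain such rows, a possible alternative (a sketch, not the method used above) is to build a single order-independent key. The function name dedupe_pairs and the variables a, b and key are made up for the example.

```shell
# Keep the first line for each unordered pair of the first two columns.
dedupe_pairs() {
  awk '{
    a = $1 ""; b = $2 ""                      # force string comparison
    key = (a < b) ? a SUBSEP b : b SUBSEP a   # smaller value always goes first
  }
  !seen[key]++' "$@"
}

printf 'A B x\nB A y\nC C z\nC C w\n' | dedupe_pairs
```

Here `A B` and `B A` map to the same key, and the first `C C` line is kept because the key is tested only once per line.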

fedorqui
  • 1
    Thanks a lot! Can you explain a little bit the syntax of the "seen" command, please? -- EDIT: I've seen that you've already done it, thanks. – xgrau Oct 26 '15 at 14:02
  • 1
    @EdMorton ops, in fact I just forgot using it, funny that it works either way : ) – fedorqui Oct 26 '15 at 14:06
  • 1
    @XGrau `seen` is simply an associative array where fedorqui stores the two different combinations of values, eg: `seen["AB"]` and `seen["BA"]`. The name is arbitrary - could have been `z` for that matter. – Andreas Louv Oct 26 '15 at 14:09
  • 2
    @EdMorton You won't need the last `++`. Consider this: `A B`, `A B`, `B A` -> `!0 && !0`, `!1 && !1` , `!0 && !2`. Where it could as easily have been: `!0 && !0`, `!1 && !0`, `!0 && !2` – Andreas Louv Oct 26 '15 at 14:11
  • 3
    @dev-null Yeah, I finally broke down and tested it and figured out that on a subsequent line `seen[b,a]` is already 1 from the previous `seen[a,b]++`. I much prefer the script with them though as it makes it clearer (to me at least). – Ed Morton Oct 26 '15 at 14:13
  • @dev-null Was just going to comment the same! – 123 Oct 26 '15 at 14:15
0

Use the following:

$ awk '{ if (!(($1,$2) in a) && !(($2,$1) in a)) a[$1,$2] = $0 } END { for (i in a) print a[i] }' input.txt

Explanation: the command stores each record in the array a, keyed by the combination of the first and second fields, but only if neither that key ($1,$2) nor its reverse ($2,$1) is already present. Once the whole file has been read, it prints the contents of a. The key is written as $1,$2 rather than $1$2 so that awk joins the fields with SUBSEP; with plain concatenation, pairs such as "AB C" and "A BC" would produce the same key. Note that for (i in a) visits the keys in an unspecified order, so the output lines may not come out in input order.

# (($1,$2) in a) is true if the key $1,$2 already exists in array a
# if neither $1,$2 nor $2,$1 is present, store the record in a
if (!(($1,$2) in a) && !(($2,$1) in a)) a[$1,$2] = $0

# at the end, print every stored (unique) record
END { for (i in a) print a[i] }
narendra