I'm trying to remove duplicate lines from a very large file (~100,000 records) based on the values of the first two columns, regardless of their order, and then print those fields plus the other columns.
So, from this input:
A B XX XX
A C XX XX
B A XX XX
B D XX XX
B E XX XX
C A XX XX
I'd like to have:
A B XX XX
A C XX XX
B D XX XX
B E XX XX
(That is, I want to remove 'B A' and 'C A' because those pairs already appear in the opposite order; I don't care what's in the remaining columns, but I want to print them too.)
I have the impression that this should be easy to do with awk + arrays, but I can't come up with a solution.
So far, I'm tinkering with this:
awk '
NR == FNR {
    h[$1] = $2
    next
}
$1 in h {
    print h[$1], $2
}' input.txt
I'm storing the second column in an array (h) indexed by the first, then checking whether the first field occurs in the stored array and, if so, printing the line. But something's wrong and I get no output.
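My rough idea is that the lookup key needs to be order-independent, perhaps by putting the two fields into a fixed order before using them as the array index. Something along these lines is what I'm aiming for (just a sketch of the idea, assuming whitespace-separated fields and that all the keys fit in memory; I haven't been able to verify it):

awk '{
    # build an order-independent key from the first two fields
    key = ($1 < $2) ? $1 SUBSEP $2 : $2 SUBSEP $1
    # print only the first line seen for each key
    if (!seen[key]++) print
}' input.txt

But I'm not sure whether a single pass like this is the right approach for a file of this size.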
I'm sorry my code isn't very helpful, but I'm kind of stuck on this.
Do you have any ideas?
Thanks a lot!