-1

Big question: I want a list of the unique combinations between two fields in a data frame.

Example data:

A   B
C   D
E   F
B   A
C   F
E   F

I would like to get 4 unique combinations: AB, CD, EF, and CF. Since AB and BA contain the same components in a different order, I only want one copy (it is a mutual relationship, so BA is the same thing as AB).

Attempt:

So far I have tried sorting and keeping unique lines:

 sort file | uniq

but of course that produces 5 combinations:

A   B
C   D
E   F
B   A
C   F

I do not know how to approach AB/BA being considered the same. Any suggestions on how to do this?

James Brown
user4670961

3 Answers

3

The idiomatic awk approach is to order the index parts:

$ awk '!seen[$1>$2 ? $1 FS $2 : $2 FS $1]++' file
A   B
C   D
E   F
C   F
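
To see why ordering the index parts works, here is a minimal long-form sketch of the same idea. The file name `pairs.txt` and the wrapper function `unique_pairs` are hypothetical names used only for illustration; the key is built so that both orderings of a pair map to the same string.

```shell
# Sample data from the question, saved under a hypothetical file name.
printf 'A   B\nC   D\nE   F\nB   A\nC   F\nE   F\n' > pairs.txt

# unique_pairs is a hypothetical wrapper name used only for this sketch.
unique_pairs() {
    awk '{
        # Build an order-independent key: the larger field always goes first.
        key = ($1 > $2) ? $1 FS $2 : $2 FS $1
        if (!(key in seen)) {   # first time this unordered pair shows up
            seen[key]           # remember the key (no count needed)
            print               # keep the original line as-is
        }
    }' "$1"
}

unique_pairs pairs.txt
```

On the sample data this should print the same four lines as the one-liner, since `A B` and `B A` both produce the key `B A`.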
Ed Morton
  • Is storing actual data in the array preferable to just keeping an index? It seems to me that `awk '$1 FS $2 in seen {next} $2 FS $1 in seen {next} {seen[$1 FS $2]} 1' file` would be easier on memory, even though it's a few more characters of code. – ghoti Apr 23 '17 at 17:22
  • 1
    `seen` will contain exactly the same values either way, the unique set of $1,$2 pairs. Oh, I see what you're saying - no need to keep the count. That'll be a drop in the ocean and slightly more memory for slightly better efficiency. – Ed Morton Apr 23 '17 at 17:27
3

Another awk one-liner:

awk '!a[$1,$2] && !a[$2,$1]++' file
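
A rough expansion of the one-liner, as a sketch rather than an exact equivalent: it swaps the counter for explicit `in` membership tests on both orderings. The file name `pairs.txt`, the function name `unique_pairs_explicit`, and the variable names are assumptions made for illustration.

```shell
# Sample data from the question, saved under a hypothetical file name.
printf 'A   B\nC   D\nE   F\nB   A\nC   F\nE   F\n' > pairs.txt

# unique_pairs_explicit is a hypothetical name used only for this sketch.
unique_pairs_explicit() {
    awk '{
        fwd = $1 SUBSEP $2      # the same key that a[$1,$2] uses
        rev = $2 SUBSEP $1      # the same key that a[$2,$1] uses
        if (!(fwd in seen) && !(rev in seen))
            print               # neither ordering has been seen yet
        seen[fwd]               # record this ordering for later lines
    }' "$1"
}

unique_pairs_explicit pairs.txt
```

The comma in `a[$1,$2]` joins the subscripts with `SUBSEP`, so multi-character fields do not collide the way plain concatenation can.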
karakfa
2

In awk:

$ awk '($1$2 in a){next}{a[$1$2];a[$2$1]}1' file
A   B
C   D
E   F
C   F

Explained:

($1$2 in a) { next }     # if duplicate in hash, next record
{ a[$1$2]; a[$2$1] } 1   # hash reverse also and output

This works for single-character fields. For longer strings, put FS between the fields, e.g. a[$1 FS $2], so that two different pairs cannot concatenate to the same key (thanks @EdMorton).
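
A small sketch of the difference the separator makes, using made-up multi-character data (`words.txt` and both function names are hypothetical). Without a separator, `ab` + `c` and `a` + `bc` both concatenate to `abc`, so the second line is wrongly treated as a duplicate.

```shell
# Made-up data: "ab c" and "a bc" are different pairs; "c ab" is a true reverse.
printf 'ab c\na bc\nc ab\n' > words.txt

# Plain concatenation: keys collide for multi-character fields.
dedupe_nosep() {
    awk '($1$2 in a){next}{a[$1$2];a[$2$1]}1' "$1"
}

# FS between the fields keeps the two halves of the key apart.
dedupe_fs() {
    awk '(($1 FS $2) in a){next}{a[$1 FS $2];a[$2 FS $1]}1' "$1"
}

dedupe_nosep words.txt   # drops "a bc" as well as "c ab"
dedupe_fs words.txt      # drops only the true reverse, "c ab"
```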

James Brown
  • @ghoti how it is useful or constructive to make fun of me? I am just learning coding and try hard to ask questions in an organized way following the structure outlined on this site. This is the final step in a longer problem I am working on that is primarily using awk. If you don't have anything nice to say, please just keep it to yourself! – user4670961 Apr 23 '17 at 17:45
  • 2
    @EdMorton True, true. – James Brown Apr 23 '17 at 20:50
  • 2
    Have you forgot a `||` in your solution just before 1? You have include it in your explanation but not in your code. – George Vasiliou Apr 23 '17 at 22:12
  • 1
    @EdMorton You are right, it is clearer. First I missed the `{}`s in your post and didn't quite get the logic (literally)... – James Brown Apr 24 '17 at 15:08
  • @user4670961, my comment was not intended to make fun of you, it was intended to highlight the fact that you are asking for help with `awk`, but there is no `awk` code in your question, which I mentioned in a comment on your question as well. As I keep commenting, StackOverflow is about helping people fix their code. It's not a free coding service. I'm happy to just provide a downvote on the question if that's less stressful for you. I often provide comments instead because I tend to forget about downvotes, and I'd prefer to remove them once questions get improved. Comments help with that. – ghoti Apr 24 '17 at 16:06