Filter a file to find rows that match in one column but differ in another column

Question

I would like to filter a file so that I can obtain rows that match in column 1 and do not match in column 2. In the following example:

00b27c71-a833-4605-9fb3-a2714ac98092    ENST00000352983.6   157 60  16
00d77e65-466e-4fe6-ad0f-bc6b3f44af75    ENST00000367142.4   130 12  4
00d77e65-466e-4fe6-ad0f-bc6b3f44af75    ENST00000367142.4   8   60 0
00b27c71-a833-4605-9fb3-a2714ac98091    ENST00000258424.2   12  60 2048
00b27c71-a833-4605-9fb3-a2714ac98091    ENST00000352983.6   157 60  16
00d77e65-466e-4fe6-ad0f-bc6b3f44af74    ENST00000367142.5   130 12  4
00d77e65-466e-4fe6-ad0f-bc6b3f44af74    ENST00000367142.7   8   60 0
00d77e65-466e-4fe6-ad0f-bc6b3f44af74    ENST00000258424.2   8   60 0

I would like to find entires in column 1 that appear exactly twice, and that do NOT match in column 2, i.e. duplicates in the combiation column1,column2 should be ignored. So the expected output would be:

00b27c71-a833-4605-9fb3-a2714ac98091    ENST00000258424.2   12  60 2048
00b27c71-a833-4605-9fb3-a2714ac98091    ENST00000352983.6   157 60  16

What is in columns 3,4,5,etc is not important for filtering, but I do need to retain the information.

I also need to pipe this in from another output that is necessary to read the file and retain the header. So I need something in the format:

samtools view -h file.bam | code that I need > results.bam

I have tried several version of awk, but to no avail. Any help would be much appreciated.

RavinderSingh13 · Answer 1 · 2018-07-10T09:26:28.037

1

EDIT: As per OP it should be in single read from awk so adding it not.w

your_command |  awk '
{
  a[$1]++;
  b[$1 FS $2]++;
  c[$1 FS $2]=$0
}
END{
  for(i in a){
    for(j in b){
      split(j,array," ");
      if(a[i]==2 && b[j]==1 && i==array[1]){  print c[j]  }
    }
}}'

Could you please try following and let me know if this helps you.

awk 'FNR==NR{a[$1]++;b[$1 FS $2]++;next} a[$1]==2 && b[$1 FS $2]==1'  Input_file  Input_file

edited Jul 10 '18 at 09:26

answered Jul 10 '18 at 08:46

RavinderSingh13

130,504
14
57
93

I need to pipe it into awk rather than run it on a file twice – csijcs Jul 10 '18 at 08:51
1

Nice answer, +1 --- (looks like you're not using `c[$1 FS $2]=$0;` and can remove from the first action). – jas Jul 10 '18 at 08:54
@jas, sure removed it now, thanks for letting me know cheers. – RavinderSingh13 Jul 10 '18 at 08:55
I believe this will fail as the second entry of the OP's expected output appears twice. – kvantour Jul 10 '18 at 09:10
@csijcs, could you please confirm if this is helping you? – RavinderSingh13 Jul 10 '18 at 09:31
1

This does work great on small files, but for a large file it takes a very long time. Thank you very much for your help though – csijcs Jul 10 '18 at 09:45

kvantour · Accepted Answer · 2018-07-10T09:48:53.497

I believe what you are after is the following:

awk '!($1 FS $2 in a) { b[$1]++; a[$1 FS $2]=$0 }
     END { for(i in a) {$0=i; if (b[$1]==2) print a[i] } }' file

This outputs :

00b27c71-a833-4605-9fb3-a2714ac98091    ENST00000352983.6   157 60  16
00b27c71-a833-4605-9fb3-a2714ac98091    ENST00000258424.2   12  60 2048

What it essentially does is it checks if the combination $1 FS $2 is in the array a. If it is not, keep track of the count of $1 in b[$1] and store the full line in a[$1 FS $2]. In the end, print a[i] if the count is correct. Remark that the key of b is obtained by reassigning the key i to $0. This redefines the fields $1 and $2 and $1 is the key you wanted.

note: the above script does not necessarily keep track of the order as array traversal is done in an unspecified order. If you want to keep the order, you need to keep track of the line-index:

awk '!($1 FS $2 in a) { b[$1]++; a[$1 FS $2]=$0; c[NR]=$1 FS $2 }
     END { for(i=1;i<=NR;++i) if(i in c) { $0=c[i]; if (b[$1]==2) print a[$0]}
     }' file

outputing:

00b27c71-a833-4605-9fb3-a2714ac98091    ENST00000258424.2   12  60 2048
00b27c71-a833-4605-9fb3-a2714ac98091    ENST00000352983.6   157 60  16

old answer:

awk '!($1 in a) { a[$1]=$2; b[$1]=$0; next }
     !match($2,a[$1]){a[$1]=a[$1] FS $2; b[$1]=b[$1] ORS $0}
     END { for (i in a) if (gsub(FS,FS,a[i]) == 1) print b[i] }' file

This outputs :

00b27c71-a833-4605-9fb3-a2714ac98091    ENST00000258424.2   12  60 2048
00b27c71-a833-4605-9fb3-a2714ac98091    ENST00000352983.6   157 60  16

What it essentially does is keep track of two arrays (a and b) both indexed by the first column. If, array a does not contain the element of column $2 then it adds it to the string a[$1]. It also stores the full line in b[$1] separated by ORS. In the end, we count how many fields are in a[i], if it is two, print b[i].

This appears to be working, though I may need to modify in the future in order to find entries in the first column that appear > 3 times and I don't see quite how to do that. I don't care about the last 3 columns for filtering, but I do need to retain that information — csijcs, Jul 10 '18 at 09:19
@csijcs I have written down a simpler version which easily allows you to do this edit. — kvantour, Jul 10 '18 at 09:21
Beautiful! Thank you! And thank you for the explanation on how it works — csijcs, Jul 10 '18 at 09:23
@RomanPerekhrest true, they _might_ appear in the wrong order. However, there was no requirement by the OP that they should be in the same order. — kvantour, Jul 10 '18 at 09:38
@RomanPerekhrest, I have added an extra script which keeps the order. — kvantour, Jul 10 '18 at 09:48
@kvantour, as the Op stated that *Order is not important*, you may choose the 1st approach as it's a bit faster — RomanPerekhrest, Jul 10 '18 at 10:11

Filter a file to find rows that match in one column but differ in another column

2 Answers2