0

I would like to filter a file so that I can obtain rows that match in column 1 and do not match in column 2. In the following example:

00b27c71-a833-4605-9fb3-a2714ac98092    ENST00000352983.6   157 60  16
00d77e65-466e-4fe6-ad0f-bc6b3f44af75    ENST00000367142.4   130 12  4
00d77e65-466e-4fe6-ad0f-bc6b3f44af75    ENST00000367142.4   8   60 0
00b27c71-a833-4605-9fb3-a2714ac98091    ENST00000258424.2   12  60 2048
00b27c71-a833-4605-9fb3-a2714ac98091    ENST00000352983.6   157 60  16
00d77e65-466e-4fe6-ad0f-bc6b3f44af74    ENST00000367142.5   130 12  4
00d77e65-466e-4fe6-ad0f-bc6b3f44af74    ENST00000367142.7   8   60 0
00d77e65-466e-4fe6-ad0f-bc6b3f44af74    ENST00000258424.2   8   60 0

I would like to find entires in column 1 that appear exactly twice, and that do NOT match in column 2, i.e. duplicates in the combiation column1,column2 should be ignored. So the expected output would be:

00b27c71-a833-4605-9fb3-a2714ac98091    ENST00000258424.2   12  60 2048
00b27c71-a833-4605-9fb3-a2714ac98091    ENST00000352983.6   157 60  16

What is in columns 3,4,5,etc is not important for filtering, but I do need to retain the information.

I also need to pipe this in from another output that is necessary to read the file and retain the header. So I need something in the format:

samtools view -h file.bam | code that I need > results.bam

I have tried several version of awk, but to no avail. Any help would be much appreciated.

kvantour
  • 25,269
  • 4
  • 47
  • 72
csijcs
  • 47
  • 5

2 Answers2

1

EDIT: As per OP it should be in single read from awk so adding it not.w

your_command |  awk '
{
  a[$1]++;
  b[$1 FS $2]++;
  c[$1 FS $2]=$0
}
END{
  for(i in a){
    for(j in b){
      split(j,array," ");
      if(a[i]==2 && b[j]==1 && i==array[1]){  print c[j]  }
    }
}}'

Could you please try following and let me know if this helps you.

awk 'FNR==NR{a[$1]++;b[$1 FS $2]++;next} a[$1]==2 && b[$1 FS $2]==1'  Input_file  Input_file
RavinderSingh13
  • 130,504
  • 14
  • 57
  • 93
1

I believe what you are after is the following:

awk '!($1 FS $2 in a) { b[$1]++; a[$1 FS $2]=$0 }
     END { for(i in a) {$0=i; if (b[$1]==2) print a[i] } }' file

This outputs :

00b27c71-a833-4605-9fb3-a2714ac98091    ENST00000352983.6   157 60  16
00b27c71-a833-4605-9fb3-a2714ac98091    ENST00000258424.2   12  60 2048

What it essentially does is it checks if the combination $1 FS $2 is in the array a. If it is not, keep track of the count of $1 in b[$1] and store the full line in a[$1 FS $2]. In the end, print a[i] if the count is correct. Remark that the key of b is obtained by reassigning the key i to $0. This redefines the fields $1 and $2 and $1 is the key you wanted.

note: the above script does not necessarily keep track of the order as array traversal is done in an unspecified order. If you want to keep the order, you need to keep track of the line-index:

awk '!($1 FS $2 in a) { b[$1]++; a[$1 FS $2]=$0; c[NR]=$1 FS $2 }
     END { for(i=1;i<=NR;++i) if(i in c) { $0=c[i]; if (b[$1]==2) print a[$0]}
     }' file

outputing:

00b27c71-a833-4605-9fb3-a2714ac98091    ENST00000258424.2   12  60 2048
00b27c71-a833-4605-9fb3-a2714ac98091    ENST00000352983.6   157 60  16

old answer:

awk '!($1 in a) { a[$1]=$2; b[$1]=$0; next }
     !match($2,a[$1]){a[$1]=a[$1] FS $2; b[$1]=b[$1] ORS $0}
     END { for (i in a) if (gsub(FS,FS,a[i]) == 1) print b[i] }' file

This outputs :

00b27c71-a833-4605-9fb3-a2714ac98091    ENST00000258424.2   12  60 2048
00b27c71-a833-4605-9fb3-a2714ac98091    ENST00000352983.6   157 60  16

What it essentially does is keep track of two arrays (a and b) both indexed by the first column. If, array a does not contain the element of column $2 then it adds it to the string a[$1]. It also stores the full line in b[$1] separated by ORS. In the end, we count how many fields are in a[i], if it is two, print b[i].

kvantour
  • 25,269
  • 4
  • 47
  • 72
  • This appears to be working, though I may need to modify in the future in order to find entries in the first column that appear > 3 times and I don't see quite how to do that. I don't care about the last 3 columns for filtering, but I do need to retain that information – csijcs Jul 10 '18 at 09:19
  • @csijcs I have written down a simpler version which easily allows you to do this edit. – kvantour Jul 10 '18 at 09:21
  • Beautiful! Thank you! And thank you for the explanation on how it works – csijcs Jul 10 '18 at 09:23
  • @RomanPerekhrest true, they _might_ appear in the wrong order. However, there was no requirement by the OP that they should be in the same order. – kvantour Jul 10 '18 at 09:38
  • Order is not important, I can sort after if necessary. – csijcs Jul 10 '18 at 09:39
  • @RomanPerekhrest, I have added an extra script which keeps the order. – kvantour Jul 10 '18 at 09:48
  • @kvantour, as the Op stated that *Order is not important*, you may choose the 1st approach as it's a bit faster – RomanPerekhrest Jul 10 '18 at 10:11