2

I have my data of pairwise DNA sequences showing similarity in the following way..

AATGCTA|1   AATCGTA|2
AATCGTA|2   AATGGTA|3
AATGGTA|3   AATGGTT|8
TTTGGTA|4   ATTGGTA|5
ATTGGTA|5   CCTGGTA|9
CCCGGTA|6   GCCGGTA|7
GGCGGTA|10  AATCGTA|2
GGCGGTA|10  TGCGGTA|11
CAGGCA|12   GAGGCA|13

The above is a sample input file, the original file is few millions rows. I want output to be cluster the overlapping ids based on the common elements between the rows and output them to one single line for each cluster, as below

AATGCTA|1   AATCGTA|2   AATGGTA|3   AATGGTT|8   GGCGGTA|10  TGCGGTA|11
TTTGGTA|4   ATTGGTA|5   CCTGGTA|9
CCCGGTA|6   GCCGGTA|7
CAGGCA|12   GAGGCA|13 

I am currently trying to cluster them using mcl and also silix, I was not successful in running silix. But the mcl is currently in progress, I would like to know if there are any other ways smart ways of doing this may be in awk or perl. I appreciate some solution, thank you. (this is my first post I am sorry if I have made some mistake)

Just to make it simpler.. is it easy to say my input is,

1   2
2   3
3   8
4   5
5   9
6   7
10  2
10  11
12  13

and I want output to be,

1   2   3   8   10  11
4   5   9
6   7
12  13
Has QUIT--Anony-Mousse
  • 76,138
  • 12
  • 138
  • 194
bala
  • 553
  • 4
  • 7
  • What makes each sequence common? – Chris Seymour Jan 10 '13 at 21:02
  • Each sequence shows some similarity to other sequence, that is how they are grouped, in the end I want to know which group/cluster of sequences form one group, I am interested to group them based on pairwise matches. Main help is required to group them into single line, each line represents a cluster – bala Jan 10 '13 at 21:30
  • Please try and explain what you mean by similarity/grouped/common it's not clear and I can't pick out the structure, the first row starts `AAT..` but the 5th element starts `GGC..` why? – Chris Seymour Jan 10 '13 at 21:37
  • I have modified my question, see towards the end, does it make sense – bala Jan 10 '13 at 21:45
  • how are the 6 columns defined – Kent Jan 10 '13 at 21:49
  • 1
    Glad it's not just me failing to understand, I still can't see how you defined the rows and columns `:|` – Chris Seymour Jan 10 '13 at 21:50
  • the input has only 2 columns in the input file, they determine pairs. I want to make a union of pairs based on the intersecting identifier, for eg. 1 2 are a pair 2 3 is the next pair, the common number that is connecting between the rows is number 2, thus I can join them based on the overlapping number 2. so my output in this eg. will be 1 2 3 (dont read as 3 columns), I can call them as one cluster. Is this better ? – bala Jan 10 '13 at 21:56
  • In your simplified example, I can see that that `1 2 3 8` are connected, but given that series, I don't see how 10 matches that rule. Good luck! – shellter Jan 10 '13 at 21:58
  • 1 2 in first row and 10 2 in row seven, 2 is common between them, which will group 10 in the cluster and thus 11 from the row eight 10 11 (because 10 is the overlapping number) – bala Jan 10 '13 at 22:06
  • This is *not clustering*. You want to compute the transitive closure of your identity relation. The straightforward approach would use some hashmap mapping records to the list of identical records, then print all unique lists at the end. – Has QUIT--Anony-Mousse Jan 11 '13 at 07:21

2 Answers2

1

I think this is not really it, but anyway:

use strict;
use warnings;
my @rows;
my %indx;
while(<DATA>) {
  chomp;
  my @v = split (/\s+/);
  my $r = {};
  for my $k (@v) {
    $r = $indx{$k}[0] if defined $indx{$k};
  }
  $r->{$v[0]}++;
  $r->{$v[1]}++;
  # print join(",", @v), "\n";
  push(@{$indx{$v[0]}}, $r);
  push(@{$indx{$v[1]}}, $r);
  push(@rows,  $r);
}
my %seen;
for my $r (@rows) {
  print (join("\t", keys %$r), "\n") if not $seen{$r}++;
}

__DATA__
AATGCTA|1   AATCGTA|2
AATCGTA|2   AATGGTA|3
AATGGTA|3   AATGGTT|8
TTTGGTA|4   ATTGGTA|5
ATTGGTA|5   CCTGGTA|9
CCCGGTA|6   GCCGGTA|7
GGCGGTA|10  AATCGTA|2
GGCGGTA|10  TGCGGTA|11
CAGGCA|12   GAGGCA|13

Output:

GGCGGTA|10  AATGCTA|1   AATGGTT|8   AATCGTA|2   AATGGTA|3   TGCGGTA|11
CCTGGTA|9   TTTGGTA|4   ATTGGTA|5
CCCGGTA|6   GCCGGTA|7
CAGGCA|12   GAGGCA|13
perreal
  • 94,503
  • 21
  • 155
  • 181
  • Thank you, seems to be working. I will change your script to read an input file and see if it works, – bala Jan 10 '13 at 22:35
  • This won't work if you append `CAGGCA|12 GAGGCA|1` to your input, i.e., there should be at least one unique element at each row. But I'll update this soon. – perreal Jan 10 '13 at 22:38
  • In my input there is always one unique element, so the above may be sufficient – bala Jan 11 '13 at 14:49
1

as you wished, here comes awk solution:

awk 'BEGIN{f=1}{c=0;
        for(i=1;i<=f;i++){
                if(!a[i]){
                        a[i]=$1" "$2; c=1; break;
                }else if(a[i]~$1){
                        a[i]=a[i]" "$2; c=1; break;
                }else if(a[i]~$2){ a[i]=a[i]" "$1; c=1; break; }
        }
        if(!c){ a[++f]=$1" "$2; c=0; }
} END{for(x=1;x<=f;x++)print a[x]}' DnaFile

the codes above was tested with your simpler input file and original file(with CCGGTTAA etc.) both worked. Output omitted.

Kent
  • 189,393
  • 32
  • 233
  • 301