My data consists of pairs of similar DNA sequences, one pair per row, in the following format:
AATGCTA|1 AATCGTA|2
AATCGTA|2 AATGGTA|3
AATGGTA|3 AATGGTT|8
TTTGGTA|4 ATTGGTA|5
ATTGGTA|5 CCTGGTA|9
CCCGGTA|6 GCCGGTA|7
GGCGGTA|10 AATCGTA|2
GGCGGTA|10 TGCGGTA|11
CAGGCA|12 GAGGCA|13
The above is a sample input file; the original file has a few million rows. I want to cluster the overlapping ids based on the elements shared between rows, and output each cluster on a single line, as below:
AATGCTA|1 AATCGTA|2 AATGGTA|3 AATGGTT|8 GGCGGTA|10 TGCGGTA|11
TTTGGTA|4 ATTGGTA|5 CCTGGTA|9
CCCGGTA|6 GCCGGTA|7
CAGGCA|12 GAGGCA|13
I am currently trying to cluster them using mcl and also silix. I could not get silix to run, and the mcl run is still in progress. I would like to know whether there are any other smart ways of doing this, perhaps in awk or perl. I would appreciate any solution, thank you. (This is my first post; I am sorry if I have made a mistake.)
To make it simpler, my input can be reduced to just the ids:
1 2
2 3
3 8
4 5
5 9
6 7
10 2
10 11
12 13
and I want the output to be:
1 2 3 8 10 11
4 5 9
6 7
12 13
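Put another way, each row is an edge between two ids and each output line is one connected component of the resulting graph. Here is a minimal sketch of that idea in Python (not awk or perl, and not what mcl or silix do internally), using a union-find structure. The function name cluster_pairs is hypothetical, and the sketch assumes every row holds exactly two whitespace-separated tokens and that all ids fit in memory.

```python
from collections import defaultdict


def cluster_pairs(pairs):
    """Group ids into connected components, treating each pair as an edge.

    `cluster_pairs` is an illustrative name, not part of any library.
    """
    parent = {}  # id -> parent id; a root is its own parent

    def find(x):
        # First pass: walk up to the root of x's tree.
        root = x
        while parent[root] != root:
            root = parent[root]
        # Second pass: point every node on the path directly at the root.
        while parent[x] != root:
            parent[x], x = root, parent[x]
        return root

    for a, b in pairs:
        parent.setdefault(a, a)
        parent.setdefault(b, b)
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two clusters

    # Collect members per root; dicts keep first-seen order of ids.
    clusters = defaultdict(list)
    for node in parent:
        clusters[find(node)].append(node)
    return list(clusters.values())


# Demo on the simplified sample input from this post.
sample = """1 2
2 3
3 8
4 5
5 9
6 7
10 2
10 11
12 13"""
sample_pairs = [line.split() for line in sample.splitlines()]
for members in cluster_pairs(sample_pairs):
    print(" ".join(members))
```

The tokens are treated as opaque strings, so the same function should work on rows like AATGCTA|1 AATCGTA|2 without changes. For the real file, the pairs could be produced lazily from the open file (splitting each non-empty line) instead of from an in-memory string.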