
I am working with more than 2,600 genomes and want to compare genome, gene and intergenic features among various taxonomic groups. Groups with only a few representatives are not a problem. For groups with many genomes, on what basis should I remove similar genomes so that only a few representatives remain per group? Should I use length, GC%, or some other feature? For example, if two genomes differ in GC% by less than 1%, I would remove one of them. Please suggest accepted approaches and explain the reasoning behind them.

Example:
I have around 60 genomes of Mycobacterium spp.
More than 20 of these are M. tuberculosis alone, with
a GC% range of 65.48 to 65.7 and
a length range of 4.27 to 4.41 Mb.

How should I screen and remove similar genomes in such cases?

SRKR

2 Answers

I see no reason to use GC% as a filter. Two genomes can have nearly identical GC content and still differ in gene content.

A more functional approach makes more sense to me, such as comparing 1) the genes the genomes share and 2) the sequence similarity of those genes.
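As a rough illustration of the shared-gene idea, here is a minimal Python sketch. It assumes you already have, for each genome, a set of gene-family or ortholog IDs (the IDs, genome names and the 0.95 threshold below are made up for the example). It measures the shared fraction as a Jaccard index and greedily keeps a genome only if it is not too similar to one already kept.

```python
def jaccard(a, b):
    """Fraction of genes shared between two gene sets (0.0 to 1.0)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def pick_representatives(gene_sets, threshold=0.95):
    """Greedy dereplication: keep a genome only if its gene content
    is below `threshold` similarity to every genome already kept."""
    kept = []
    for name in sorted(gene_sets):  # sorted for a reproducible result
        if all(jaccard(gene_sets[name], gene_sets[k]) < threshold for k in kept):
            kept.append(name)
    return kept


# Toy, hypothetical gene-family IDs; real input would come from
# an ortholog clustering of your annotated genomes.
genomes = {
    "Mtb_A": {"g1", "g2", "g3", "g4"},
    "Mtb_B": {"g1", "g2", "g3", "g4"},
    "M_other": {"g1", "g2", "g7", "g8"},
}
reps = pick_representatives(genomes, threshold=0.95)
print(reps)
```

The greedy result depends on the order in which genomes are visited, so in practice you might visit them in order of assembly quality rather than by name. Sequence similarity of the shared genes could be used as a second criterion in the same loop.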

Chrismit

You can build a phylogenetic tree first and then select one or more genomes from each (more or less arbitrarily defined) clade / group / cluster.

I would not recommend building the tree from a single marker gene, because in your case the genomes / species are very closely related and one gene will not resolve them. Try a concatenated alignment of the whole core gene set instead.
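Once you have pairwise distances (for example tree-based distances from the core-gene phylogeny), picking representatives could look like the sketch below. It is only one possible way to do it: the distance values, genome names and the 0.01 cutoff are invented, clusters are formed by simple single linkage, and the representative is the medoid (the member closest to all others in its cluster).

```python
def cluster_by_cutoff(names, dist, cutoff):
    """Single-linkage clusters: any pair closer than `cutoff` is merged."""
    parent = {n: n for n in names}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    for (a, b), d in dist.items():
        if d < cutoff:
            parent[find(a)] = find(b)

    clusters = {}
    for n in names:
        clusters.setdefault(find(n), []).append(n)
    return list(clusters.values())


def medoid(members, dist):
    """Member with the smallest total distance to the rest of its cluster.
    Assumes `dist` holds every pair of members in one orientation."""
    def d(a, b):
        if a == b:
            return 0.0
        return dist.get((a, b), dist.get((b, a)))
    return min(sorted(members), key=lambda m: sum(d(m, o) for o in members))


# Toy, hypothetical distances between four genomes
names = ["tb1", "tb2", "tb3", "leprae"]
dist = {
    ("tb1", "tb2"): 0.001, ("tb1", "tb3"): 0.002, ("tb2", "tb3"): 0.0015,
    ("tb1", "leprae"): 0.3, ("tb2", "leprae"): 0.3, ("tb3", "leprae"): 0.3,
}
clusters = cluster_by_cutoff(names, dist, cutoff=0.01)
reps = sorted(medoid(c, dist) for c in clusters)
print(reps)
```

The cutoff is the "more or less arbitrary" part: a smaller value keeps more representatives per species, a larger one collapses whole species into a single genome.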

Minli Xu