Algorithm to decide cut-off for collapsing this tree?

Question

I have a Newick tree that is built by comparing similarity (euclidean distance) of Position Weight Matrices (PWMs or PSSMs) of putative DNA regulatory motifs that are 4-9 bp long DNA sequences.

An interactive version of the tree is up on iTol (here), which you can freely play with - just press "update tree" after setting your parameters:

enter image description here

My specific goal: to collapse the motifs (tips/terminal nodes/leaves) together if their average distances to the nearest parent clade is < X (ETE2 Python package). This is biologically interesting since some of the gene regulatory DNA motifs may be homologous (paralogues or orthologues) with one another. This collapsing can be done via the iTol GUI linked above, e.g. if you choose X = 0.001 then some motifs become collapsed into triangles (motif families).

My question: Could anybody suggest an algorithm that would either output or help visualise which value of X is appropriate for "maximizing the biological or statistical relevance" of the collapsed motifs? Ideally there would be some obvious step change in some property of the tree when plotted against X which suggests to the algorithm a sensible X. Are there any known algorithms/scripts/packages for this? Perhaps the code will plot some statistic against the value of X? I've tried plotting X vs. mean cluster size (matplotlib) but I don't see an obvious "step increase" to inform me which value of X to use:

enter image description here

My code and data: A link to my Python script is [here][8], I have heavily commented it and it will generate the tree data and plot above for you (use the arguments d_from, d_to and d_step to explore the distance cut-offs, X). You will need to install ete2 by simply executing these two bash commands if you have easy-install and Python:

apt-get install python-setuptools python-numpy python-qt4 python-scipy python-mysqldb python-lxml

easy_install -U ete2

@Llopis i've changed to title to make what I said in my comment more clear — hello_there_andy, Apr 29 '14 at 11:56
Then you should show what you have tried (the code) and a minimal set of data to make it easy for others to try it. But maybe is better ask each one on different questions, as they do not need to be related. — llrs, Apr 29 '14 at 13:41
Thank you @Llopis for these helpful tips, I have actually neatened and commented my code (+ data use to make the plot) but it cannot fit in the OP. Maybe I'll make a dropbox link — hello_there_andy, Apr 29 '14 at 13:55
If you have such problem probably you are out of focus and should present the relevant piece of code to the error you find, or present in which step you don't know how to make it to reach your goal. — llrs, Apr 29 '14 at 14:06
AH i meant the data cannot fit, since I wanted to put the data and the script in one (the tree is generated in the users current working dir) — hello_there_andy, Apr 29 '14 at 14:20
let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/51729/discussion-between-llopis-and-hello-there-andy) — llrs, Apr 30 '14 at 08:32
no helpful answers there yet https://www.biostars.org/p/99186/#99225 — hello_there_andy, May 03 '14 at 11:29
"maximising the biological relevance of the collapsed motifs" means next to nothing to me. Could you rephrase this programming question as a programming question? — bukzor, May 05 '14 at 01:17
sorry, the truth is it's hard to know a priori what "biological relevance" is related to anyway... i've rephrased it now — hello_there_andy, May 05 '14 at 11:54
What you're asking for is a threshold above which collapsing a phylogenetic tree would generate biologically relevant clusters. That is a _biology question_, not a programming one: what's at stake is what level of gene homology groups organisms at a biologically relevant category. If you want, you can do a [binary search](http://en.wikipedia.org/wiki/Binary_search_algorithm) among possible values. I doubt this will be effective though. Maybe ask this on [Research Gate](http://www.researchgate.net/) instead, or at the [Biology stack exchange community](http://biology.stackexchange.com) — kadu, May 05 '14 at 13:34
@kadu this carries the bioinformatics tag, the problem should have a programmatic solution (e.g. renowned boinformatics algorithms: needleman-wunsch algorithm, smith-waterman algorithm, etc) if this had a similar solution and there was a bioinformatician there who'd already solved this kind of problem I was hoping they'd post here... — hello_there_andy, May 05 '14 at 17:09
@hello_there_andy I understand it's within bioinformatics, but what I'm questioning is the nature of the question. For example, the needleman-wunsch and smith-waterman are string alignment and comparison algorithms that use matrices to associate values to similarities and gaps between them. While the actual values used in the matrices were proposed in a way that makes them biologically relevant to the study of genetic or proteic sequences, they can easily be tweaked for use in other fields, [like linguistics](http://bit.ly/1s5K8qZ). — kadu, May 08 '14 at 05:38
I'm saying this not because I don't think the question is interesting or relevant, but because it doesn't seem that your problem is with the actual algorithm, but with the biologically relevant values involved. In the comparison you made, it appears to me that you're trying to determine the best values for the similarity matrices and gap penalties, not how to implement the algorithm. If that is so, I think the communities I mentioned will put you in touch with professionals that actually understand the background for the problem you're tackling, and that would lead to a better answer. — kadu, May 08 '14 at 05:44

score 1 · Answer 1 · answered May 08 '14 at 06:48

I think I'd need to know more before I can give specific suggestions. But maybe this will help. I'm assuming that each terminal node is a sequence, and each internal node is a PSSM.

The calculation for X is application specific. For example, the X you get if you want to collapse ultraparalogs isn't the same as the X you get when you want to collapse all homologs.

Since genes are continuously being created via duplication and speciation, there isn't a single value for X that will discriminate sequences by evolutionary relationship. Therefore, I don't expect you'll find a satisfying proxy for determining evolutionary relationships between sequences by looking only at cluster statistics.

A more rigorous method would build a gene tree from the gene of each regulatory motif and reconcile it with a species tree. There's software out there and additional heuristics to ortholog / inparalog identification.

If you do this, the internal nodes of your tree will be decorated with the inferred evolutionary event (e.g., duplication, speciation). Then you can walk up the tree collapsing nodes for clades you don't care about.

+1 for the species-tree reconciliation tip, the problem is our focus is on keeping the analysis pipeline as "de novo" as possible and rely less on things such as phylogenies. We'd have hoped to see a signal somewhere within the data we have here (which has not relied upon prior evolutionary inferences such as a species tree). As I mentioned, the hope was to find some step increase in some metric that would be generated by an algorithm working on the tree or the dna motif profile matrices — hello_there_andy, May 09 '14 at 14:16

score 1 · Accepted Answer · answered Dec 28 '14 at 13:26

You could try and use something similar to tree reconciliation as @Jeff mentioned. But standard tree reconciliation will actually fail.

Reconciliation involves firstly adding branches that represent "losses" of evolutionary characters throughout the target tree. Then indicating the nodes at which "duplications" of evolutionary characters have occurred. The weighted sum of losses and duplications provide a cost function to optimise for.

But in your case, the problem you want to solve is "break this super-tree into appropriately sized, orthologous sub-trees". This means you don't really want to score losses as much as you would duplications. You want a way to score the tree such that it reveals how many orthologous sub-trees are merged into your super-tree. Thus you can try this scoring approach:

Take a super-tree, count the number of duplicate species, S1.
Collapse all terminal leaves that are paralogues and count the new number of duplicate species, S2.
The difference between S1 and S2 reveals approximately how many sub-trees you have in the super-tree.
To correct for any bias caused by various sized super-trees divide by the number of unique species represented in the super-tree N.

If we call this score the "sub-tree factor" then it equates to:

S1 - S2 / N

Inferences:

If S1 - S2 = S1 then it means your super-tree has approximately one true sub-tree within it, that all multiple species occurrences were just due to recent paralogues.
If S1 - S2 = 0 then it means your super-tree has approximately S1 true sub-trees within it.

Algorithm to decide cut-off for collapsing this tree?

2 Answers2

Linked