I have a Newick tree that is built by comparing similarity (euclidean distance) of Position Weight Matrices (PWMs or PSSMs) of putative DNA regulatory motifs that are 4-9 bp long DNA sequences.
An interactive version of the tree is up on iTol (here), which you can freely play with - just press "update tree" after setting your parameters:
My specific goal: to collapse the motifs (tips/terminal nodes/leaves) together if their average distances to the nearest parent clade is < X (ETE2 Python package). This is biologically interesting since some of the gene regulatory DNA motifs may be homologous (paralogues or orthologues) with one another. This collapsing can be done via the iTol GUI linked above, e.g. if you choose X = 0.001 then some motifs become collapsed into triangles (motif families).
My question: Could anybody suggest an algorithm that would either output or help visualise which value of X is appropriate for "maximizing the biological or statistical relevance" of the collapsed motifs? Ideally there would be some obvious step change in some property of the tree when plotted against X which suggests to the algorithm a sensible X. Are there any known algorithms/scripts/packages for this? Perhaps the code will plot some statistic against the value of X? I've tried plotting X vs. mean cluster size (matplotlib) but I don't see an obvious "step increase" to inform me which value of X to use:
My code and data: A link to my Python script is [here][8], I have heavily commented it and it will generate the tree data and plot above for you (use the arguments d_from, d_to and d_step to explore the distance cut-offs, X). You will need to install ete2 by simply executing these two bash commands if you have easy-install and Python:
apt-get install python-setuptools python-numpy python-qt4 python-scipy python-mysqldb python-lxml
easy_install -U ete2