I have several thousand gene trees that I am trying to prepare for analysis with codeml. The tree below is a typical example. I want to automate the collapsing of tips or nodes that appear to be duplicates. For instance, the descendants of node 56 are tips 26, 27, 28, etc., all the way to 36. All of these other than tip 28 appear to be duplicates. How can I collapse them into a single tip, leaving just tip 28 and one representative of the other tips as the descendants of node 56?

I know how to manually do this one by one, but I am trying to automate the process so that a function can identify which tips need to be collapsed and then reduce them to a single representative tip. So far I have been looking at the cophenetic function which calculates the distances between the tips. However, I am not sure how to use that information to collapse tips.

Here is the Newick string for the tree below:

((((11:0.00201426,12:5e-08,(9:1e-08,10:1e-08,8:1e-08)40:0.00403036)41:0.00099978,7:5e-08)42:0.01717066,(3:0.00191517,(4:0.00196859,(5:1e-08,6:1e-08)71:0.00205168)70:0.00112995)69:0.01796015)43:0.042592645,((1:0.00136179,2:0.00267375)44:0.05586907,(((13:0.00093161,14:0.00532243)47:0.01252989,((15:1e-08,16:1e-08)49:0.00123243,(17:0.00272478,(18:0.00085725,19:0.00113572)51:0.01307761)50:0.00847373)48:0.01103656)46:0.00843782,((20:0.0020268,(21:0.00099593,22:1e-08)54:0.00099081)53:0.00297097,(23:0.00200672,(25:1e-08,(36:1e-08,37:1e-08,35:1e-08,34:1e-08,33:1e-08,32:1e-08,31:1e-08,30:1e-08,29:1e-08,28:0.00099682,27:1e-08,26:1e-08)58:0.00200056,24:1e-08)56:0.00100953)55:0.00210137)52:0.01233888)45:0.01906982)73:0.003562205)38;

[Figure: plot of the example gene tree]

spiral01
  • What are your criteria for determining if nodes are duplicates? Is it just distance between tips? If so, what's the threshold? Also, it will be easier for other people to help if you can provide the newick string for this tree. – C_Z_ Jul 25 '16 at 16:29
  • Hi, yes it is the distance between the tips. The threshold I am working with is 1e-05, although that is just arbitrary for now. – spiral01 Jul 25 '16 at 16:46

1 Answer

One option is to drop every tip whose terminal branch length is below the threshold.

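library(ape)

drop_dupes <- function(tree, thres = 1e-5) {
  # rows of the edge matrix whose child node is a tip
  tips <- which(tree$edge[, 2] <= Ntip(tree))
  # tip numbers whose terminal branch is shorter than the threshold
  toDrop <- tree$edge[tips, 2][tree$edge.length[tips] < thres]
  drop.tip(tree, toDrop)
}

# `tree` is the question's Newick string read with read.tree()
plot(drop_dupes(tree))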

[Figure: plot of the tree returned by drop_dupes(tree)]

C_Z_
  • Ah, of course, using edge.length! Thank you so much, this is exactly what I was looking for! – spiral01 Jul 25 '16 at 17:13
  • I think this does not do what the OP asked for: when the lengths of the edges leading to tips are below the threshold, the tips are dropped completely (e.g. the branch with tips 8, 9 and 10 is dropped without replacement). – al-ash Mar 29 '21 at 04:07
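
If one representative of each group of near-identical tips should be kept, as the question asks, a possible variant is to group tips by their pairwise distances from cophenetic and drop all but one tip per group. The following is only a minimal sketch under stated assumptions: the helper name collapse_dupes and the toy tree are hypothetical, tips are grouped by single-linkage clustering (so tips connected through a chain of distances at or below the threshold end up in the same group), and the representative is simply the first tip of each group.

```r
library(ape)

# Hypothetical helper: keep one representative per group of near-identical tips.
collapse_dupes <- function(tree, thres = 1e-5) {
  # pairwise tip-to-tip (patristic) distances
  d <- as.dist(cophenetic(tree))
  # single-linkage clusters, cut at the distance threshold
  groups <- cutree(hclust(d, method = "single"), h = thres)
  # every tip after the first one in its group is treated as redundant
  redundant <- names(groups)[duplicated(groups)]
  drop.tip(tree, redundant)
}

# Toy tree (made up for illustration): A and B are near-identical,
# D has a very short terminal branch but no close neighbour.
toy <- read.tree(text = "((A:1e-08,B:1e-08,C:0.001):0.002,(D:1e-08,E:0.003):0.001);")
collapsed <- collapse_dupes(toy)
print(collapsed$tip.label)
```

On the toy tree this keeps A, C, D and E: B is removed as a duplicate of A, while D stays because no other tip lies within the threshold of it, whereas a filter on terminal branch length alone would remove A, B and D.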