0

This question stems directly from a previous one I asked here:

Phylogenetics in R: collapsing descendant tips of an internal node

((((11:0.00201426,12:5e-08,(9:1e-08,10:1e-08,8:1e-08)40:0.00403036)41:0.00099978,7:5e-08)42:0.01717066,(3:0.00191517,(4:0.00196859,(5:1e-08,6:1e-08)71:0.00205168)70:0.00112995)69:0.01796015)43:0.042592645,((1:0.00136179,2:0.00267375)44:0.05586907,(((13:0.00093161,14:0.00532243)47:0.01252989,((15:1e-08,16:1e-08)49:0.00123243,(17:0.00272478,(18:0.00085725,19:0.00113572)51:0.01307761)50:0.00847373)48:0.01103656)46:0.00843782,((20:0.0020268,(21:0.00099593,22:1e-08)54:0.00099081)53:0.00297097,(23:0.00200672,(25:1e-08,(36:1e-08,37:1e-08,35:1e-08,34:1e-08,33:1e-08,32:1e-08,31:1e-08,30:1e-08,29:1e-08,28:0.00099682,27:1e-08,26:1e-08)58:0.00200056,24:1e-08)56:0.00100953)55:0.00210137)52:0.01233888)45:0.01906982)73:0.003562205)38;

I have created this tree from a gene tree in R, using the midpoint function to root it. Now I apply the following function to it:

drop_dupes <- function(tree,thres=1e-05){
  tips <- which(tree$edge[,2] %in% 1:Ntip(tree))
  toDrop <- tree$edge.length[tips] < thres
  newtree <- drop.tip(tree,tree$tip.label[toDrop])
  return(newtree)
}

Now my issue is that when I apply this function I do not get the tree I expected. The output when plotted looks like this:

enter image description here

However, if I write the tree out to a text file using write.tree and then read it in again as a Newick string, when I then apply the drop_dupes function above, I get a different tree, as shown in the below plot:

enter image description here

And this is what is confusing me. Why am I getting two different outputs when applying the same function to the very same tree, with the only difference being if it's read in or it's already a variable in R?

Edit: Here are some further details. The initial gene tree is as follows:

(B.weihenstephanensis.KBAB4.ffn:0.00136179,B.weihenstephanensisWSBC10204.ffn:0.00267375,(((B.cereus.NJW.ffn:0.00191517,(B.thuringiensis.HS181.ffn:0.00196859,(B.thuringiensis.Bt407.ffn:0.00000001,B.thuringiensis.chinensisCT43.ffn:0.00000001)0.879000:0.00205168)0.738000:0.00112995)0.969000:0.01796015,(B.cereus.FORC013.ffn:0.00000005,((B.thuringiensis.galleriaeHD29.ffn:0.00000001,(B.thuringiensis.kurstakiYBT1520.ffn:0.00000001,B.thuringiensis.YWC28.ffn:0.00000001)0.000000:0.00000001)0.971000:0.00403036,(B.cereus.ATCC14579.ffn:0.00201426,B.thuringiensis.tolworthi.ffn:0.00000005)0.000000:0.00000005)0.377000:0.00099978)0.969000:0.01717066)1.000000:0.04615485,(((B.cereus.FM1.ffn:0.00093161,B.cereus.FT9.ffn:0.00532243)0.990000:0.01252989,((B.cereus.AH187.ffn:0.00000001,B.cereus.NC7401.ffn:0.00000001)0.694000:0.00123243,(B.thuringiensis.finitimusYBT020.ffn:0.00272478,(B.cereus.ATCC10987.ffn:0.00085725,B.cereus.FRI35.ffn:0.00113572)0.994000:0.01307761)0.973000:0.00847373)0.972000:0.01103656)0.863000:0.00843782,((B.thuringiensis.9727.ffn:0.00202680,(B.cereus.03BB102.ffn:0.00099593,B.cereus.D17.ffn:0.00000001)0.741000:0.00099081)0.822000:0.00297097,(B.cereus.E33L.ffn:0.00200672,(B.cereus.S28.ffn:0.00000001,(B.thuringiensis.HD1011.ffn:0.00000001,(B.anthracis.Vollum1B.ffn:0.00000001,(B.anthracis.Turkey32.ffn:0.00000001,(B.anthracis.RA3.ffn:0.00099682,(B.anthracis.Pasteur.ffn:0.00000001,(B.anthracis.Larissa.ffn:0.00000001,(B.anthracis.Cvac02.ffn:0.00000001,(B.anthracis.CDC684.ffn:0.00000001,(B.anthracis.BFV.ffn:0.00000001,(B.anthracis.Ames.ffn:0.00000001,(B.anthracis.A16R.ffn:0.00000001,(B.anthracis.A0248.ffn:0.00000001,B.anthracis.A1144.ffn:0.00000001)0.000000:0.00000001)0.000000:0.00000001)0.000000:0.00000001)0.000000:0.00000001)0.000000:0.00000001)0.000000:0.00000001)0.000000:0.00000001)0.000000:0.00000001)0.000000:0.00000005)0.000000:0.00000005)0.956000:0.00200056)0.000000:0.00000006)0.805000:0.00100953)0.809000:0.00210137)0.957000:0.01233888)0.929000:0.01906982)1.000000:0.05586907);

I read it into R as tree1. I then use the following code:

#Function to label nodes and tips as sequential integers
sort.names <- function(tr){
  tr$node.label<-(length(tr$tip.label) + 1):(length(tr$tip.label)+ tr$Nnode)
  ##some of these are tips, some are nodes, need to treat differently
  tr$tip.label<-1:(length(tr$tip.label))
  return(tr)
}

#Function to check if tree is rooted, and if it is not to use midpoint #rooting
rootCheck <- function(tree){
  if(is.rooted(tree) == FALSE){
    rootedTree <- midpoint(tree)
  }
  return(rootedTree)
}

#The above mentioned function to remove duplicate tips
drop_dupes <- function(tree,thres=1e-05){
  tips <- which(tree$edge[,2] %in% 1:Ntip(tree))
  toDrop <- tree$edge.length[tips] < thres
  newtree <- drop.tip(tree,tree$tip.label[toDrop])
  return(newtree)
}

#Use functions on tree
a <- rootCheck(tree1)
b <- sort.names(a)
c <- di2multi(b, tol = 1e-05)
d <- drop_dupes(c)

Now at this point if I plot tree d, I will get the first plot above. However, if I write tree c to a text file, then read it back in and and then use the drop_dupes function on it, I will get the latter tree.

I have checked the newick file of tree c against the Newick tree at the top of the page and it is definitely the same.

Community
  • 1
  • 1
spiral01
  • 545
  • 2
  • 17

1 Answers1

1

The problem is in the sort.names function. It effectively rearranges the way the tree is written, and indexing then refers to other nodes than it should.

If you need the tip.labels numbered, why not label them separately?

tax.num <- data.frame(taxa = tree1$tip.label, 
    numbers = 1:(length(tree1$tip.label)))

a <- midpoint(drop_dupes(tree1))
a$tip.label <- tax.num$number[match(a$tip.label, tax.num$taxa)]

plot(a)

Also, di2multi seems to be redundant. It creates polytomies where the goal, as I understand it, is to discard tips with short branches.

nya
  • 2,138
  • 15
  • 29
  • Thanks, this resolves the issue perfectly. Just as a further point, as I need the nodes and tips labelled iteratively for later analysis, I would use a <- midpoint(drop_dupes(tree1)) first, and then create the data frame so that it only contains the remaining tips, and I can then add the node labels for a continuous sequence from tips to nodes. – spiral01 Jul 27 '16 at 17:14
  • 1
    @spiral01 As these are gene trees, I assume that you will have variable taxa sets between trees. Make sure to store the tax.num table for each gene tree to be able to check identity later on. – nya Jul 27 '16 at 17:17