I have a data.frame object. For a simple example:

> data.frame(x=c('A','A','B','B','B'), y=c('Ab','Ac','Ba', 'Ba','Bd'), z=c('Abb','Acc','Bad', 'Bae','Bdd'))
  x  y   z
1 A Ab Abb
2 A Ac Acc
3 B Ba Bad
4 B Ba Bae
5 B Bd Bdd

The actual data has many more rows and columns. How could I create a nested tree structure, such as a dendrogram object, like this:

         |---Ab---Abb
     A---|
     |   |---Ac---Acc
   --|                 /--Bad 
     |   |---Ba-------|
     B---|             \--Bae
         |---Bd---Bdd
RNA
  • Do you want to plot it or to have an object? If it's the latter, then why not just a list, `split(df$y, df$x)`? Or do you require one of those graph/tree packages? – Arun Mar 11 '13 at 16:15
  • I actually want both, so split(df$y, df$x) doesn't work very well. dendrogram object seems a good one. sorry I didn't make clearer. – RNA Mar 11 '13 at 16:28
  • have you checked the [**`ggdendro`**](https://github.com/andrie/ggdendro) package from @Andrie. – Arun Mar 11 '13 at 17:35
  • yes, briefly. As far as I understand, it's mainly a dendrogram plotting library. – RNA Mar 11 '13 at 22:18
  • @RNAer can you give example of third level? – CHP Mar 12 '13 at 02:58
  • @RNAer so don't you essentially just want to split the last column, as it contains all the info of its "parents"? – CHP Mar 12 '13 at 04:52
  • @RNAer something like `do.call(rbind, strsplit(as.character(df[,ncol(df)]), split=""))` – CHP Mar 12 '13 at 04:56

2 Answers

data.frame to Newick

I did my PhD in computational phylogenetics, and somewhere along the way I wrote this code, which I used once or twice when I got data in this format (nonstandard in the phylogenetic sense). The script traverses the data frame as if it were a tree and pastes the labels into a Newick string along the way. Newick is a standard format that can then be converted into many kinds of tree objects.

I guess the script could be optimized (I used it so rarely that more work on it would have reduced the overall efficiency), but it is better to share it than to let it collect dust lying around on my hard drive.

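    ## Recursion function: returns the Newick fragment for value `a` in column `i`.
    ## Note that `df` is a free variable here, not an argument.
    traverse <- function(a, i, innerl) {
        if (i < ncol(df)) {
            ## distinct children of `a` in the next column
            alevelinner <- as.character(unique(df[which(as.character(df[, i]) == a), i + 1]))
            desc <- NULL
            if (length(alevelinner) == 1) {
                ## single child: skip this inner node and descend directly
                newickout <- traverse(alevelinner, i + 1, innerl)
            } else {
                for (b in alevelinner) desc <- c(desc, traverse(b, i + 1, innerl))
                il <- NULL
                if (innerl) il <- a
                newickout <- paste("(", paste(desc, collapse = ","), ")", il, sep = "")
            }
        } else {
            ## last column reached: `a` is a leaf label
            newickout <- a
        }
        newickout
    }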

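    ## data.frame to Newick function
    df2newick <- function(df, innerlabel = FALSE) {
        ## local copy of traverse() whose free variable `df` resolves to this
        ## function's argument instead of a global object named `df`
        environment(traverse) <- environment()
        alevel <- as.character(unique(df[, 1]))
        newick <- NULL
        for (x in alevel) newick <- c(newick, traverse(x, 1, innerlabel))
        paste("(", paste(newick, collapse = ","), ");", sep = "")
    }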

The main function df2newick() takes two arguments:

  • df which is the dataframe to be transformed (object of class data.frame)
  • innerlabel which tells the function whether to write labels for inner nodes (logical, default FALSE)

To demonstrate it on your example:

    df <- data.frame(x=c('A','A','B','B','B'), y=c('Ab','Ac','Ba', 'Ba','Bd'), z=c('Abb','Acc','Bad', 'Bae','Bdd'))
    myNewick <- df2newick(df)
    #[1] "((Abb,Acc),((Bad,Bae),Bdd));"

Now you can read it into an object of class phylo with read.tree() from ape:

    library(ape)
    mytree <- read.tree(text=myNewick)
    plot(mytree)
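
Since the question asks for a dendrogram, here is a minimal sketch of one possible route from the phylo object to a dendrogram. It assumes the tree is strictly binary and rooted, which holds for this example but not for data frames in general. The Newick string carries no branch lengths, so the sketch assigns arbitrary ones with compute.brlen() from ape; they only serve to make the tree ultrametric, which the as.hclust() conversion requires, and they have no meaning of their own.

```r
library(ape)

# Newick string as produced by df2newick(df) for the example data
mytree <- read.tree(text = "((Abb,Acc),((Bad,Bae),Bdd));")

# The tree has no branch lengths; assign arbitrary ultrametric ones
# (Grafen's method) purely so that the conversion below is possible
ultra <- compute.brlen(mytree, method = "Grafen")

# phylo -> hclust -> dendrogram; this step assumes a binary, rooted tree
dend <- as.dendrogram(as.hclust(ultra))
plot(dend)
```

A node with more than two children would make as.hclust() fail, so for such trees the phylo object itself is probably the more practical representation.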

If you want to add inner node labels to the Newick string, you can use this:

    myNewick <- df2newick(df, TRUE)
    #[1] "((Abb,Acc)A,((Bad,Bae)Ba,Bdd)B);"
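
To check that the inner labels survive, the labelled string can be read back with ape and plotted with the labels shown. This is only a small sketch: the Newick string is pasted in literally so that the block runs on its own, and the variable name labelled is just an illustration.

```r
library(ape)

# Newick string as produced by df2newick(df, TRUE) for the example data
labelled <- read.tree(text = "((Abb,Acc)A,((Bad,Bae)Ba,Bdd)B);")

# inner node labels are stored separately from the tip labels
print(labelled$node.label)
print(labelled$tip.label)

# draw the tree with the inner labels shown at the nodes
plot(labelled, show.node.label = TRUE)
```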

Hope this is useful (and maybe my PhD wasn't a complete waste of time ;-)


Additional note for your dataframe format:

As you can see, the df2newick function ignores inner nodes with only one child (which suits most phylogenetic methods anyway, and was the only case relevant to me). The df objects that I originally got and used with this script had this format:

    df <- data.frame(x=c('A','A','B','B','B'), y=c('Abb','Acc','Ba', 'Ba','Bdd'), z=c('Abb','Acc','Bad', 'Bae','Bdd'))

Very similar to yours, but there the single-child inner nodes simply had the same names as their children. You have distinct names for these nodes too, and those names get ignored. That might not matter, but if it does you can disable part of the recursion function, like this:

    traverse <- function(a, i, innerl) {
        if (i < ncol(df)) {
            alevelinner <- as.character(unique(df[which(as.character(df[, i]) == a), i + 1]))
            desc <- NULL
            ## single-child shortcut disabled, so every inner node is kept:
            ##if(length(alevelinner) == 1) (newickout <- traverse(alevelinner,i+1,innerl))
            ##else {
            for (b in alevelinner) desc <- c(desc, traverse(b, i + 1, innerl))
            il <- NULL
            if (innerl) il <- a
            newickout <- paste("(", paste(desc, collapse = ","), ")", il, sep = "")
            ##}
        } else {
            ## last column reached: `a` is a leaf label
            newickout <- a
        }
        newickout
    }

and you would get something like this:

    [1] "(((Abb)Ab,(Acc)Ac)A,((Bad,Bae)Ba,(Bdd)Bd)B);"

This looks odd to me, but I add it just in case, because it now includes all the information from your original data frame.

Martin Turjak

I don't know much about the internal structure of dendrograms in R, but the following code creates a nested list structure with the hierarchy that I think you are looking for:

    stree = function(x, level=0) {
        # x is a character vector
        # the result is a hierarchical structure of lists (that contain lists, etc.)
        # the names of the lists are the node values

        level = level+1
        if (length(x)==1) {
            # only one string left: the rest of it becomes a single node
            result = list()
            result[[substring(x[1],level)]] = list()
            return(result)
        }
        result = list()
        this.level = substring(x,level,level)
        next.levels = unique(this.level)
        for (p in next.levels) {
            if (p=="") {
                result$p = list()
            } else {
                ids = which(this.level==p)
                result[[p]] = stree(x[ids],level)
            }
        }
        result
    }

It operates on a vector of strings, so for your data frame you would call stree(as.character(df[,3])).
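
If the hierarchy is carried by the columns rather than by the characters of the last column, a similar nested list can be sketched by splitting recursively on each column in turn. This is a different technique from stree() above, shown only as an illustration; df2nested is a hypothetical helper name, not part of any package.

```r
# Hypothetical helper: turn a data frame whose columns run from root to leaf
# into a nested list. Inner nodes become named lists; the last column is
# returned as a character vector of leaf labels.
df2nested <- function(df) {
    # only the leaf column is left: return its distinct labels
    if (ncol(df) == 1) return(unique(as.character(df[[1]])))
    # group the remaining columns by the values of the first column
    groups <- split(df[, -1, drop = FALSE], as.character(df[[1]]))
    # recurse into each group
    lapply(groups, df2nested)
}

df <- data.frame(x = c('A','A','B','B','B'),
                 y = c('Ab','Ac','Ba','Ba','Bd'),
                 z = c('Abb','Acc','Bad','Bae','Bdd'))
tree <- df2nested(df)
str(tree)
```

Individual branches can then be reached by name, for example tree$B$Ba for the leaves below Ba.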

Hope this helps.

amit