4

My problem: need to find all disjoint (non-overlapping) sets from a set of sets.

Background: I am using comparative phylogenetic methods to study trait evolution in birds. I have a tree with ~300 species. This tree can be divided into subclades (i.e. subtrees). If two subclades do not share species, they are independent. I'm looking for an algorithm (and an R implementation if possible) that will find all possible subclade partitions where each subclade has greater than 10 taxa and all are independent. Each subclade can be considered a set and when two subclades are independent (do not share species) these subclades are then disjoint sets.

Hope this is clear and someone can help.

Cheers, Glenn

The following code produces an example dataset. Where subclades is a list of all possible subclades (sets) from which I'd like to sample X disjoint sets, where the length of the set is Y.

###################################
# Example Dataset
###################################

    library(ape)
    library(phangorn)
    library(TreeSim)
    library(phytools)

    ##simulate a tree

    n.taxa <- 300
    tree <- sim.bd.taxa(n.taxa,1,lambda=.5,mu=0)[[1]][[1]]
    tree$tip.label <- seq(n.taxa)

    ##extract all monophyletic subclades

    get.all.subclades <- function(tree){
    tmp <- vector("list")
    nodes <- sort(unique(tree$edge[,1]))
    i <- 282
    for(i in 1:length(nodes)){
    x <- Descendants(tree,nodes[i],type="tips")[[1]]
    tmp[[i]] <- tree$tip.label[x]
    }
    tmp
    }
    tmp <- get.all.subclades(tree)

    ##set bounds on the maximum and mininum number of tips of the subclades to include

    min.subclade.n.tip <- 10
    max.subclade.n.tip <- 40


    ##function to replace trees of tip length exceeding max and min with NA

    replace.trees <- function(x, min, max){
    if(length(x) >= min & length(x)<= max) x else NA
    }


    #apply testNtip across all the subclades

    tmp2 <- lapply(tmp, replace.trees, min = min.subclade.n.tip, max = max.subclade.n.tip)

    ##remove elements from list with NA, 
    ##all remaining elements are subclades with number of tips between 
##min.subclade.n.tip and max.subclade.n.tip

    subclades <- tmp2[!is.na(tmp2)]

    names(subclades) <- seq(length(subclades))
user1322491
  • 41
  • 1
  • 4
  • So you consider each subtree as a subset, and you want to find a collection of subsets such that *each* elements is contained exactly once; or do you not care if some elements are left out? – Tom May 21 '13 at 16:45
  • 2
    If you mean the "exactly once" interpretation, then this is the [Exact cover problem](http://en.wikipedia.org/wiki/Exact_cover_problem). It is NP-complete, which means you won't do much better than simply backtracking and testing all possibilities. – ffao May 21 '13 at 16:51
  • 1
    I was gonna say,sounds like a problem for http://en.wikipedia.org/wiki/Knuth%27s_Algorithm_X . – Tom May 21 '13 at 16:52
  • 1
    If your dataset is not terribly large, then start by collecting all `length(clade.x) >=10` and then run `intersect` on all combinations of that subset. – Carl Witthoft May 21 '13 at 16:59
  • Agrred with @CarlWitthoft, you can do something like `i <- 3; j <- 4; length(intersect(A[[i]],A[[j]]))>0` – Stéphane Laurent May 21 '13 at 17:08
  • Hi Tom, Each element (species) of the main set (the original tree) does not have to be included exactly once in the subsets. i.e. the resulting tree of any given set of subtrees may be paraphyletic. – user1322491 May 21 '13 at 17:54
  • Nthing that you want Algorithm X, with the simple extension to packing constraints (at most one instead of exactly one). – David Eisenstat May 21 '13 at 18:39

2 Answers2

2

Here's an example of how you might test each pair of list elements for zero overlap, extracting the indices of all non-overlapping pairs.

findDisjointPairs <- function(X) {
    ## Form a 2-column matrix enumerating all pairwise combos of X's elements
    ij <- t(combn(length(X),2))    
    ## A function that tests for zero overlap between a pair of vectors
    areDisjoint <- function(i, j) length(intersect(X[[i]], X[[j]])) == 0     
    ## Use mapply to test for overlap between each pair and extract indices 
    ## of pairs with no matches
    ij[mapply(areDisjoint, ij[,1], ij[,2]),]
}

## Make some reproducible data and test the function on it
set.seed(1)
A <- replicate(sample(letters, 5), n=5, simplify=FALSE)    
findDisjointPairs(A)
#      [,1] [,2]
# [1,]    1    2
# [2,]    1    4
# [3,]    1    5
Josh O'Brien
  • 159,210
  • 26
  • 366
  • 455
  • He still needs to filter on his desired element size, but otherwise this looks very nice. – Carl Witthoft May 21 '13 at 19:03
  • 1
    @CarlWitthoft -- True. That's trivial enough, though, that I'll let the OP add it to his/her workflow wherever it best fits. (One could always just prepend the single line in `areDisjoint`'s body with `length(X[[i]] >= 10 & length(X[[j]]) >= 10 &`, but I've no idea whether this function needs to be optimized for speed, in which case I'd do something more complicated.) – Josh O'Brien May 21 '13 at 19:16
  • Hi @JoshO'Brien. Thanks for the comments and the function, works great. Now the issue is finding all possible combinations from the universe of sets. When the number of possible combinations to calculate gets large, e.g. choose(50,20), my computer doesn't have enough memory. Is there a way to randomly sample combinations from a very large list of possible combinations? Something like combn(50,20). If I could do that, then I'd just randomly select combinations of sets until the loop accrued a sufficient number of random sets of independent subclades for my analyses. Thanks! – user1322491 May 23 '13 at 15:32
  • @user1322491 -- If I understand your question, here's an idea that should help. `replicate(5, sample(1:50, size=20))` (This example gives you five sets of 20 integers selected from between 1 and 50, which you can then use as indices into your list of subsets.) – Josh O'Brien May 23 '13 at 15:44
1

Here are some functions that might be useful.

The first computes all possible disjoint collections of a list of sets.

I'm using "collection" instead of "partition" beacause a collection does not necessarily covers the universe (i. e., the union of all sets).

The algorithm is recursive, and only works for a small number of possible collections. This does not necessarily means that it won't work with a large list of sets, since the function removes the intersecting sets at every iteration.

If the code is not clear, please ask and I'll add comments.

The input must be a named list, and the result will be a list of collection, which is a character vector indicating the names of the sets.

DisjointCollectionsNotContainingX <- function(L, branch=character(0), x=numeric(0))
{
    filter <- vapply(L, function(y) length(intersect(x, y))==0, logical(1))

    L <- L[filter]

    result <- list(branch)

    for( i in seq_along(L) )
    {
        result <- c(result, Recall(L=L[-(1:i)], branch=c(branch, names(L)[i]), x=union(x, L[[i]])))
    }

    result
}

This is just a wrapper to hide auxiliary arguments:

DisjointCollections <- function(L) DisjointCollectionsNotContainingX(L=L)

The next function can be used to validade a given list of collections supposedly non-overlapping and "maximal".

For every collection, it will test if
1. all sets are effectively disjoint and
2. adding another set either results in a non-disjoint collection or an existing collection:

ValidateDC <- function(L, DC)
{
    for( collection in DC )
    {
        for( i in seq_along(collection) )
        {
            others <- Reduce(f=union, x=L[collection[-i]])

            if( length(intersect(L[collection[i]], others)) > 0 ) return(FALSE)
        }

        elements <- Reduce(f=union, x=L[collection])

        for( k in seq_along(L) ) if( ! (names(L)[k] %in% collection) )
        {
            if( length(intersect(elements, L[[k]])) == 0 )
            {
                check <- vapply(DC, function(z) setequal(c(collection, names(L)[k]), z), logical(1))

                if( ! any(check) ) return(FALSE)
            }
        }
    }

    TRUE
}

Example:

L <- list(A=c(1,2,3), B=c(3,4), C=c(5,6), D=c(6,7,8))

> ValidateDC(L,DisjointCollections(L))
[1] TRUE

> DisjointCollections(L)
[[1]]
character(0)

[[2]]
[1] "A"

[[3]]
[1] "A" "C"

[[4]]
[1] "A" "D"

[[5]]
[1] "B"

[[6]]
[1] "B" "C"

[[7]]
[1] "B" "D"

[[8]]
[1] "C"

[[9]]
[1] "D"

Note that the collections containing A and B simultaneously do not show up, due to their non-null intersection. Also, collections with C and D simultaneously don't appear. Others are OK.

Note: the empty collection character(0) is always a valid combination.

After creating all possible disjoint collections, you can apply any filters you want to proceed.


EDIT:

  1. I've removed the line if( length(L)==0 ) return(list(branch)) from the first function; it's not needed.

  2. Performance: If there is considerable overlapping among sets, the function runs fast. Example:

    set.seed(1)

    L <- lapply(1:50, function(.)sample(x=1200, size=20))

    names(L) <- c(LETTERS, letters)[1:50]

    system.time(DC <- DisjointCollections(L))

Result:

#   user  system elapsed 
#   9.91    0.00    9.92

Total number of collections found:

> length(DC)
[1] 121791
Ferdinand.kraft
  • 12,579
  • 10
  • 47
  • 69
  • Thanks for the functions. I gave them a try and they work great for a small list of sets, length < 15. I tried it on a large list, length > 40, and R crashed. I'll likely be scaling up my analyses, and will need a function that can handle a list of sets of length >50. I should clarify that I am most interested in randomly sampling from the list of disjoint collections of length > X. X is probably going to be 8 or more. I admit I don't quite understand how your Recall() function works, but is there anyway to modify it sample only disjoint collections of length > X? Thanks! Glenn – user1322491 May 22 '13 at 18:12
  • Hi Glenn @user1322491. `Recall` is shorthand for the function it's called within. It's the idiom for a recursive function. These funcitions can be optimized a little further. But I thought they would work fine for a large list of sets *with considerable overlapping between set*. Unfortunately, it seems this is not the case. Some questions: 1. Is the overlapping rare in your sets? 2. R hangs or just runs out of memory for a list > 40? 3. By randomly sampling do you mean a equiprobable sample over the set of possible combinations? – Ferdinand.kraft May 22 '13 at 20:31
  • Hi Ferdinand @Ferdinand.kraft 1. Within any given combination of sets, its more likely that there will be overlap than not, but I'd be hard pressed to give you the actual proportion. 2. When I applied your function to a set of 45 sets of variable length and overlap, R stopped being responsive. I couldn't even click on the application to Force Quit (I use a Mac). I eventually had to restart my computer manually. – user1322491 May 23 '13 at 15:19
  • I'm a bit curious about that. Can you upload your list and post a link? Are you sure you have a *named* list of *numeric* vectors? BTW, I'm working on a different approach meanwhile :-) – Ferdinand.kraft May 23 '13 at 16:34
  • Hi Ferdinand, I've modified my original post. It now includes code to create a dummy dataset that creates a named list of numeric vectors. To get it to work you'll have to load the phylogenetics packages listed at the beginning. Each numeric vector is a vector of species names from a monophyletic subclade and the list is all the monophyletic subclades greater than 10 species and less than 40 species. Thanks for your help! – user1322491 May 23 '13 at 17:05
  • I've tried your list and I get `node stack overflow` after a few hours :-). It seems there is little overlapping among your sets, and the number of possible collections is large. But if we reduce the number of sets a little, we get it to work: using L=subclades[1:30], I got about 1 million collections, in less than 5 minutes. – Ferdinand.kraft May 25 '13 at 02:06