1

I'm hoping to get some advice about managing large sets of combinations in R.

I'm a plant breeding graduate student attempting to calculate the highest ranked average values of the progenies of various combination of 40 parents in a plant population. I begin by creating a matrix that contains the values obtained by crossing these parents:

# fake data
B <- matrix (data=runif(1600, 1.0, 5.0),ncol=40,nrow=40)
diag(B) <- diag(B) - 1 # diagonals are when plants are crossed to themselves and suffer inbreeding depression

I do this by finding the average of a subset of the matrix ("perse.hybrid") containing the various combinations of parents:

SubsetWright <- function (perse.hybrid, subset) {
  return (mean(perse.hybrid[subset,subset]))
}

Ideally I would like to find the values of the offspring for all combinations of 40 parents, but slightly more realistically I would like to find the values for combination of 2 to 11 parents. That is around 3.5 billion combinations.

I've been working on speeding this up and also managing the memory. To speed it up, I have set it to run the tasks in parallel on an Amazon EC2 cluster (typically 3 m4.10xlarge machines). To address the memory challenge, I have tried to keep the data in big.matrix. However, I appear to be bumping up against combn. Typically when I get to 40 choose 8, it crashes. Watching htop, I believe that's because of the memory use.

I'm a novice R user and do not understand exactly how memory is managed in R. It seems like I could get these limits if I could somehow split up the combn function, which might both allow me to run it in parallel and to avoid the memory limits. Or perhaps is there a way to populate the big.matrix with all of the combinations without using combn. Does anyone have any strategies to suggest? The code is below. Thank you so much!

#' Test all combinations of parents to find set of offspring with highest estimated mean.
#'
#' @param perse.hybrid  A matrix of offspring values, with row[i]=col[j]=parent ID 
#' @param min The minimum number of parents to test combinations of
#' @param max The maximum number of parents to test combinations of
#' @param rows Number of rows of top combinations to return, default is to return all rows
#' @param cl cluster to use
#' @return A big.matrix with parent combinations and predicted average offspring values 
TestSyn <- function (perse.hybrid, min, max, rows="all", cl) {

      clusterExport(cl, list("SubsetWright"))

      total <- sum(apply(X=array(min:max),MARGIN=1,FUN=choose,n=nrow(perse.hybrid)))
      n <- nrow(perse.hybrid)
      start <- 1
      stop <- choose(n,min)

      syn.data <- big.matrix(nrow=total,ncol=max+1)

      for (i in min:max)
      {

        #add inbred numbers to syn.data. This seems to be what crashes when i gets large (>7)
        inbreds <- t(combnPrim(1:n,i))
        syn.data[start:stop,1:i] <- inbreds

        #add sythetic values to syn.data
        syn.data[start:stop,max+1]  <- parApply(cl=cl,X=inbreds,MARGIN=1,FUN=SubsetWright,perse.hybrid=perse.hybrid)

                     start <- stop + 1
                     stop <- start + choose(n,i+1) - 1

      }

      # sort by offspring average
      mpermute(x=syn.data,cols=max+1,decreasing=TRUE)

      if (rows == "all") rows <- nrow(syn.data)

      return (syn.data[1:rows,])
    }

EDIT:

With more investigating it looks like combinadics could be part of a solution. Stackoverflow post here: Generating a very large matrix of string combinations using combn() and bigmemory package

I'll do some testing to see if this holds up with large numbers.

Community
  • 1
  • 1
  • Two problems I think you're running into is speed and R's pass-by-value (versus pass-by-reference), both of which can be addressed with (for example) Rcpp ([CRAN](https://cran.r-project.org/web/packages/Rcpp/index.html) and [own website](http://rcpp.org)). Though it is C++, Dirk, Romain, and company have made considerable effort (and success, IMO) in bridging the gap for R programmers. – r2evans Sep 18 '15 at 23:26
  • 2
    Something else you might want to consider is something I'll borrow from python: [generators](https://wiki.python.org/moin/Generators). Instead of creating a huge list of combinations before even doing anything, create them *as needed*. This extends your code a little bit but will pay dividends in situations like this. – r2evans Sep 18 '15 at 23:27

0 Answers0