I'm hoping to get some advice about managing large sets of combinations in R.
I'm a plant breeding graduate student trying to find, in a population of 40 parents, the combinations of parents whose progeny have the highest average predicted value. I begin by creating a matrix that contains the values obtained by crossing these parents:
# fake data
B <- matrix(data = runif(1600, 1.0, 5.0), ncol = 40, nrow = 40)
diag(B) <- diag(B) - 1 # diagonals are plants crossed to themselves, which suffer inbreeding depression
I score a combination of parents by taking the mean of the corresponding submatrix of the value matrix ("perse.hybrid"):
SubsetWright <- function(perse.hybrid, subset) {
  return(mean(perse.hybrid[subset, subset]))
}
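For example, on the fake data above, the predicted offspring mean for an arbitrary three-parent combination is:

SubsetWright(B, c(1, 7, 23)) # mean of the 3 x 3 submatrix for parents 1, 7 and 23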
Ideally I would like to find the offspring values for all combinations of the 40 parents, but slightly more realistically I would like the values for combinations of 2 to 11 parents. That is around 3.5 billion combinations.
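For reference, that count comes from:

sum(choose(40, 2:11)) # 3533047531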
I've been working on speeding this up and on managing memory. For speed, I run the tasks in parallel on an Amazon EC2 cluster (typically 3 m4.10xlarge machines). For memory, I keep the results in a big.matrix. However, I appear to be bumping up against the combination-generating step itself (combnPrim in the code below): it typically crashes once I reach 40 choose 8, and watching htop suggests it is running out of memory.
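A back-of-the-envelope calculation makes the crash unsurprising. Assuming the combinations are stored as 4-byte integers, the dense matrix of combinations alone is:

choose(40, 8) * 8 * 4 / 2^30   # ~2.3 GiB for k = 8
choose(40, 11) * 11 * 4 / 2^30 # ~95 GiB for k = 11

and the t() call briefly holds a second copy of it.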
I'm a novice R user and don't understand exactly how memory is managed in R. It seems like I could get around these limits if I could somehow split up the combn step, which might both let me run it in parallel and avoid the memory limits. Or perhaps there is a way to populate the big.matrix with all of the combinations without using combn at all. Does anyone have any strategies to suggest? The code is below. Thank you so much!
#' Test all combinations of parents to find the sets of offspring with the highest estimated means.
#'
#' @param perse.hybrid A matrix of offspring values, with rows and columns indexed by parent ID
#' @param min The minimum number of parents to test combinations of
#' @param max The maximum number of parents to test combinations of
#' @param rows Number of top combinations to return; the default "all" returns every row
#' @param cl The parallel cluster to use
#' @return A big.matrix with parent combinations and predicted average offspring values
TestSyn <- function(perse.hybrid, min, max, rows = "all", cl) {
  # requires bigmemory (big.matrix, mpermute), parallel (clusterExport, parApply)
  # and gRbase (combnPrim)
  clusterExport(cl, list("SubsetWright"))
  n <- nrow(perse.hybrid)
  total <- sum(choose(n, min:max))
  start <- 1
  stop <- choose(n, min)
  syn.data <- big.matrix(nrow = total, ncol = max + 1)
  for (i in min:max) {
    # add parent combinations to syn.data. This seems to be what crashes when i gets large (> 7)
    inbreds <- t(combnPrim(1:n, i))
    syn.data[start:stop, 1:i] <- inbreds
    # add synthetic values (predicted offspring means) to syn.data
    syn.data[start:stop, max + 1] <- parApply(cl = cl, X = inbreds, MARGIN = 1,
                                              FUN = SubsetWright, perse.hybrid = perse.hybrid)
    start <- stop + 1
    stop <- start + choose(n, i + 1) - 1
  }
  # sort rows by predicted offspring average, best first
  mpermute(x = syn.data, cols = max + 1, decreasing = TRUE)
  if (rows == "all") rows <- nrow(syn.data)
  return(syn.data[1:rows, ])
}
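For completeness, here is roughly how I call it. This is a sketch with a small local cluster and a low max so it runs in memory; on EC2 the cl object comes from the cluster setup instead:

library(parallel)
library(bigmemory)
library(gRbase) # for combnPrim

cl <- makeCluster(detectCores())
top10 <- TestSyn(B, min = 2, max = 5, rows = 10, cl = cl)
stopCluster(cl)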
EDIT:
With more investigating, it looks like combinadics could be part of a solution. Stack Overflow post here: Generating a very large matrix of string combinations using combn() and bigmemory package
I'll do some testing to see if this holds up with large numbers.
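The idea I plan to test: unrank combinations directly, i.e. compute the combination sitting at a given lexicographic index, so that each worker can generate its own chunk of rows and nothing ever has to hold all 3.5 billion combinations at once. A minimal sketch in base R (comb_unrank and comb_chunk are names I made up; ranks are kept as doubles because 40 choose 11 exceeds .Machine$integer.max):

# Return the rank-th k-combination of 1:n in lexicographic order (rank is 1-based)
comb_unrank <- function(rank, n, k) {
  comb <- integer(k)
  r <- rank - 1 # 0-based remainder
  x <- 1        # smallest candidate for the next element
  for (i in 1:k) {
    # skip whole blocks of combinations whose i-th element is x
    repeat {
      block <- choose(n - x, k - i)
      if (r < block) break
      r <- r - block
      x <- x + 1
    }
    comb[i] <- x
    x <- x + 1
  }
  comb
}

# One row per combination, for ranks from:to -- the chunk a single worker would fill
comb_chunk <- function(from, to, n, k) {
  t(vapply(from:to, comb_unrank, integer(k), n = n, k = k))
}

# sanity check against combn on a small case
all(comb_chunk(1, choose(10, 4), 10, 4) == t(combn(10, 4))) # TRUE

If this works, a worker assigned ranks a to b would call comb_chunk(a, b, 40, k) and fill only the matching slice of the big.matrix, so combn/combnPrim never has to materialize everything at once.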