0

I'm trying to do this:

nmf.sub <- function(n){
sub.data.matrix <- data.matrix[, (index[n, ])] ## the index is a permutation of the original matrix at a 0.8 resampling proportion (doesn't really matter)
temp.result <- nmf(sub.data.matrix, rank = 2, seed = 12345) ## want to change 2 to i
return(temp.result)
}

class.list <- list()
for (i in nmf.rank){ ## nmf.rank is 2:4
results.list <- mclapply(mc.cores = 16, 1:resamp.iterations, function(n) nmf.sub(n)) ## resamp.iterations is 10, nmf.sub is defined above
}

But instead of having rank = 2 in the nmf for temp.result, I want to have rank = i

Any idea how I could pass it that parameter? Just passing it through mclapply as function(n, i) doesn't work.

1 Answers1

1

You seemingly have two loops: one for i in nmf.rank and one for n in 1:resamp.iterations. Therefore, you need to pass both i and n to nmf.sub e.g. like in:

nmf.sub <- function(n, i){
    ## the index is a permutation of the original matrix at a 0.8
    ## resampling proportion (doesn't really matter)
    sub.data.matrix <- data.matrix[, (index[n, ])] 
    ## want to change 2 to i
    temp.result <- nmf(sub.data.matrix, rank = i, seed = 12345)
    return(temp.result)
}


resamp.iterations <- 10
nmf.rank <- 2:4

res <- lapply(nmf.rank, function(i){
    results.list <- mclapply(mc.cores = 16, 1:resamp.iterations,
                             function(n) nmf.sub(n,i))
})
## then you can flatten/reshape res

Regarding your comment (below) about efficiency: the bulk of the numerical calculations is performed within the nmf() function, therefore the loop is properly set up, in the sense that each process/core gets a numerically intensive job. However, to speed up computation you might consider using the previously computed result, instead of the seed 12345 (unless using the latter seed is mandatory for some reason related to your problem). In the following example I get a 30-40% reduction in execution time:

library(NMF)
RNGkind("L'Ecuyer-CMRG") ## always use this when using mclapply()
nr <- 19
nc <- 2e2
set.seed(123)
data.matrix <- matrix(rexp(nc*nr),nr,nc)

resamp.iterations <- 10
nmf.rank <- 2:4

index <- t(sapply(1:resamp.iterations, function(n) sample.int(nc,nc*0.8)))


nmf.sub <- function(n, i){
    sub.data.matrix <- data.matrix[ ,index[n, ]] 
    temp.result <- nmf(sub.data.matrix, rank = i, seed = 12345)
    return(temp.result)
}

## version 1
system.time({
    res <- lapply(nmf.rank, function(i){
        results.list <- mclapply(mc.cores = 16, 1:resamp.iterations,
                                 function(n) nmf.sub(n,i))
    })
})

## version 2: swap internal and external loops
system.time({
    res <- 
        mclapply(mc.cores=16, 1:resamp.iterations, function(n){
            res2 <- nmf(data.matrix[ ,index[n, ]], rank=2, seed = 12345)
            res3 <- nmf(data.matrix[ ,index[n, ]], rank=3, seed = 12345)
            res4 <- nmf(data.matrix[ ,index[n, ]], rank=4, seed = 12345)
            list(res2,res3,res4)
        })
})

## version 3: use previous calculation as starting point
##   ==> 30-40% reduction in computing time
system.time({
    res <- 
        mclapply(mc.cores=16, 1:resamp.iterations, function(n){
            res2 <- nmf(data.matrix[ ,index[n, ]], rank=2, seed = 12345)
            res3 <- nmf(data.matrix[ ,index[n, ]], rank=3, seed = res2)
            res4 <- nmf(data.matrix[ ,index[n, ]], rank=4, seed = res3)
            list(res2,res3,res4)
        })
})
renato vitolo
  • 1,744
  • 11
  • 16
  • Worked like a treat! Thanks! Do you know if this is the most efficient way of carrying this out. I'm making the function using a dummy matrix, but the actual matrix will be something like 191 x 20000 with iterations set to 1000 – Yuriy Grabovsky Aug 09 '16 at 10:35
  • Please see my edits above; for best reproducibility I always use `RNGkind("L'Ecuyer-CMRG")` with mclapply() or other parallel libraries. – renato vitolo Aug 09 '16 at 14:13
  • The seed is just something we do for reproducibility because we're clustering biological data and we want some semblance of reproduction. Your additional comments are extremely helpful. I will add this to my code. – Yuriy Grabovsky Aug 11 '16 at 07:33
  • OK. I suspect that version 3 of the code should be reproducible anyway, because the seed is fixed to 12345 for the first step (calculation of `res2`) of each loop iteration. Success and break a leg with it. – renato vitolo Aug 12 '16 at 14:09