0

I was hoping that somebody could take a quick look at this example and help me find a better-more efficient way to approach this problem. I want to run a simulation to examine how an animal moves between sites following a set of specific conditions. I have 5 sites and a some initial probabilities,

N<-5 # number of sites
sites<-LETTERS[seq(from=1,to=N)]
to.r<-rbind(sites)

p.move.r<-seq.int(0.05,0.95,by=0.1) # prob of moving to a new site
p.leave<-0.01*p.move.r # prob of leaving the system w/out returning
p.move.out<-0.01*p.move.r # prob of moving in/out
p.stay<-1-(p.move.r+p.leave+p.move.out) # prob of staying in the same site 

For this example, I only included 50 simulations, but in reality I would like to have at least 1000 simulations,

set.seed(13973)

reps<-50 # number of replicates/simulations
steps<-100 # number of time steps (hours, days, weeks, etc)
random<-runif(10000,0,1) # generating numbers from a random distribution

# Construct empty df to fill with data

rep.movements<-matrix(NA,nrow=reps,ncol=steps)
colnames(rep.movements)<-c(1:steps);rownames(rep.movements)<-c(1:reps)

rep.use<-matrix(NA,nrow=reps,ncol=N)
colnames(rep.use)<-c(reefs);rownames(rep.use)<-c(1:reps)

# Outer loop to run each of the initial parameters

for(w in 1:length(p.stay)){
     p.move<-matrix((p.move.r[w]/(N-1)),N,N)
     diag(p.move)<-0

# Construction of distance matrix
move<-matrix(c(0),nrow=(N+2),ncol=(N+2),dimnames=list(c(sites,"NA","left"),c(sites,"NA","left")))
from<-array(0,c((N+2),(N+2)),dimnames=list(c(sites,"NA","left"),c(sites,"NA","left")))
to<-array(0,c((N+2),(N+2)),dimnames=list(c(sites,"NA","left"),c(sites,"NA","left")))

# Filling movement-Matrix construction

for(from in 1:N){
    for(to in 1:N){
      if(from==to){move[from,to]<-p.stay[w]} else {move[from,to]<-p.move[from,to]}
      move[,(N+1)]<-(1-(p.leave[w]+p.move.out[w]))/N
      move[,(N+2)]<-(1-(p.leave[w]+p.move.out[w]))/N
      move[(N+1),]<-p.move.out[w]
      move[(N+2),]<-p.leave[w] 
}

}

The idea is to use this cumulative probability matrix to determine the fate of an animal based on random number,

 cumsum.move<-cumsum(data.frame(move)) # Cumulative sum of probabilities

In this cumulative matrix, the letters "A","B","C","D" and "E" represent different sites, "NA" represents the probability of leaving and coming back on a future time step and "left" represents the probability of leaving the system and not coming back. Then I use a list of random numbers to compare against the cumulative probability matrix and determine the "fate" of that particular animal.

for(o in 1:reps){

result<-matrix(as.character(""),steps) # Vector for storing sites
x<-sample(random,steps,replace=TRUE) # sample array of random number 
time.step<-data.frame(x) # time steps used in the simulation (i)
colnames(time.step)<-c("time.step")
time.step$event<-""

j<-sample(1:N,1,replace=T) # first column to be selected 
k<-sample(1:N,1,replace=T) # selection of column for ind. that move in/out  

for(i in 1:steps){
  for (t in 1:(N+1)){
    if(time.step$time.step[i]<cumsum.move[t,j]){
    time.step$event[i]<-to.r[t]
    break
   }
 }

 ifelse(time.step$event[i]=="",break,NA) 
 result[i]<-time.step$event[i]
 j<-which(to.r==result[i]) 
 if(length(j)==0){j<-k} 
}

result<-time.step$event

# calculate frequency/use for each replicate

use<-table(result)
use.tab<-data.frame(use)
use.tab1<-use.tab[-which(use.tab==""),]
mergeuse<-merge(use.tab2,use.tab,all.x=TRUE)
mergeuse[is.na(mergeuse)]<-0

# insert data into empty matrix

rep.movements[o,]<-result
rep.use[o,]<-mergeuse$Freq

}

} 
  # for the outer loop I have some matrices to store the results for each parameter,
  # but for this example this is not important
rep.movements
rep.use

Now, the main problem is that it takes to long to run all the simulations for each initial parameter (10 values in this example). I need to find a better/more efficient way to run 1000 simulations / 20 sites across all the initial parameters. I am not too familiar with functions or other ways to speed up this task. Any ideas or recommendations will be appreciated.

Thanks a lot in advance,

user1626688
  • 1,583
  • 4
  • 18
  • 27

1 Answers1

1

Let's wrap your code in a function first. I also added set.seed commands to make the result reproducible. You need to remove those before running the simulation.

sim1 <- function(reps=50, steps=100 ) {

  N<-5 # number of sites
  sites<-LETTERS[seq(from=1,to=N)]
  to.r<-rbind(sites)

  p.move.r<-seq.int(0.05,0.90,by=0.05) # prob of moving to a new site
  p.leave<-0.01*p.move.r # prob of leaving the system w/out returning
  p.move.out<-0.01*p.move.r # prob of moving in/out
  p.stay<-1-(p.move.r+p.leave+p.move.out) # prob of staying in the same site 

  set.seed(42)
  random<-runif(10000,0,1) # generating numbers from a random distribution

  cumsum.move <- read.table(text="A         B         C         D         E    NA.   left
                            A    0.0820000 0.3407822 0.6392209 0.3516242 0.3925942 0.1964 0.1964
                            B    0.1254937 0.4227822 0.6940040 0.3883348 0.4196630 0.3928 0.3928
                            C    0.7959865 0.8730183 0.7760040 0.7930623 0.8765180 0.5892 0.5892
                            D    0.8265574 0.8980259 0.8095507 0.8750623 0.9000000 0.7856 0.7856
                            E    0.9820000 0.9820000 0.9820000 0.9820000 0.9820000 0.9820 0.9820
                            NA.   0.9910000 0.9910000 0.9910000 0.9910000 0.9910000 0.9910 0.9910
                            left 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000 1.0000",header=TRUE)

  cumsum.move <- as.matrix(cumsum.move)

  for(o in 1:reps){

    result<-matrix(as.character(""),steps) # Vector for storing sites
    set.seed(42)
    x<-sample(random,steps,replace=TRUE) # sample array of random number 
    time.step<-data.frame(x) # time steps used in the simulation (i)
    colnames(time.step)<-c("time.step")
    time.step$event<-""

    set.seed(41)
    j<-sample(1:N,1,replace=T) # first column to be selected 
    set.seed(40)
    k<-sample(1:N,1,replace=T) # selection of column for ind. that move in/out  

    for(i in 1:steps){
      for (t in 1:(N+1)){
        if(time.step$time.step[i]<cumsum.move[t,j]){
          time.step$event[i]<-to.r[t]
          break
        }
      }

      ifelse(time.step$event[i]=="",break,NA) 
      result[i]<-time.step$event[i]
      j<-which(to.r==result[i]) 
      if(length(j)==0){j<-k} 
    }

    result<-time.step$event
  }
  result
}

Note that result is overwritten during each iteration over o. I don't think you want that, so I fixed that. Also, you use a data.frame inside a loop. As a general rule you should avoid data.frames inside loops like the plague. Although they are very convenient, in terms of efficiency they are terrible.

sim2 <- function(reps=50, steps=100) {

  N<-5 # number of sites
  sites<-LETTERS[seq(from=1,to=N)]
  to.r<-rbind(sites)

  p.move.r<-seq.int(0.05,0.90,by=0.05) # prob of moving to a new site
  p.leave<-0.01*p.move.r # prob of leaving the system w/out returning
  p.move.out<-0.01*p.move.r # prob of moving in/out
  p.stay<-1-(p.move.r+p.leave+p.move.out) # prob of staying in the same site 

  set.seed(42)
  random<-runif(10000,0,1) # generating numbers from a random distribution

  cumsum.move <- read.table(text="A         B         C         D         E    NA.   left
                            A    0.0820000 0.3407822 0.6392209 0.3516242 0.3925942 0.1964 0.1964
                            B    0.1254937 0.4227822 0.6940040 0.3883348 0.4196630 0.3928 0.3928
                            C    0.7959865 0.8730183 0.7760040 0.7930623 0.8765180 0.5892 0.5892
                            D    0.8265574 0.8980259 0.8095507 0.8750623 0.9000000 0.7856 0.7856
                            E    0.9820000 0.9820000 0.9820000 0.9820000 0.9820000 0.9820 0.9820
                            NA.   0.9910000 0.9910000 0.9910000 0.9910000 0.9910000 0.9910 0.9910
                            left 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000 1.0000",header=TRUE)

  cumsum.move <- as.matrix(cumsum.move)

  res <- list()
  for(o in 1:reps){

    result<-character(steps) # Vector for storing sites
    set.seed(42)
    time.step<-sample(random,steps,replace=TRUE) # sample array of random number 
    #time.step<-data.frame(x) # time steps used in the simulation (i)
    #colnames(time.step)<-c("time.step")
    #time.step$event<-""
    event <- character(steps)

    set.seed(41)
    j<-sample(1:N,1,replace=T) # first column to be selected 
    set.seed(40)
    k<-sample(1:N,1,replace=T) # selection of column for ind. that move in/out  

    for(i in 1:steps){
      for (t in 1:(N+1)){
        if(time.step[i]<cumsum.move[t,j]){
          event[i]<-to.r[t]
          break
        }
      }

      ifelse(event[i]=="",break,NA) 
      result[i]<-event[i]
      j<-which(to.r==result[i]) 
      if(length(j)==0){j<-k} 
    }

    res[[o]]<-event
  }
  do.call("rbind",res)
}

Do both functions give the same result?

res1 <- sim1()
res2 <- sim2()
all.equal(res1,res2[1,])
[1] TRUE

Is the new version faster?

library(microbenchmark)
microbenchmark(sim1(),sim2())

Unit: milliseconds
    expr       min        lq    median        uq       max
1 sim1() 204.46339 206.58508 208.38035 212.93363 269.41693
2 sim2()  77.55247  78.39698  79.30539  81.73413  86.84398

Well, a factor of three is quite nice already. I don't see much possibilities to further improve the loops, because of those breaks. That leaves only parallelization as an option.

sim3 <- function(ncore=1,reps=50, steps=100) {
  require(foreach)
  require(doParallel)


  N<-5 # number of sites
  sites<-LETTERS[seq(from=1,to=N)]
  to.r<-rbind(sites)

  p.move.r<-seq.int(0.05,0.90,by=0.05) # prob of moving to a new site
  p.leave<-0.01*p.move.r # prob of leaving the system w/out returning
  p.move.out<-0.01*p.move.r # prob of moving in/out
  p.stay<-1-(p.move.r+p.leave+p.move.out) # prob of staying in the same site 

  set.seed(42)
  random<-runif(10000,0,1) # generating numbers from a random distribution

  cumsum.move <- read.table(text="A         B         C         D         E    NA.   left
                            A    0.0820000 0.3407822 0.6392209 0.3516242 0.3925942 0.1964 0.1964
                            B    0.1254937 0.4227822 0.6940040 0.3883348 0.4196630 0.3928 0.3928
                            C    0.7959865 0.8730183 0.7760040 0.7930623 0.8765180 0.5892 0.5892
                            D    0.8265574 0.8980259 0.8095507 0.8750623 0.9000000 0.7856 0.7856
                            E    0.9820000 0.9820000 0.9820000 0.9820000 0.9820000 0.9820 0.9820
                            NA.   0.9910000 0.9910000 0.9910000 0.9910000 0.9910000 0.9910 0.9910
                            left 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000 1.0000",header=TRUE)

  cumsum.move <- as.matrix(cumsum.move)

  #res <- list()
  #for(o in 1:reps){
  cl <- makeCluster(ncore)
  registerDoParallel(cl)
  res <- foreach(1:reps) %dopar% {

    result<-character(steps) # Vector for storing sites
    set.seed(42)
    time.step<-sample(random,steps,replace=TRUE) # sample array of random number 
    #time.step<-data.frame(x) # time steps used in the simulation (i)
    #colnames(time.step)<-c("time.step")
    #time.step$event<-""
    event <- character(steps)

    set.seed(41)
    j<-sample(1:N,1,replace=T) # first column to be selected 
    set.seed(40)
    k<-sample(1:N,1,replace=T) # selection of column for ind. that move in/out  

    for(i in 1:steps){
      for (t in 1:(N+1)){
        if(time.step[i]<cumsum.move[t,j]){
          event[i]<-to.r[t]
          break
        }
      }

      ifelse(event[i]=="",break,NA) 
      result[i]<-event[i]
      j<-which(to.r==result[i]) 
      if(length(j)==0){j<-k} 
    }

    #res[[o]]<-event
    event
  }
  stopCluster(cl)
  do.call("rbind",res)
}

Same result?

res3 <- sim3()
all.equal(res1,c(res3[1,]))
[1] TRUE

Faster? (Let's use 4 cores on my Mac. You might try to get access to a server with a few more cores.)

microbenchmark(sim1(),sim2(),sim3(4))
Unit: milliseconds
     expr        min         lq     median         uq       max
1  sim1()  202.28200  207.64932  210.32582  212.69869  255.2732
2  sim2()   75.39295   78.95882   80.01607   81.49027  125.0866
3 sim3(4) 1031.02755 1046.41610 1052.72710 1061.74057 1091.2175

That looks terrible. However, the test is unfair to the parallel function. The function gets called 100 times with only 50 replicates. That means we get all the overhead of parallelization but almost no benefit from it. Let's make it more fair:

microbenchmark(sim1(rep=10000),sim2(rep=10000),sim3(ncore=4,rep=10000),times=1)
Unit: seconds
                          expr      min       lq   median       uq      max
1            sim1(rep = 10000) 42.16821 42.16821 42.16821 42.16821 42.16821
2            sim2(rep = 10000) 16.13822 16.13822 16.13822 16.13822 16.13822
3 sim3(ncore = 4, rep = 10000) 38.18873 38.18873 38.18873 38.18873 38.18873

Better, but still not impressive. If the numbers of replicates and steps were increased further, the parallel function would look good, but I don't know if you need that.

Roland
  • 127,288
  • 10
  • 191
  • 288
  • Thanks a lot Roland, I actually did some edits to the original code to include a more detailed/reproducible example of the code I am using (which also has some additional loops...). I am going to check your script first. Thanks for your input! – user1626688 Nov 18 '12 at 11:15
  • Hi Roland, I am not sure if this code will be useful when including the more detailed edits... since I have to run the simulations for each parameter. – user1626688 Nov 18 '12 at 11:57
  • I am sure that your new code can be optimized in a similar way. The main point is to use efficient data structures (vectors or lists) to store values. Use the `Rprof` function to find out which operation takes the most time and try to optimize that code. Use vectorized functions whenever possible. And look into parallelization. – Roland Nov 18 '12 at 12:10
  • Hi Roland, what is the purpose of the function do.call("rbind",res)? – user1626688 Nov 19 '12 at 03:00
  • That combines all vectors in a list into a matrix or data.frame. – Roland Nov 19 '12 at 07:59