0

I am trying to optimise a code that I have written using the apply() and similar functions (e.g. lapply()). Unfortunately I do not see much of improvement so searching I came across this post apply() is slow - how to make it faster or what are my alternatives? where a suggestion is to use the function with() instead of apply() which is certainly much faster.

What I want to do is to apply a user defined function to every row of a matrix. This function takes as input the data from the row, makes some calculations and returns a vector with the results. A toy example where I use the apply() function, the with() and a vectorized version:

#Generate a matrix 10x3
prbl1=matrix(runif(30),nrow=10)
prbl2=data.frame(prbl1)
prbl3=prbl2

#function for the apply()
fn1=function(row){
  x=row[1]
  y=row[2]
  z=row[3]
  k1=2*x+3*y+4*z
  k2=2*x*3*y*4*z
  k3=2*x*y+3*x*z
  return(c(k1,k2,k3))
}

#function for the with()
fn2=function(x,y,z){
  k1=2*x+3*y+4*z
  k2=2*x*3*y*4*z
  k3=2*x*y+3*x*z
  return(c(k1,k2,k3))
}

#Vectorise fn2
fn3=Vectorize(fn2)



 #apply the functions:
rslt1=t(apply(prbl1,1,fn1))
rslt2=t(with(prbl2,fn2(X1,X2,X3)))
rslt2=cbind(rslt2[1:10],rslt2[11:20],rslt2[21:30])
rslt3=t(with(prbl3,fn3(X1,X2,X3)))

All three produce the same output, a matrix 10x3 which is what I want. Nevertheless, notice at rslt2 that I need to bind the results as the output of using with() is a vector of length 300. I suspected that this is due to the fact that the function is not vectorised (if I understood this correctly). In rslt3 I am using a vectorised version of fn2 which generated the output in the expected way.

When I compare the performance of the three, I get:

library(rbenchmark)
benchmark(rslt1=t(apply(prbl1,1,fn1)),
          rslt2=with(prbl2,fn2(X1,X2,X3)),
          rslt3=with(prbl3,fn3(X1,X2,X3)),
          replications=1000000)

   test replications elapsed relative user.self sys.self user.child sys.child
1 rslt1      1000000  103.51    7.129    102.63     0.02         NA        NA
2 rslt2      1000000   14.52    1.000     14.41     0.01         NA        NA
3 rslt3      1000000  123.44    8.501    122.41     0.05         NA        NA

where with() without vectorisation is definitely faster.

My question: Since rslt2 is the most efficient approach, is there a way that I can use this correctly without the need to bind the results afterwards? It does the job but I feel is not efficient coding.

Community
  • 1
  • 1
KGeor
  • 107
  • 6
  • 1
    `with` is not used to apply a function to every row. What it does is attached the columns of the dataframe to your working environment so you can reference them as variables. The reason `with` is faster here is because it uses a vectorized version of multiplication working with all your variables while apply is being called row-by-row. – Karolis Koncevičius Jun 22 '15 at 16:59
  • 2
    So for example you can call your function like this: `fn2(prbl1[,1], prbl1[,2], prbl1[,3])`. No need for any `with`. – Karolis Koncevičius Jun 22 '15 at 17:00

1 Answers1

2

The first and third functions you give are being applied 1 row at a time, so are called 10 times in your example. The second function is taking advantage of the fact that multiplication and addition in R are already vectorised and so using any form of loop or ply function is unnecessary. The function is only called once. If you wanted to use your current code, all you'd need to do is change the c to cbind in fn2.

fn2=function(x,y,z){
  k1=2*x+3*y+4*z
  k2=2*x*3*y*4*z
  k3=2*x*y+3*x*z
  return(cbind(k1,k2,k3))
}

All that with does is evaluate the expression it's given in the list, data.frame or environment given. So with(prbl2,fn2(X1,X2,X3)) is entirely equivalent to fn2(prbl2$X1, prbl2$X2, prbl2$X3).

Is this your real function? If it is, then problem solved. If not, then it depends on whether your real function consists entirely of operations and functions that already are vectorised or can be replaced with vectorised equivalents.

For the amended function per the comments:

Single row:

fn1 <- function(row){
  x <- row[1]
  y <- row[2]
  z <- row[3]
  k1 <- 2*x+3*y+4*z
  k2 <- 2*x*3*y*4*z
  k3 <- 2*x*y+3*x*z
  if (k1>0 & k2>0 &k3>0){
    return(cbind(k1,k2,k3))
  } else {
    k1 <- 5*x+3*y+4*z
    k2 <- 5*x*3*y*4*z
    k3 <- 5*x*y+3*x*z
    if (k1<0 || k2<0 || k3<0) {
      return(cbind(0,0,0))
    } else {
      return(cbind(k1,k2,k3))
    }
  }
}

Whole matrix:

fn2 <- function(mat) {
  x <- mat[, 1]
  y <- mat[, 2]
  z <- mat[, 3]
  k1 <- 2*x+3*y+4*z
  k2 <- 2*x*3*y*4*z
  k3 <- 2*x*y+3*x*z
  l1 <- 5*x+3*y+4*z
  l2 <- 5*x*3*y*4*z
  l3 <- 5*x*y+3*x*z
  out <- array(0, dim = dim(mat))
  useK <- k1 > 0 & k2 > 0 & k3 > 0
  useL <- !useK & l1 >= 0 & l2 >= 0 & l3 >= 0
  out[useK, ] <- cbind(k1, k2, k3)[useK, ]
  out[useL, ] <- cbind(l1, l2, l3)[useL, ]
  out
}
Nick Kennedy
  • 12,510
  • 2
  • 30
  • 52
  • Many thanks for the answer. As you suspected this is not the real function. The actual function I am using goes on after the calculation of k1, k2 and k3 and checks whether they are all positive or not (in this example they are always positive) and if not make some more calculations and so on. I am using an if() statement, but when I am doing this without vectorisation I receive the expected error: the condition has length > 1 and only the first element will be used. Is there any way to keep this form of the function, use an if() statement and skip the vectorisation? – KGeor Jun 23 '15 at 22:00
  • You can usually replace `if` in that circumstance with `ifelse`. So, instead of `if (x == 1) y else z` you would use `ifelse(x == 1, y, z)` If you can post your full function I can be more specific. – Nick Kennedy Jun 24 '15 at 00:34
  • Let's say that I want to run a function like this: `fn1=function(row){ x=row[1] y=row[2] z=row[3] k1=2*x+3*y+4*z k2=2*x*3*y*4*z k3=2*x*y+3*x*z if(k1>0 & k2>0 &k3>0){ return(c(k1,k2,k3)) }else{ k1=5*x+3*y+4*z k2=5*x*3*y*4*z k3=5*x*y+3*x*z if(k1<0 || k2<0 || k3<0){ return(cbind(0,0,0)) }else{ return(cbind(k1,k2,k3)) } } }` Would it be possible to still use ifelse() or I need a vectorised version? – KGeor Jun 24 '15 at 12:01