How to avoid multiple loops with multiple variables in R

Question

I have two datasets stored in tables: one is a set of [a, b] and the other is [x, Sx, y, Sy, rho]. I have a probability function f that requires (a, b, x, Sx, y, Sy, rho). In the end I want to find the sum of the probability results over all [x, Sx, y, Sy, rho] for the first [a, b], then the sum over all [x, Sx, y, Sy, rho] for the second [a, b], and so on.

I would like to have a few hundred rows in the [x, Sx, y, Sy, rho] file and a few hundred thousand rows in the [a, b] file.

I'm wondering if there is a way to do this without using two loops. I've tried the following; it doesn't quite work the way I want it to, and I know it will be far too slow.

I don't know if it will help but I've added the function in the code. Sorry that the function itself is a mess and not formatted properly.

# data  file with (a, b)
data            <- matrix( c(1, 0, 1, 1, 0.5, 0), nrow=3, ncol=2) 
colnames(data)  <- c("a", "b") 
Ndat            <- dim(data)
Ndata           <- Ndat[1]

# data2 file with (x, Sx, y, Sy, rho)
# values are listed one row at a time, so fill the matrix by row
data2           <- matrix( c(1, 0.1, 1, 0.1, 0.002, 2, 0.1, 2, 0.1, 0.000001,
                             2, 0.1, 1, 0.1, 0.002), nrow=3, ncol=5, byrow=TRUE)
colnames(data2) <- c("x", "Sx", "y", "Sy", "rho") 
Ndat2           <- dim(data2)
Ndata2          <- Ndat2[1]

# function requires variables (a, b, x, Sx, y, Sy, rho)
Prob  <- function(a, b, Xi, sX, Yi, sY, rho) {
  # term shared by the exponent and the normalisation
  v <- a ^ 2 * sX ^ 2 - 2 * a * rho * sX * sY + sY ^ 2
  sqrt(1 + a ^ 2) * exp(-(b + a * Xi - Yi) ^ 2 / (2 * v)) *
    sqrt((1 - rho ^ 2) / v) / (sqrt(2 * pi) * sqrt(1 - rho ^ 2))
}

# Here is my weak attempt
Table <- NULL
Table <- for (j in 1:Ndata) { 
   sum (for (i in 1:Ndata2) {
   Datatable[i] = Prob(data[j, a], data[j, b], data2[i, x], 
                 data2[i, Sx], data2[i, y], data2[i, Sy], 
                 data2[i, rho])
   })
}

I am having a very hard time wrapping my head around the apply functions and when they can/should be used. I know that I've probably not added enough information, so any suggestions that can help me out would be great. I'm pretty new to programming as well as R, so please forgive any inappropriate vocabulary or formatting.

There is probably a better way to define the number of rows in data to get Ndata as a global, but these are the first I stumbled across.

The function should not be recursive, but I see now that it is as I've written it. I have spent many hours on intro tutorials to R and still am having a very hard time understanding how the apply suite of functions are best implemented.

I would like one iteration to apply this function to each row in data2 using a, b from the first row of data, then sum the probabilities for all of those. The next iteration should do the same with a, b from row 2 of data, again applied to every row of data2, and so on.

Are you trying for some form of recursive programming? Your `f` function calls itself, but it isn't clear how its supposed to work. — gung - Reinstate Monica, Dec 19 '14 at 19:24

jlhoward · Answer 1 · 2014-12-19T21:18:25.287

I have a feeling there's an easier way to do this, but something like this will probably work.

f <- function(a,b,x,y,z) a+b+x+y+z
f.new <- function(p1,p2) {
  p1=as.list(p1); p2=as.list(p2)
  f(p1$a,p1$b,p2$x,p2$y,p2$z)
}

data1 <- data.frame(a=1:10,b=11:20)
data2 <- data.frame(x=1:5,y=21:25,z=31:35)
indx  <- expand.grid(indx2=seq(nrow(data2)),indx1=seq(nrow(data1)))
result <- with(indx,f.new(data1[indx1,],data2[indx2,]))
sums   <- aggregate(result,by=list(rep(seq(nrow(data1)),each=nrow(data2))),sum)

You seem to want to evaluate a function for every combination of two sets of variables, the set of (a,b) and the set of (x, Sx, y, Sy, rho), then sum over the second set, for every instance of the first set.

So first this redefines the function f(...) to take two arguments, representing the two sets. This is f.new(...). You should probably define your original function that way - it will run faster.

Then we create a data frame, indx, that has two columns representing every combination of the row numbers in data1 and data2, and we call f.new(...) using data1 and data2 indexed by indx. This produces result, which holds the function evaluated at every combination of (a,b) and (x,y,z). Then we aggregate that to get the sums you specified.
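Applied to the function in the question, the same steps might look like the sketch below. It assumes Prob is vectorized over its arguments (it only uses arithmetic, exp and sqrt, so it should be) and that the tables carry columns named a, b and x, Sx, y, Sy, rho. The small data frames are illustrative: the (a, b) values are what the question's matrix() call produces, while reading the second table's values row by row is an assumption.

```r
# Prob from the question, laid out for readability (same arithmetic)
Prob <- function(a, b, Xi, sX, Yi, sY, rho) {
  v <- a^2 * sX^2 - 2 * a * rho * sX * sY + sY^2
  sqrt(1 + a^2) * exp(-(b + a * Xi - Yi)^2 / (2 * v)) *
    sqrt((1 - rho^2) / v) / (sqrt(2 * pi) * sqrt(1 - rho^2))
}

# Illustrative tables (assumed layout, see the note above)
data1 <- data.frame(a = c(1, 0, 1), b = c(1, 0.5, 0))
data2 <- data.frame(x = c(1, 2, 2), Sx = 0.1, y = c(1, 2, 1), Sy = 0.1,
                    rho = c(0.002, 0.000001, 0.002))

# Every (data2 row, data1 row) pair; indx2 varies fastest
indx <- expand.grid(indx2 = seq_len(nrow(data2)),
                    indx1 = seq_len(nrow(data1)))

# Columns are picked by name, so extra columns in either table are ignored
p <- Prob(data1$a[indx$indx1], data1$b[indx$indx1],
          data2$x[indx$indx2], data2$Sx[indx$indx2],
          data2$y[indx$indx2], data2$Sy[indx$indx2],
          data2$rho[indx$indx2])

# One sum per row of data1
sums <- tapply(p, indx$indx1, sum)
```

Because each column is selected with $ and its name, the tables can hold other variables without affecting the call. If the data are matrices rather than data frames, data1[, "a"] would take the place of data1$a.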

This approach is memory intensive (result will have roughly 10 million elements), but it will run faster than loops.
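If holding every combination in memory is a problem, one alternative (a different technique from the expand.grid approach, offered only as a sketch) is a single loop over the shorter table that adds to a running total for every (a, b) row at once. It again assumes Prob is vectorized and that the columns are named as in the question; the sample tables are illustrative, with the second one read row by row as an assumption.

```r
# Prob from the question, laid out for readability (same arithmetic)
Prob <- function(a, b, Xi, sX, Yi, sY, rho) {
  v <- a^2 * sX^2 - 2 * a * rho * sX * sY + sY^2
  sqrt(1 + a^2) * exp(-(b + a * Xi - Yi)^2 / (2 * v)) *
    sqrt((1 - rho^2) / v) / (sqrt(2 * pi) * sqrt(1 - rho^2))
}

# Illustrative tables (assumed layout, see the note above)
data1 <- data.frame(a = c(1, 0, 1), b = c(1, 0.5, 0))
data2 <- data.frame(x = c(1, 2, 2), Sx = 0.1, y = c(1, 2, 1), Sy = 0.1,
                    rho = c(0.002, 0.000001, 0.002))

# Running total, one entry per row of data1
sums <- numeric(nrow(data1))

# Loop over the short table only; each pass handles all (a, b) rows at once
for (i in seq_len(nrow(data2))) {
  sums <- sums + Prob(data1$a, data1$b,
                      data2$x[i], data2$Sx[i],
                      data2$y[i], data2$Sy[i], data2$rho[i])
}
```

With a few hundred rows in data2 this is a few hundred vectorized calls, and the largest object kept in memory is a vector as long as data1.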

thanks very much @jlhoward, I think this will work quite well, however I can't get the `f.new(...)` function to return `results`. I think there may be a problem in that my data is not formatted cleanly, there are other variables within the tables where the data is stored. Is there a way to tell the new function where to find the variables using column titles? I'm having a hard time understanding where and how the variables are called from the sets of data you have created here. — Jesse, Feb 11 '15 at 05:55
