I recently began experimenting with R as a language for genetic programming. I have slowly but surely been learning how R works and its best coding practices, yet I have hit a roadblock. Here is my situation: I have a dataset with roughly 700 rows, and each row has 400 or so columns. I have everything set up so that a function with as many parameters as there are columns is passed as an argument to an evaluation (fitness scoring) function. I want to go row by row through the dataset and pass the values in each column of a row into the function being evaluated. The first problem was figuring out how to pass the parameters into the function separately. By "separately" I mean that the function expects 400 parameters, not a vector of length 400. To do this I used the following:

do.call(f, as.list(parameters))

Here `parameters` is a vector made of the values in one row of the dataset with a month variable (1-12) appended. This works fine: I used a for loop to iterate over the 700 rows in the dataset, then another loop over the 12 months, and used the call above to accumulate a vector of outputs. The problem is that this is painfully slow, around 24-28 seconds per function, and I have 100-500 functions sent into this evaluation every generation of evolution. The bottom line is that this is not the way to go. Next I tried `sapply`, as below.

outputs <- sapply(1:12, function(m) sapply(rows, function(p) do.call(f, as.list(c(p, m)))))

This iterates over the months (1-12) and, for each month, over the 700 rows of the dataset. It took just as long. Any ideas on solutions would be helpful.

Paul Hiemstra
Isaac Drachman
  • Have you considered using `ddply` function from the `plyr` package? – Reuben L. May 07 '12 at 06:08
  • You can use `Rprof` to identify which parts of your code are the slowest. – Vincent Zoonekynd May 07 '12 at 06:14
  • I have taken a look at plyr. How would it be implemented? I have a list of vectors, each vector being a row containing the parameters. I need to send each row into the function along with a month variable. – Isaac Drachman May 07 '12 at 06:22
  • Do you have a mix of factors and numbers or just numbers? Does your function *REALLY* need to take 400 parameters instead of a vector or two?! That function must look pretty messy! – Tommy May 07 '12 at 06:41
  • All the items are numbers. I think I may have to take into account all these parameters. They are in the dataset. – Isaac Drachman May 07 '12 at 06:43
  • Could you include part of the function so we can get a sense of how it works? A version with 4-5 parameters to show the general idea would do fine. – Tommy May 07 '12 at 07:11
  • The function that the parameters are being passed into is different every run of the program. The functions are randomly generated as per genetic programming. The matter doesn't lie in them, just getting the parameters to them efficiently. – Isaac Drachman May 07 '12 at 07:28
  • @IsaacDrachman - I think the matter definitely lies in them :). See my updated answer. – Tommy May 07 '12 at 10:46

1 Answer

The main problem in cases like this is usually that the approach you are taking is the wrong one. I don't know enough about your specific case, but:

  1. Try to vectorize the calculations - so your function should operate on ALL rows instead of just one at a time.
  2. If you just store numbers in a data.frame, converting it to a matrix will usually speed up many operations.
  3. Don't write functions that take 400 parameters! 5 is probably on the high side too.
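
As a rough illustration of points 1 and 2, here is a minimal sketch. The data, the function `f_vec`, and the columns it uses are all made up for illustration; they stand in for a generated function rewritten to take the whole matrix plus a month and to return one value per row.

```r
set.seed(1)

# Hypothetical stand-in for the dataset: 700 rows, 400 numeric columns.
dat <- as.data.frame(matrix(runif(700 * 400), nrow = 700))

# Point 2: an all-numeric data.frame can be converted to a matrix.
m <- as.matrix(dat)

# Point 1: a made-up "generated" function that works on whole columns,
# so a single call scores all 700 rows for a given month.
f_vec <- function(x, month) {
  x[, 1] * x[, 2] + x[, 400] * month
}

# One call per month instead of one call per row and month.
# vapply checks that every call returns exactly nrow(m) numbers.
outputs <- vapply(1:12, function(mo) f_vec(m, mo), numeric(nrow(m)))

dim(outputs)  # rows of the dataset by months
```

Whether the generated functions can be expressed this way depends on which operations they use; arithmetic and most math functions in R are already vectorized.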

EDIT Since you generate the function, you should be able to instead generate a different version that takes a vector of values instead of that many parameters. Note that the vector you pass it can have names:

# Convert this:
f <- function(foo, bar) {
  foo+bar
}
do.call(f, list(foo=42, bar=13))

# To this:
f <- function(args) {
  args[["foo"]] + args[["bar"]] 
  # or even faster:
  #args[[1]] + args[[2]]
  # or fastest:
  #sum(args)
}
do.call(f, list(args=c(foo=42, bar=13)))
# or, simply
f(c(foo=42, bar=13))

... Calling a function with 1 parameter instead of 400 is about 60x faster! But note that this is just the overhead of calling the function. You need to measure how much time the actual function takes too. If that takes a second or more, then it doesn't matter how efficiently you call it or how efficient your loops are.
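
To check the call overhead on your own machine, a small timing sketch along these lines may help. The helper `make_f_many` and both function bodies are hypothetical; they only mimic a generated function with 400 separate parameters versus one that takes a single vector. No timings are shown because they vary by machine and R version.

```r
n <- 400
vals <- runif(n)

# Build a function with n formal arguments x1..xn (a hypothetical stand-in
# for a generated function); its body just adds the first and last argument.
make_f_many <- function(n) {
  arg_names <- paste0("x", seq_len(n))
  src <- sprintf("function(%s) x1 + x%d", paste(arg_names, collapse = ", "), n)
  eval(parse(text = src))
}
f_many <- make_f_many(n)

# Same computation, but taking one numeric vector.
f_one <- function(args) args[[1]] + args[[length(args)]]

reps <- 2000
t_many <- system.time(for (i in seq_len(reps)) do.call(f_many, as.list(vals)))
t_one  <- system.time(for (i in seq_len(reps)) f_one(vals))

# Both styles compute the same value; only the calling cost differs.
stopifnot(isTRUE(all.equal(do.call(f_many, as.list(vals)), f_one(vals))))

# Elapsed seconds for each calling style.
print(c(many_args = t_many[["elapsed"]], one_vector = t_one[["elapsed"]]))
```

This measures only the calling overhead, so it should be read next to a timing of the function bodies themselves.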

Tommy
  • @Isaac Drachmann: I agree with Tommy, you need to tell us more about your code in order to get more help than very general hints. – cbeleites unhappy with SX May 07 '12 at 07:44
  • Addition to Tommy's point 2: data.frames can hold matrices in their columns. So even if all 400 parameters are not of the same type you may be able to group them into a few matrices. You can even refer to those columns in formulas just as you specify "normal" columns. – cbeleites unhappy with SX May 07 '12 at 07:47