3

First, let us generate some data:

library(data.table)
data <- data.table(date = as.Date("2015-05-01")+0:299)
set.seed(123)
data[,":="(
   a = round(30*cumprod(1+rnorm(300,0.001,0.05)),2),
   b = rbinom(300,5000,0.8)
 )]

Then I want to apply a custom function to multiple columns multiple times without typing each call out manually. For example, my custom function is add <- function(x,n) (x+n).

My for loop code is as follows:

add <- function(x,n) (x+n)
n <- 3
freture_old <- c("a","b")
for(i in 1:n ){
  data[,(paste0(freture_old,"_",i)) := add(.SD,i),.SDcols =freture_old ]
}

Could you please show me an lapply version to replace the for loop?

Tao Hu
  • 1
    What's wrong with a `for` loop? Note: `lapply` is also a loop. – Parfait May 02 '19 at 18:00
  • In R, the `lapply` loop seems to run faster than the `for` loop. – Jason Johnson May 02 '19 at 19:22
  • 2
    I agree with Parfait re using for loop here, though you ought to use lapply to iterate over .SD, like `for (i in 1:n) data[, paste0(freture_old,"_",i) := lapply(.SD, add, i), .SDcols=freture_old][]` – Frank May 02 '19 at 20:46
  • 1
    @Frank - Yes +1 for iterating over the `.SD` in the `data.table` and I will modify my answer to include that since that is really where the speed factor comes in using a `data.table`. And I tend to agree for small n there is no real speed difference between using @Frank's solution and my solution. However, once n >10000 there is a more noticeable speed boost for using `lapply` outside the `data.table` rather than the `for` loop. – Jason Johnson May 02 '19 at 22:10
  • @Frank Could you please explain why I should use lapply to iterate over .SD? I know both methods work, but I don't know which is better. I am looking forward to your answer. – Tao Hu May 03 '19 at 03:21
  • @Frank I noticed that there is an empty `[]` at the end of your code. If I don't write it, I have to type `data` a second time to print the result. Could you please explain it? Thanks a lot. – Tao Hu May 03 '19 at 03:34
  • .SD is a subset of the data.table, which is a list of vectors. `lapply` is designed for iterating over such lists so you can add the number to each vector. Adding a number to a table might work, but I'm guessing it's messy and maybe inefficient (eg, coercing the table to a data.frame). Re the [], the data.table package author gives some background here https://stackoverflow.com/a/15268392 – Frank May 03 '19 at 10:43
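The two points in the comments above can be illustrated with a minimal sketch. The small table `dt` and the object `res` are hypothetical stand-ins introduced here for illustration; only `add` comes from the question.

```r
library(data.table)

add <- function(x, n) x + n
dt <- data.table(a = c(1, 2, 3), b = c(10, 20, 30))  # hypothetical small table

# .SD is a list of column vectors, so lapply() calls add() once per column
res <- dt[, lapply(.SD, add, 1), .SDcols = c("a", "b")]

# := returns its result invisibly; the trailing [] makes the updated table print
dt[, c("a_1", "b_1") := lapply(.SD, add, 1), .SDcols = c("a", "b")][]
```

Without the trailing `[]`, the assignment still happens; the table is simply not printed.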

2 Answers

5

If all you want is an lapply loop instead of a for loop, you do not need to change much. With a data.table it is even easier, because := modifies the table by reference: every iteration updates data in place, with no need to assign the result back. The only thing I add is an invisible() wrapper to suppress the console output.

lapply(1:n,function(i) data[,paste0(freture_old,"_",i):=lapply(.SD,add,i),.SDcols =freture_old])

Note that if you assign this lapply to an object, you get a list with one element per iteration (3 in this case). Because := works by reference, each element refers to the same, already-updated data.table, so the list is of no use; just run the code without assigning it to a variable. If you do not assign it to anything, however, every element is printed to the console. So what I would suggest is to wrap an invisible around it like this:

invisible(lapply(1:n,function(i) data[,paste0(freture_old,"_",i):=lapply(.SD,add,i),.SDcols =freture_old]))
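If the goal is a single `:=` call with no outer loop at all, one possible sketch is to build every new column inside `j` and flatten the nested list one level. This is a variation under assumptions, not the code from the question: `dt`, `cols`, and `new_names` are hypothetical names introduced here, with `cols` playing the role of `freture_old`.

```r
library(data.table)

add <- function(x, n) x + n
n <- 3
cols <- c("a", "b")                            # plays the role of freture_old
dt <- data.table(a = c(1, 2), b = c(10, 20))   # hypothetical small table

# Names in the same order as the values built below: a_1, b_1, a_2, b_2, ...
new_names <- unlist(lapply(seq_len(n), function(i) paste0(cols, "_", i)))

# For each i, lapply(.SD, add, i) returns a list of columns;
# unlist(..., recursive = FALSE) flattens the list of lists by one level
dt[, (new_names) := unlist(lapply(seq_len(n), function(i) lapply(.SD, add, i)),
                           recursive = FALSE),
   .SDcols = cols]
```

Since nothing is returned visibly from `:=`, no `invisible()` wrapper is needed in this form.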

Hope this helps and let me know if you need me to add anything else to this answer. Good luck!

Jason Johnson
1

An option without an R "loop" (quoted since ultimately it's a loop at some level):

data[,
    c(outer(freture_old, seq_len(n), paste, sep="_")) :=
        as.data.table(matrix(outer(as.matrix(.SD), seq_len(n), add), .N)),
    .SDcols=freture_old]
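To see why the reshaping step lines up with the generated names, here is a small base R sketch of what `outer` returns. The matrix `m` and the objects `arr` and `flat` are hypothetical, introduced only for illustration.

```r
add <- function(x, n) x + n

# Hypothetical 2-row stand-in for as.matrix(.SD), with columns a and b
m <- matrix(c(1, 2, 10, 20), nrow = 2, dimnames = list(NULL, c("a", "b")))

# outer() over a matrix and a vector gives a rows x columns x n array
arr <- outer(m, 1:3, add)

# Flattening in column-major order yields the columns
# a+1, b+1, a+2, b+2, a+3, b+3, matching the order of the generated names
flat <- matrix(arr, nrow(m))
```

Both `outer` calls in the answer traverse their first argument fastest, which is why the names and the values stay aligned.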

Or equivalently in base R:

setDF(data)
cbind(data, matrix(outer(as.matrix(data[, freture_old]), seq_len(n), add), 
    nrow(data)))
chinsoon12