
I have a very large data.table in R (~200,000 rows), and I want to apply a non-vectorized function to each row. The function takes inputs from two columns of the data.table. The value in one of those columns is a key into a separate list, each element of which holds ~1,000,000 numbers. Here is a simplified case using mtcars:

library(dplyr)
library(data.table)

# set up a fake list for my function call
gears <- mtcars %>% arrange(gear) %>% pull(gear) %>% unique
gear_lst <- lapply(gears, function(x){rnorm(1000000, mean = x^2, sd = x*2)}) %>% setNames(.,gears)

#make a mega data table     
mega_mtcars <- sapply(mtcars, rep.int, times = 10000) %>% as.data.table

#this is the function I want to call    
my_function <- function(x,y){
    sum(x > gear_lst[[y]])
}

# rowwise call is slow
out <- mega_mtcars %>% mutate(gear_c = as.character(gear)) %>% rowwise %>% mutate(out = my_function(mpg, gear_c))

One thing I tried was adding a nested column holding the matching gear_lst element for each row, so that I could call a vectorized function. However, because the list is large, I ran out of memory trying to create that data structure.

Update: @akrun suggested a few approaches. I wasn't able to test them on my original mega_mtcars because it's too big, so I shrank it 100-fold. Here is the performance so far (none of them seems to improve on the original rowwise method):

#make a smaller mega_mtcars
mega_mtcars <- sapply(mtcars, rep.int, times = 100) %>% as.data.table

# use rowwise from dplyr
system.time(mega_mtcars %>% rowwise %>% mutate(out = my_function(mpg, as.character(gear))))
   user  system elapsed 
  8.086   2.860  10.941 
    
# use Map with data.table
system.time(mega_mtcars[, out := unlist(Map(my_function, x = mpg, y = as.character(gear)))])
  user  system elapsed 
  7.843   2.815  10.654 
    
# use dapply from collapse package
system.time(dapply(mega_mtcars[, .(mpg, gear)], MARGIN = 1, function(x) my_function(x[1], as.character(x[2]))))
   user  system elapsed 
  7.957   3.167  11.127 

Any other ideas?

coffee
  • Would you mind adding the `julia` tag to your post? I recently found out about the language and would like to see how vectorized computation compares to a fast loop. – هنروقتان Jan 06 '22 at 21:47

2 Answers


With data.table, a row-wise operation can be done by grouping over the row sequence:

library(data.table)
mega_mtcars[, out := my_function(mpg, as.character(gear)) , 
       by = 1:nrow(mega_mtcars)]
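
As a quick sanity check, here is a minimal sketch showing that grouping by the row sequence gives the same result as an explicit loop over rows. The small `gear_lst` and the `dt` table are toy stand-ins for the question's objects (assumptions, shrunk so the snippet runs instantly), not the original data.

```r
library(data.table)

# Toy stand-ins for the question's objects (hypothetical, much smaller sizes).
set.seed(1)
gear_lst <- list(`3` = rnorm(1000, 9, 6),
                 `4` = rnorm(1000, 16, 8),
                 `5` = rnorm(1000, 25, 10))
my_function <- function(x, y) sum(x > gear_lst[[y]])

dt <- as.data.table(mtcars)[, .(mpg, gear)]

# One group per row, so j sees a single mpg and a single gear value each time.
dt[, out := my_function(mpg, as.character(gear)), by = seq_len(nrow(dt))]

# Reference result from an explicit loop over the rows.
ref <- vapply(seq_len(nrow(dt)),
              function(i) my_function(dt$mpg[i], as.character(dt$gear[i])),
              integer(1))

stopifnot(identical(dt$out, ref))
```

Because there is still one call to `my_function` per row, this form is mainly a convenience; it should not be expected to run faster than `rowwise`.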
akrun
  • This is really great @akrun and motivates me to explore `data.table` more. I couldn't get it to work in my own attempts! – TarJae Jan 02 '22 at 18:45
  • @TarJae you may also use `Map` here, i.e. `Map(my_function, x = mpg, y = as.character(gear))`, without the grouping – akrun Jan 02 '22 at 18:47
  • This is interesting, but it doesn't seem to be faster than the rowwise method. Am I wrong? – coffee Jan 02 '22 at 19:02
  • @coffee it wouldn't be, because both are doing similar things. If you want to make it faster, try `Map`, i.e. `mega_mtcars[, out := unlist(Map(my_function, x = mpg, y = as.character(gear)))]` – akrun Jan 02 '22 at 19:03
  • @akrun it performs the same, unfortunately – coffee Jan 02 '22 at 19:19
  • @coffee then you may need `dapply` from `collapse` i.e. `dapply(mega_mtcars[, .(mpg, gear)], MARGIN = 1, function(x) my_function(x[1], as.character(x[2])))` which is a bit more optimized for rowwise operations – akrun Jan 02 '22 at 19:21
  • @akrun, still not better. I have updated the original post with the benchmark from your suggestions. – coffee Jan 02 '22 at 19:38
  • @coffee rowwise operations are time-consuming. You could split the data into chunks and process them in parallel – akrun Jan 02 '22 at 19:41
  • @coffee Also, the `as.character` step can be moved outside, i.e. `mega_mtcars[, gear := as.character(gear)]`, so that we don't have to convert to character in each iteration – akrun Jan 02 '22 at 19:42
  • @coffee based on your updated example data, a `for` loop gives a better timing: `system.time({ out <- numeric(nrow(mega_mtcars)); mega_mtcars[, gear := as.character(gear)]; for(i in seq_along(out)) { out[i] <- my_function(mega_mtcars$mpg[i], mega_mtcars$gear[i]) } })` – akrun Jan 02 '22 at 19:58
  • @akrun, I guess I will have to look into parallel processing. I've used `multidplyr` before, but it doesn't work with `rowwise`. – coffee Jan 02 '22 at 22:37
  • @akrun the dev version now even allows `mega_mtcars[, out := my_function(mpg, as.character(gear)), by=.I]` ;) – Ben373 Jan 06 '22 at 19:48
  • @Ben373 thanks, does it improve the efficiency when you use `.I`? – akrun Jan 06 '22 at 19:49
  • @akrun nah only plays more nicely together when you also select with `DT[i, ]` than `1:nrow(DT)`. – Ben373 Jan 06 '22 at 19:56
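
The chunk-and-parallelize idea from the comments could be sketched roughly as follows, using `mclapply` from the base `parallel` package. The toy `gear_lst`, the chunk count of 4 and the core count of 2 are arbitrary assumptions for illustration; this is not a tuned or benchmarked setup.

```r
library(parallel)

# Toy stand-ins for the question's objects (hypothetical, much smaller sizes).
set.seed(1)
gear_lst <- list(`3` = rnorm(1000, 9, 6),
                 `4` = rnorm(1000, 16, 8),
                 `5` = rnorm(1000, 25, 10))
my_function <- function(x, y) sum(x > gear_lst[[y]])

mpg  <- rep(mtcars$mpg, times = 10)
gear <- as.character(rep(mtcars$gear, times = 10))  # convert once, outside the loop

# Split the row indices into contiguous chunks (4 is an arbitrary choice).
n_chunks <- 4L
idx      <- seq_along(mpg)
chunks   <- split(idx, cut(idx, n_chunks, labels = FALSE))

# Forking is unavailable on Windows, so fall back to a single core there.
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

res <- mclapply(chunks, function(rows) {
  vapply(rows, function(i) my_function(mpg[i], gear[i]), integer(1))
}, mc.cores = n_cores)

# Chunks are contiguous and returned in order, so this restores row order.
out <- unlist(res, use.names = FALSE)
```

On Unix-like systems the forked workers can read `gear_lst` without each receiving an explicit copy, which matters when the list is large. Whether this pays off in practice depends on how expensive each row is relative to the overhead of starting the workers.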

Does sorting the values in gear_lst help?
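
A minimal sketch of what sorting could buy, assuming the per-row computation really is a count like `sum(x > v)` as in the question's example: once each vector is sorted, the count can be found by binary search with base R's `findInterval`, which is also vectorized over `x`. The toy `gear_lst` and the `mpg`/`gear` vectors are small stand-ins, not the original data.

```r
# Toy stand-ins for the question's objects (hypothetical, much smaller sizes).
set.seed(1)
gear_lst <- list(`3` = rnorm(1000, 9, 6),
                 `4` = rnorm(1000, 16, 8),
                 `5` = rnorm(1000, 25, 10))

# Sort each reference vector once, up front.
sorted_lst <- lapply(gear_lst, sort)

mpg  <- rep(mtcars$mpg, times = 10)
gear <- as.character(rep(mtcars$gear, times = 10))

out <- integer(length(mpg))
for (g in names(sorted_lst)) {
  rows <- which(gear == g)
  # With left.open = TRUE, findInterval returns the number of elements
  # strictly less than x, which is exactly sum(x > v).
  out[rows] <- findInterval(mpg[rows], sorted_lst[[g]], left.open = TRUE)
}
```

This replaces a full scan of each vector per row with one sort per list element plus a binary search per row. It only applies when the real function can be expressed as a rank or count against the vector; a more complicated computation may not reduce this way.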

  • I guess it might, but I doubt it would help much because `my_function` is a vectorized operation and it's very fast. – coffee Jan 02 '22 at 22:35
  • I see. Then I guess just run the naive version, because the running time should be around 10.94 * 100 = 1094 seconds – User1909203 Jan 02 '22 at 23:52
  • But my real dataset as well as the computation is more complicated than this, and it's taking forever. – coffee Jan 03 '22 at 14:34