
I have a function that I'm applying to different sets of coordinates to create four new columns in my tibble. This function has a pretty long start-up time (it loads the genome into RAM, converts the tibble to GRanges, and retrieves sequences) but is relatively fast after that, so there's not much difference between 100 and 1,000,000 sequences. Is there any way to send each column in the mutate to a different core so they can be processed at the same time? I thought about using pivot_longer and then group + partition, but this got me thinking about whether there was a different way to accomplish this. A multi_mutate of sorts?
(I don't actually expect the multidplyr partition/collect to save much time in my case, given the small cost of additional coordinates, but if I could avoid the time cost of pivoting, which is still relatively small, and the mess in my code, that'd be cool.)

GenesRus
  • can you share a minimal example of what you have right now? – Edo Sep 11 '20 at 08:21
  • I'm more asking if something like `multi_mutate` exists rather than help with a specific block of code, but sure. What I have currently is essentially `d %>% mutate(c1 = long_f(a1,b1), c2 = long_f(a2,b2), c3 = long_f(a3,b3), c4 = long_f(a4,b4))` where a/b columns are integer columns and c columns are the resulting strings (genomic sequences). I could make an ID col and pivot_longer, group_by ID, and then use a traditional multidplyr partition on groups, but I would prefer to avoid pivoting back and forth if possible. It seems like it ought to be doable to send each mutate to a diff core. – GenesRus Sep 11 '20 at 18:49

1 Answer


I know you were looking for an existing package, but I couldn't find one. Other similar questions (like here or here) don't appear to provide a package either.

However, how about hacking it together yourself? Look at this example with furrr.

### libraries
library(dplyr)
library(furrr)

### data compliant with your example
d <- replicate(8, rnorm(100))
colnames(d) <- apply(expand.grid(letters[1:2], 1:4), 1, paste0, collapse = "")
d <- as_tibble(d)

### a function that takes more than a second to finish
long_f <- function(x1, x2){
  
  Sys.sleep(1)
  x1+x2
  
}

### multimutate!
multimutate <- function(.data, ..., .options = future_options()){
  
  dots <- enquos(..., .named = TRUE)
  .data[names(dots)] <- future_map(dots, ~rlang::eval_tidy(., data = .data, env = parent.frame()), .options = .options)
  .data
  
}


# no future strategy implemented
tictoc::tic()
d %>%
  multimutate(c1 = long_f(a1,b1), 
              c2 = long_f(a2,b2),
              c3 = long_f(a3,b3), 
              c4 = long_f(a4,b4))  
tictoc::toc()
# 4.34 sec elapsed

# future strategy
plan(multiprocess)
tictoc::tic()
d %>%
  multimutate(c1 = long_f(a1,b1), 
              c2 = long_f(a2,b2),
              c3 = long_f(a3,b3), 
              c4 = long_f(a4,b4),
              .options = future_options(globals = "long_f"))  
tictoc::toc()
# 1.59 sec elapsed

It needs some testing, I guess, and it could be improved, for example by supporting the same methods available for mutate. But it's a start.

Notice that I need to use future_options to pass long_f to the workers.
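A note for anyone running this with more recent package versions: as far as I know, furrr 0.2.0 replaced `future_options()` with `furrr_options()`, and the future package has deprecated `plan(multiprocess)` in favour of `plan(multisession)` (or `plan(multicore)` on Mac/Linux). Below is a minimal sketch of the same idea under those assumptions. It is not a tested drop-in replacement, and the data and `long_f` are the same toy stand-ins as above.

```r
library(dplyr)
library(furrr)  # assumption: furrr >= 0.2.0, which provides furrr_options()

# Toy data with columns a1, b1, a2, b2, a3, b3, a4, b4
d <- replicate(8, rnorm(100))
colnames(d) <- paste0(rep(c("a", "b"), times = 4), rep(1:4, each = 2))
d <- as_tibble(d)

# Stand-in for the slow function (about one second per call)
long_f <- function(x1, x2) {
  Sys.sleep(1)
  x1 + x2
}

# Evaluate each named expression as its own future and
# write the results back into .data under the given names
multimutate <- function(.data, ..., .options = furrr_options()) {
  dots <- enquos(..., .named = TRUE)
  .data[names(dots)] <- future_map(
    dots,
    ~ rlang::eval_tidy(.x, data = .data),
    .options = .options
  )
  .data
}

# Background R sessions; the default number of workers
# is the number of cores that future detects
plan(multisession)

res <- d %>%
  multimutate(c1 = long_f(a1, b1),
              c2 = long_f(a2, b2),
              c3 = long_f(a3, b3),
              c4 = long_f(a4, b4),
              .options = furrr_options(globals = "long_f"))

# Switch back to sequential evaluation and shut the workers down
plan(sequential)
```

The `globals` argument accepts a character vector, so any further helper functions that `long_f` depends on can be listed there by name. `furrr_options()` also has a `packages` argument for packages that should be attached on each worker, which may help when the function calls package functions without the `package::` prefix.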

Edo
  • This is amazing! :D Thanks. It does look like this might have compatibility issues with Windows and/or RStudio given the instability of forking, according to the warning, but it works perfectly for me since I always work on Mac or Linux machines. In any case, I wasn't aware of furrr and will definitely be playing around with it in the future. – GenesRus Sep 15 '20 at 19:13
  • I tried it on windows and with future_options it works. – Edo Sep 16 '20 at 05:31
  • Awesome! I must have misunderstood the warning. – GenesRus Sep 16 '20 at 22:43
  • It doesn't seem to know where to find functions imported from a sourced script or from packages (unless declared with package::function). Would I need to modify the eval_tidy env argument? – GenesRus Sep 17 '20 at 06:18
  • you need to pass it through future_options – Edo Sep 17 '20 at 07:48
  • [future_options](https://www.rdocumentation.org/packages/furrr/versions/0.1.0/topics/future_options) – Edo Sep 17 '20 at 07:52
  • Let me know if it's best to address this in a separate question, but shouldn't the global environment be loaded by default? I even tried to force it but it can't find my function with `.options = future_options(globals = structure(TRUE, add = "R_GlobalEnv"))` despite environment(long_f) outputting `<environment: R_GlobalEnv>`. Am I missing something obvious? – GenesRus Sep 17 '20 at 16:45
  • 1
    that's not how you are suppose to use future_options. too long to explain here – Edo Sep 17 '20 at 17:05