From the data.table package website, given that:

"many common operations are internally parallelized to use multiple CPU threads"

  • I would like to know whether that is also the case when Map() is used within a data.table.

I ask because, when running the same operation on a large dataset (cor.test(x, y), with x = .SD and y a single column of the dataset), the version using Map() runs faster than the one using furrr::future_map2().
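
For concreteness, the Map() variant described above might look roughly like the sketch below. The data, the column names (a, b, c, y) and the choice to extract only the p-value are assumptions made for illustration; they are not taken from the real dataset.

```r
library(data.table)

set.seed(1)
# Hypothetical data: three predictor columns and one outcome column `y`
dt <- data.table(a = rnorm(100), b = rnorm(100), c = rnorm(100), y = rnorm(100))

# Run cor.test() for each column of .SD against the single column `y`.
# list(y) has length one, so Map() recycles it for every column of .SD.
# Map() processes the columns one after another in the calling R session.
res <- dt[, Map(function(x, y) cor.test(x, y)$p.value, .SD, list(y)),
          .SDcols = c("a", "b", "c")]
res
```

The result is a one-row data.table with one p-value per column listed in .SDcols.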

Buzz B
  • `furrr::future_map2` incurs the overhead of transferring data between processes; if the data is large, that transfer takes a not-insignificant amount of time relative to the computation itself and can cancel out the gain from running in parallel. To answer your question, `Map` is not parallelized. – r2evans Aug 29 '21 at 10:01
  • @r2evans Many thanks, I thought it was due to the overhead, but just wanted to make sure I wasn't making an oversight. – Buzz B Aug 29 '21 at 11:26
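
To get a rough feel for the transfer overhead mentioned in the comments above, one can time how long it takes just to serialize and restore a large object in base R. This is only a sketch: the matrix `x` is a hypothetical stand-in for the real dataset, and it is an assumption that this round trip approximates what a multi-process backend has to do before a worker can start computing.

```r
set.seed(1)
# Hypothetical large object standing in for the real dataset (about 16 MB of doubles)
x <- matrix(rnorm(2e6), ncol = 20)

# Convert the object to raw bytes, as is needed before shipping it to another R process
t_ser <- system.time(bytes <- serialize(x, connection = NULL))

# Restore it again, as the receiving process would have to do
t_unser <- system.time(x2 <- unserialize(bytes))

length(bytes) / 1024^2                        # payload size in MiB
t_ser[["elapsed"]] + t_unser[["elapsed"]]     # round-trip cost in seconds
```

If this round-trip cost is comparable to the time each task spends computing, a parallel run can easily come out slower than a plain Map().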

1 Answer

You can take an exploratory approach and check whether the elapsed time shrinks when more threads are used. Note that on my machine the maximum number of usable threads is just one, so no difference can show up in the timings below.

library(data.table)

dt <- data.table::data.table(a = 1:3,
                             b = 4:6)
dt
#>    a b
#> 1: 1 4
#> 2: 2 5
#> 3: 3 6

data.table::getDTthreads()
#> [1] 1

# No parallelisation ---------------------------------
data.table::setDTthreads(1)
system.time({
  
  dt[, lapply(.SD,
              function(x) {
                Sys.sleep(2)
                x}
  )
  ]
})
#>    user  system elapsed 
#>   0.009   0.001   4.017

# Parallel -------------------------------------------
# use multiple threads
data.table::setDTthreads(2)
data.table::getDTthreads()
#> [1] 1

# if parallel, elapsed should be below 4
system.time({
  
  dt[, lapply(.SD,
              function(x) {
                Sys.sleep(2)
                x}
  )
  ]
})
#>    user  system elapsed 
#>   0.001   0.000   4.007

# Map -----------------------------------------------
# if parallel, elapsed should be below 4
system.time({
  
  dt[, Map(f = function(x, y) {
    Sys.sleep(2)
    x},
    .SD,
    1:2
    
  )
  ]
})
#>    user  system elapsed 
#>   0.002   0.000   4.005
mnist
  • Thank you, this is a really useful approach and I will definitely use it to interrogate my work going forward. I only have 2 physical cores to test with, so I could not thoroughly explore the time gain (if any). – Buzz B Aug 29 '21 at 11:29
  • I'll "accept" the answer to close the question. The explicit answer is in the comment by @r2evans on the original post. – Buzz B Aug 30 '21 at 12:31