I have a dataset with a few numeric columns and over 100 million rows as a data.table object. I would like to perform grouped operations on some of the columns based on other columns, for example, count the unique elements of column "a" for each category in column "d":

my_data[, a_count := uniqueN(col_a), by = col_d]

I have many such operations, all independent of each other, and it would be great to run them in parallel. I found the following piece of code, which runs different functions in parallel:

# Each helper adds one grouped count by reference, then returns
# only the id column and the new count column.
fun1 = function(x){
  x[, a_count := uniqueN(col_a), by = col_d]
  return(x[, .(callId, a_count)])
}
fun2 = function(x){
  x[, b_count := uniqueN(col_b), by = col_d]
  return(x[, .(callId, b_count)])
}
fun3 = function(x){
  x[, c_count := uniqueN(col_c), by = col_d]
  return(x[, .(callId, c_count)])
}

tasks = list(job1 = fun1,
             job2 = fun2,
             job3 = fun3)

library(parallel)

cl = makeCluster(3)
clusterEvalQ(cl, library(data.table))  # load data.table on each worker
clusterExport(cl, c('fun1', 'fun2', 'fun3', 'my_data'))

out = clusterApply(
  cl,
  tasks,
  function(f) f(my_data)
)
stopCluster(cl)

How can I improve this solution? For example, it would be great to pass only the essential columns to each function instead of the entire table.
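
Something along these lines is what I have in mind, as a rough sketch. (The helper `count_by_group` and the `jobs` list are just names I made up for illustration; the idea is that each worker receives only the three columns its job actually needs.)

library(parallel)
library(data.table)

# Hypothetical helper: receives only callId, one value column, and col_d.
count_by_group = function(dt, val_col, out_name) {
  setDT(dt)  # restore over-allocation lost in serialization so := works quietly
  dt[, (out_name) := uniqueN(get(val_col)), by = col_d]
  dt[, c("callId", out_name), with = FALSE]
}

# One entry per job: output column name -> input column name.
jobs = list(a_count = "col_a", b_count = "col_b", c_count = "col_c")

# Build the small per-job subsets on the master ...
subsets = lapply(jobs, function(col)
  my_data[, c("callId", col, "col_d"), with = FALSE])

cl = makeCluster(3)
clusterEvalQ(cl, library(data.table))

# ... and ship each subset (not the full table) to a worker.
out = clusterMap(cl, count_by_group, subsets, jobs, names(jobs))
stopCluster(cl)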

– hm6
  • What's wrong with your current solution? – F. Privé May 30 '18 at 11:10
  • It passes the entire "my_data" dataframe to all functions, which causes memory limitations. One improvement would be to just pass the two essential columns to each function. – hm6 May 30 '18 at 14:41
  • If you use `FORK` clusters and don't modify the data, I think you don't make any copy. – F. Privé May 30 '18 at 15:51 (see the sketch below)
  • You can pass essential columns to each function so that a copy of the data is avoided. You could try using the `{` in `j` in a data.table, if that helps. – Ameya Mar 25 '22 at 01:17
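
A minimal sketch of the `FORK` route suggested in the comments (Unix-only; forked children inherit `my_data`, the helper functions, and the loaded data.table package from the master via copy-on-write, so nothing is serialized up front):

library(parallel)
library(data.table)

# mclapply forks the master process for each job (not available on Windows).
# my_data is shared copy-on-write: only the pages a child writes to (e.g. via
# := inside fun1..fun3) are duplicated, and only inside that child.
out = mclapply(tasks, function(f) f(my_data), mc.cores = 3)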

0 Answers