
Let’s say I have two large data.tables and need to combine their columns pairwise using the & operation. The combinations are dictated by grid (combine dt1 column1 with dt2 column2, etc.)

Right now I'm using an mclapply loop, and the script takes hours when I run the full dataset. I tried converting the data to a matrix and using a vectorized approach, but that took even longer. Is there a faster and/or more elegant way to do this?

library(data.table)

mx1 <- replicate(10, sample(c(TRUE, FALSE), size = 1e6, replace = TRUE)) # 1e6 rows x 10 columns
mx1 <- as.data.table(mx1)
colnames(mx1) <- LETTERS[1:10]

mx2 <- replicate(10, sample(c(TRUE, FALSE), size = 1e6, replace = TRUE)) # 1e6 rows x 10 columns
mx2 <- as.data.table(mx2)
colnames(mx2) <- letters[1:10]

# the combinations I want to evaluate (kept as character so columns are looked up by name, not by factor code)
grid <- expand.grid(col1 = colnames(mx1), col2 = colnames(mx2), stringsAsFactors = FALSE)

out <- mapply(grid$col1, grid$col2, FUN = function(col1, col2) { # <--- mclapply loop
    mx1[[col1]] & mx2[[col2]]
  }, SIMPLIFY = FALSE)

setDT(out) # convert output into data table
setnames(out, paste(grid$col1, grid$col2, sep = "_"))

For context, this data is from a gene expression matrix in which each row is one cell.

Jeff Bezos

2 Answers


This can be done directly, without mapply. Just make sure the with argument is FALSE, i.e.:

 mx1[, grid$col1, with = FALSE] & mx2[, grid$col2, with=FALSE]
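
A minimal sketch of the expression above on toy data, in case it helps to see the shape of the result. The names dt1, dt2, and res are made up for illustration, and stringsAsFactors = FALSE is an added assumption so that the grid holds plain column names. Note that & applied to two data.tables goes through the data.frame operator method and returns a logical matrix, so the sketch names the columns and converts back explicitly.

```r
library(data.table)

# Toy stand-ins for the two large tables (hypothetical names and values).
dt1 <- data.table(A = c(TRUE, FALSE, TRUE), B = c(FALSE, FALSE, TRUE))
dt2 <- data.table(a = c(TRUE, TRUE, FALSE), b = c(FALSE, TRUE, TRUE))

# All pairwise combinations, stored as character rather than factor.
grid <- expand.grid(col1 = names(dt1), col2 = names(dt2),
                    stringsAsFactors = FALSE)

# One vectorized call: column k of the left table is combined with
# column k of the right table. The result is a logical matrix.
res <- dt1[, grid$col1, with = FALSE] & dt2[, grid$col2, with = FALSE]

# Give each combination a unique name, then convert to a data.table.
colnames(res) <- paste(grid$col1, grid$col2, sep = "_")
res <- as.data.table(res)
```

Each column of res should match the corresponding single-pair result, for example res$A_a against dt1$A & dt2$a.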
Onyambu

After some digging around I found a package called bit that is specifically designed for fast boolean operations. Converting each column of my data.table from logical to bit gave me a 100-fold increase in compute speed.

# Load libraries.
library(data.table)
library(bit)

# Create data set.
mx1 <- replicate(10, sample(c(TRUE, FALSE), size = 5e6, replace = TRUE)) # 5e6 rows x 10 columns
colnames(mx1) <- LETTERS[1:10]

mx2 <- replicate(10, sample(c(TRUE, FALSE), size = 5e6, replace = TRUE)) # 5e6 rows x 10 columns
colnames(mx2) <- letters[1:10]

# combinations I want to evaluate (character, not factor, so columns are selected by name)
grid <- expand.grid(col1 = colnames(mx1), col2 = colnames(mx2), stringsAsFactors = FALSE)

# Single operation with logical matrix.
system.time({
  out <- mx1[, grid$col1] & mx2[, grid$col2]
}) # 26.014s

# Loop with logical matrix.
system.time({
  out <- mapply(grid$col1, grid$col2, FUN = function(col1, col2) {
    mx1[, col1] & mx2[, col2]
  })
}) # 31.914s

# Single operation with logical data.table.
mx1.dt <- as.data.table(mx1)
mx2.dt <- as.data.table(mx2)
system.time({
  out <- mx1.dt[, grid$col1, with = FALSE] & mx2.dt[, grid$col2, with = FALSE]
}) # 32.349s

# Loop with logical data.table.
system.time({
  out <- mapply(grid$col1, grid$col2, FUN = function(col1, col2) {
    mx1.dt[[col1]] & mx2.dt[[col2]]
  })
}) # 15.031s <---- SECOND FASTEST TIME, ~2X IMPROVEMENT

# Loop with bit data.table.
mx1.bit <- mx1.dt[, lapply(.SD, as.bit)]
mx2.bit <- mx2.dt[, lapply(.SD, as.bit)]
system.time({
  out <- mapply(grid$col1, grid$col2, FUN = function(col1, col2) {
    mx1.bit[[col1]] & mx2.bit[[col2]]
  }, SIMPLIFY = FALSE) # keep the result as a list of bit vectors, which setDT() below expects
}) # 0.383s <---- FASTEST TIME, ~100X IMPROVEMENT

# Convert back to logical table.
setDT(out) # converts the list to a data.table by reference
setnames(out, paste(grid$col1, grid$col2, sep = "_"))
out <- out[, lapply(.SD, as.logical)]

There are also special functions like sum.bit and ri that you can use to aggregate data without converting it back to logical.
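
As a rough illustration of aggregating without converting back, here is a small sketch that counts the TRUE positions of a combined bit vector with sum(). The vectors x and y are made-up stand-ins for one column from each table; ri (range restriction) is not shown, since its exact usage is best checked against the bit documentation.

```r
library(bit)

# Two small logical vectors standing in for one column from each table.
x <- c(TRUE, TRUE, FALSE, TRUE, FALSE)
y <- c(TRUE, FALSE, FALSE, TRUE, TRUE)

# Pack them into bit vectors and combine them while still packed.
both <- as.bit(x) & as.bit(y)

# sum() on a bit vector counts the TRUE positions directly.
n_both <- sum(both)

# Sanity check against the plain logical computation.
stopifnot(n_both == sum(x & y))
```

On the real data, the same call could be applied to each combined column to get per-combination counts of cells where both conditions hold.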

Jeff Bezos