
I have a dataframe with 40 rows and ~40000 columns. The 40 rows are split into group "A" and group "B" (20 each). For each column, I would like to apply a statistical test (wilcox.test()) comparing the two groups. I started using a for loop to run through the 40000 columns but it was very slow.

Minimal Reproducible Example (MRE):

library(tidyverse)

set.seed(123)

metrics <- paste("metric_", 1:40000, sep = "")
patient_IDs <- paste("patientID_", 1:40, sep = "")
m <- matrix(sample(1:20, 1600000, replace = TRUE), ncol = 40000, nrow = 40,
            dimnames=list(patient_IDs, metrics))
test_data <- as.data.frame(m)
test_data$group <- c(rep("A", 20), rep("B", 20))

# Collate list of metrics to analyse ("check") for significance
list_to_check <- colnames(test_data)[1:40000]

Original 'loop' method (this is what I want to vectorise):

# Create a variable to store the results
results_A_vs_B <- c()

# Loop through the "list_to_check" and,
# for each 'value', compare group A with group B
# and load the results into the "results_A_vs_B" variable
for (i in list_to_check) {
  outcome <- wilcox.test(test_data[test_data$group == "A", ][[i]],
                         test_data[test_data$group == "B", ][[i]],
                         exact = FALSE)
  if (!is.nan(outcome$p.value) && outcome$p.value <= 0.05) {
    results_A_vs_B[i] <- paste(outcome$p.value, sep = "\t")
  }
}

# Format the results into a dataframe
summarised_results_A_vs_B <- as.data.frame(results_A_vs_B) %>%
  rownames_to_column(var = "A vs B") %>%
  rename("Wilcox Test P-value" = "results_A_vs_B")

Benchmarking the answers so far:

# Ronak Shah's "Map" approach
Map_func <- function(dataset, list_to_check) {
  tmp <- split(dataset[list_to_check], dataset$group)
  stack(Map(function(x, y) wilcox.test(x, y, exact = FALSE)$p.value, tmp[[1]], tmp[[2]]))
}

# @Onyambu's data.table method
dt_func <- function(dataset, list_to_check) {
  # Reshape to long, cast to one A and one B column per metric, then test per metric
  melt(setDT(dataset), measure.vars = list_to_check)[
    , dcast(.SD, rowid(group) + variable ~ group)][
    , wilcox.test(A, B, exact = FALSE)$p.value, variable]
}

# @Park's dplyr method (with some minor tweaks)
dplyr_func <- function(dataset, list_to_check){
  dataset %>%
    summarise(across(all_of(list_to_check),
                     ~ wilcox.test(.x ~ group, exact = FALSE)$p.value)) %>%
    pivot_longer(cols = everything(),
                 names_to = "Metrics",
                 values_to = "Wilcox Test P-value")
}
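
Unlike the original loop, each of these functions returns a p-value for every metric rather than only those at or below 0.05; if needed, the output can be filtered afterwards. A minimal sketch using the Map approach's output (the `values`/`ind` column names come from `stack()`):

res <- Map_func(test_data, list_to_check)
# Keep only the metrics with p <= 0.05, mirroring the original loop's filter
significant <- res[!is.nan(res$values) & res$values <= 0.05, ]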

library(microbenchmark)
res_map <- microbenchmark(Map_func(test_data, list_to_check), times = 10)
res_dplyr <- microbenchmark(dplyr_func(test_data, list_to_check), times = 2)

library(data.table)
res_dt <- microbenchmark(dt_func(test_data, list_to_check), times = 10)

autoplot(rbind(res_map, res_dt, res_dplyr))

[Plot: benchmark of the Map, data.table and dplyr approaches]

# Excluding dplyr
autoplot(rbind(res_map, res_dt))

[Plot: benchmark of the Map and data.table approaches (dplyr excluded)]

--

Running the code on a server took a couple of seconds longer but the difference between Map and data.table was more pronounced (laptop = 4 cores, server = 8 cores):

autoplot(rbind(res_map, res_dt))

[Plot: benchmark of the Map and data.table approaches on the server]

jared_mamrot
  • 37 thousand is a small dataset that R would compute in less than a second. You won't feel the difference – Onyambu Aug 26 '21 at 03:58
  • Why would it take 15 seconds?? Should take a second! – Onyambu Aug 26 '21 at 04:14
  • What are the dimensions of your dataset? I simulated a (40, 40000) dataset and, with the Map approach, it computes immediately – Onyambu Aug 26 '21 at 04:18
  • It takes 0 seconds on my end. I am curious as to why it should take that long on your end. – Onyambu Aug 26 '21 at 04:21
  • Thanks for your interest @Onyambu - I have updated my question with a better MRE and hopefully this will make it easier to figure out the best/fastest approach to take – jared_mamrot Aug 26 '21 at 06:24
  • Could you try `melt(setDT(dataset), measure.vars = list_to_check)[, dcast(.SD, rowid(group) + variable~group)][, wilcox.test(A, B, exact = FALSE)$p.value, variable]` and see whether there would be any difference – Onyambu Aug 26 '21 at 16:27
  • Thanks @Onyambu - I updated the benchmarking and also checked it on a different system - data.table is faster :) (Can you please turn your comment into an answer?) – jared_mamrot Aug 27 '21 at 01:02

2 Answers


Here is another option -

Map_approach <- function(dataset, list_to_check) {
  tmp <- split(dataset[list_to_check], dataset$group)
  stack(Map(function(x, y) wilcox.test(x, y)$p.value, tmp[[1]], tmp[[2]]))
}

Map_approach(data_subset, list_to_check)

#        values     ind
#1 5.359791e-05 value_1
#2 5.499685e-08 value_2
#3 1.503951e-06 value_3
#4 6.179352e-08 value_4
#5 5.885650e-08 value_5

Testing it on a larger sample, Map is slightly faster than the for loop.

n <- 1e6
data_subset <- data.frame(patient_ID = 1:n,
                          group = c(rep("A", n/2),
                                    rep("B", n/2)),
                          value_1 = c(sample(1:10, n/2, replace = TRUE),
                                      sample(5:15, n/2, replace = TRUE)),
                          value_2 = c(sample(1:5, n/2, replace = TRUE),
                                      sample(15:n/2, n/2, replace = TRUE)),
                          value_3 = c(sample(1:12, n/2, replace = TRUE),
                                      sample(8:17, n/2, replace = TRUE)),
                          value_4 = c(sample(5:10, n/2, replace = TRUE),
                                      sample(15:25, n/2, replace = TRUE)),
                          value_5 = c(sample(20:40, n/2, replace = TRUE),
                                      sample(10:15, n/2, replace = TRUE)))
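
(For reference, `wilcox_loop` is not defined in this post; presumably it wraps the question's original for loop in a function, and `list_to_check` here presumably refers to the five `value_` columns of `data_subset`. A minimal sketch under those assumptions:)

# Assumed: the metrics to check are the five value columns of data_subset
list_to_check <- paste0("value_", 1:5)

# Presumed wilcox_loop: the question's original for loop wrapped in a function
# that returns a named vector of p-values
wilcox_loop <- function(dataset, list_to_check) {
  results <- c()
  for (i in list_to_check) {
    outcome <- wilcox.test(dataset[dataset$group == "A", ][[i]],
                           dataset[dataset$group == "B", ][[i]],
                           exact = FALSE)
    results[i] <- outcome$p.value
  }
  results
}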



microbenchmark::microbenchmark(loop = wilcox_loop(data_subset, list_to_check), 
                               Map = Map_approach(data_subset, list_to_check))

#Unit: seconds
# expr      min       lq     mean   median       uq      max neval cld
# loop 5.254573 5.631162 5.788624 5.734480 5.920424 6.756319   100   b
#  Map 4.710790 5.084783 5.201711 5.160722 5.309048 5.721540   100  a 
Ronak Shah

Could you try this code? It's slightly faster on my computer.

wilcox_loop2 <- function(data_subset, list_to_check){
  # Split the data into the two groups once, rather than subsetting inside the loop
  A <- data_subset[data_subset$group == "A", ]
  B <- data_subset[data_subset$group == "B", ]
  # Test each metric and collect the p-values
  outcome <- sapply(list_to_check, function(x) wilcox.test(A[[x]],
                                                           B[[x]],
                                                           exact = FALSE)$p.value)
  as.data.frame(outcome) %>%
    rownames_to_column(var = "A vs B") %>%
    rename("Wilcox Test P-value" = "outcome")
}
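
A minimal usage sketch against the question's MRE data (not part of the original answer; `test_data` and `list_to_check` as defined in the question):

summarised <- wilcox_loop2(test_data, list_to_check)
head(summarised)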

I'm not sure whether it's OK to split the data into A and B like this...

The timings on my system are as follows:

microbenchmark::microbenchmark(origin = wilcox_loop(data_subset, list_to_check),
                               test = wilcox_loop2(data_subset, list_to_check))

Unit: milliseconds
   expr      min       lq     mean   median       uq     max neval cld
 origin 4.815601 5.006951 6.490757 5.385502 6.790752 21.5876   100   b
   test 3.817801 4.116151 5.146963 4.330500 4.870651 15.8271   100  a 
Park