What is the fastest way to perform an exhaustive search in R

Question

I am implementing a version of the Very Large Scale Relieff algorithm detailed here.

Simply put, Very Large Scale Relieff split the set of features N into several random subsets Ns where Ns << N. Then it calculates the Relieff weights for the features in the subset Ns. For each feature, the final weight will be the highest weight assigned among the different subsets were that particular feature appear.

I have ~80000 features for ~100 subjects. I can calculate 10000 subsets of 8000 features each in a reasonable amount of time (~5 minutes running on 25 cores) with the following code (that is scaled down to 100 features in order to be easier to profile):

library(tidyverse)
library(magrittr)
library(CORElearn)
library(doParallel)

#create fake data for example
fake_table <- matrix(rnorm(100*100), ncol = 100) %>%
  as_tibble()
outcome <- rnorm(100)
#create fake data for example

#VLSRelieff code
start_time <- Sys.time()
myCluster <- makeCluster(25, # number of cores to use
                         type = "FORK")
registerDoParallel(myCluster)
result <- foreach(x = seq(1,10000)) %dopar% {
  #set seed for results consistency among different run
  set.seed(x)
  #subsample the features table by extracting a subset of columns
  subset_index <- sample(seq(1,ncol(fake_table)),size = round(ncol(fake_table)*.01))
  subset_matrix <- fake_table[,subset_index]
  #assign the outcome as last column of the subset
  subset_matrix[,ncol(subset_matrix)+1] <- outcome
  #use the function attrEval from the CORElearn package to calculate the Relieff weights for the subset
  rf_weights <- attrEval(formula = ncol(subset_matrix), subset_matrix, estimator = "RReliefFequalK")
  #create a data frame with as many columns as features in the subset and only one row 
  #with the Relieff weigths
  rf_df <- rf_weights %>%
    unname() %>%
    matrix(., ncol = length(.), byrow = TRUE) %>%
    as_tibble() %>%
    set_colnames(., names(rf_weights))}
end_time <- Sys.time()
end_time - start_time

However, the code above does only half of the work: the other half is, for each feature, to go into the results of the different repetitions and find the maximum value obtained. I have managed to write a working code, but it is outrageously slow (I let it run for 2 hours before stopping it, although it worked on testing with fewer features - again, here it is scaled down to 100 features and should run in ~7 seconds):

start_time <- Sys.time()
myCluster <- makeCluster(25, # number of cores to use
                         type = "FORK")
registerDoParallel(myCluster)
#get all features name
feat_names <- colnames(fake_table)
#initalize an empty vector of zeros, with the names of the features
feat_wegiths <- rep(0, length(feat_names))
names(feat_wegiths) <- feat_names
#loop in parallel on the features name, for each feature name
feat_weight_foreach <- foreach(feat = feat_names, .combine = 'c') %dopar% {
  #initalize the weight as 0
  current_weigth <- 0
  #for all element in result (i.e. repetitions of the subsampling process)
  for (el in 1:length(result)){
    #assign new weight accessing the table
    new_weigth <- result[[el]][[1,feat]]
    #skip is empty (i.e. the features is not present in the current subset)
    if(is_empty(new_weigth)){next}
    #if new weight is higher than current weight assign the value to current weight
    if (current_weigth < new_weigth){
      current_weigth <- new_weigth}}
  current_weigth
}
end_time <- Sys.time()
end_time - start_time

Did you consider switching from `dplyr` to [data.table](https://rdatatable.gitlab.io/data.table/)? [Here](https://h2oai.github.io/db-benchmark/) you can find a benchmark. — ismirsehregal, May 20 '21 at 08:26
@ismirsehregal, indeed I did, but I have to admit that I did not follow through since in the second snippet I do not use any dplyr features (except of course the `tibble`), and as such I thought that would not make much difference. Do you think it would ? — fednem, May 20 '21 at 13:03
I haven't looked at your code in detail yet. But I'm pretty sure there will be a faster data.table way. Maybe you can avoid the for-loop. — ismirsehregal, May 20 '21 at 18:51
Can you edit the data to make it scalable? Mainly, instead of having an example that takes forever to run, it's normally better to target 2 seconds and then have it easily scalable up based on a different variable assignment. — Cole, May 26 '21 at 00:58
@Cole, I have scaled it down and I have changed `test_matrix_reduced` to `fake_table`. It should work now. — fednem, May 26 '21 at 13:10

score 2 · Answer 1 · answered May 20 '21 at 13:38

2

If I understood what you are trying to do correctly, then the answer is simpler than you think.

Correct me if I'm wrong, but you are trying to get the max value obtained from attrEval per feature? if so, then why not just bind all results in one dataframe (or data.table), and then get the max per column like so:

allResults <- result %>% data.table::rbindlist(fill = TRUE)
apply(allResults, 2, max, na.rm=TRUE)

answered May 20 '21 at 13:38

DS_UNI

2,600
2
11
22

Thank for your answer. I thought about that, but the point is that either using `.combine = bind_rows` or using `bind_rows` on the list resulting from the `foreach` loop take so much time that the overall speed is not really improved. – fednem May 20 '21 at 14:31
even with data.table? that's interesting. I thought it should be at least fairly faster than the inner for loop – DS_UNI May 21 '21 at 06:53
I'll try to implement your solution and see ! – fednem May 25 '21 at 07:58

Cole · Accepted Answer · 2021-05-27T01:37:11.597

This follows @DS_UNI's idea, but instead of binding a list, the approach is to create a matrix from the initial loop. That is, a list of tibbles makes us do extra work. Instead, we have every thing we need to make a matrix:

library(tidyverse)
library(magrittr)
library(CORElearn)
library(doParallel)

nr = 50L
nc = 200L

## generate data
set.seed(123)
mat = matrix(rnorm(nr * nc), ncol = nc, dimnames = list(NULL, paste0('V', seq_len(nc))))
outcome = rnorm(nr) 

## constants for sampling
n_reps = nc
nc_sample_size = round(nc * 0.01)

## pre-allocate result
res = matrix(0, nrow = n_reps, ncol = ncol(mat), dimnames = dimnames(mat))

st = Sys.time()
for (i in seq_len(n_reps)) {
  set.seed(i) 
  
  ## similar way to do data simulations as OP
  sub_cols = sample(seq_len(nc), nc_sample_size)
  sub_mat = cbind(mat[, sub_cols], outcome)
  rf_weights = attrEval(formula = ncol(sub_mat), as.data.frame(sub_mat), estimator = 'RReliefFequalK')
  
  ## assign back to pre-allocated result
  res[i, sub_cols] = rf_weights
}

## get max of each column
apply(res, 2L, max)
et = Sys.time()
et - st

The downsides is that this loses the parallel workers. The upside is that we have less memory slowdowns because we're allocating much of what we need up front.

This produces identical results. For 50 rows by 200 cols, on single core it was 0.7 s vs. 7.8 seconds. For 100 rows by 800 cols, it was 6.6 seconds vs. 72 seconds. — Cole, May 27 '21 at 01:36
I'm not sure about what you mean when you write "this produce identical results". — fednem, May 27 '21 at 07:41
The line `apply(res, 2,max)` is the same result as `feat_weight_foreach` other than my solution is a named vector. — Cole, May 27 '21 at 11:15

score 0 · Answer 3 · answered May 21 '21 at 07:57

This is not a final answer, but I would suggest, since it is a numerical problem, to write a function in C++. This will increase the speed significantly, by some order of magnitude I would guess. In my oppinion, using R for this very specific numercial task is just hitting a brick wall.

The first chapter of Rcpp for everyone says:

Chapter 1 Suitable situations to use Rcpp R is weak in some kinds of operations. If you need operations listed below, it is time to consider using Rcpp.

Loop operations in which later iterations depend on previous iterations.
Accessing each element of a vector/matrix.
Recurrent function calls within loops.
Changing the size of vectors dynamically.
Operations that need advanced data structures and algorithms.

Wickham's Advanced R has a good chapter on that topic too.

What is the fastest way to perform an exhaustive search in R

3 Answers3