I need to speed up a for loop with multithreading, and I would like to use the foreach and doParallel packages for this. I have used these packages before, but only for jobs that needed a single result table. I don't know how to use them to export multiple tables (here, the two results tables). My actual problem is much more complex and requires exporting many result sets; here, for simplicity, I use the iris data.

library(randomForest)
library(caret)

results_class <- data.frame()
results_overall <- data.frame()

for(i in 1:50){
  trainIndex <- caret::createDataPartition(iris$Species, p = 0.5, list = FALSE)
  irisTrain <- iris[ trainIndex,]
  irisTest  <- iris[-trainIndex,]

  model <- randomForest(x = irisTrain[,c(1:4)], y = irisTrain[,5], importance = TRUE,
                        replace = TRUE, mtry = 4, ntree = 500, na.action=na.omit,
                        do.trace = 100, type = "classification")

  pred_test <- predict(model, irisTest[,c(1:4)])
  con.mat_test <- confusionMatrix(pred_test, irisTest[,5], mode ="everything")

  results_class <- rbind(results_class, con.mat_test[["byClass"]])
  results_overall <- rbind(results_overall, con.mat_test[["overall"]])

}
Nicolas
  • I know there is the argument `.multicombine` in conjunction with the single `.combine`. – Francesco Grossetti Apr 01 '20 at 13:19
  • First, make sure to update to foreach 1.5.0 (released 2020-03-30), because it makes both sequential and parallel processing evaluate the foreach expression in a `local()` environment. This further lowers the risk of mistakes and misunderstandings, especially the "hope" that assignments done inside the loop end up outside, which they cannot, will not, and should not. – HenrikB Apr 01 '20 at 20:11
  • Second, see my blog post 'Parallelize a For-Loop by Rewriting it as an Lapply Call' (https://www.jottr.org/2019/01/11/parallelize-a-for-loop-by-rewriting-it-as-an-lapply-call/) from 2019-01-11 on how to turn a for loop into a `y <- lapply(...)` call. Since `y <- foreach(...) %dopar% { ... }` is effectively just another flavor of `lapply()`, the gist and the take-home messages of that blog post apply here too. – HenrikB Apr 01 '20 at 20:13

1 Answer

As far as I know, it's not easy (or even possible) to modify variables outside of the foreach loop from inside it, so what about storing the multiple results of each iteration in one nested tibble?

library(randomForest)
library(caret)
library(foreach)
library(doParallel)

# Set up parallel computing
cl <- makeCluster(detectCores(logical = TRUE))
registerDoParallel(cl)

res <- foreach(i = 1:50, .packages = c("caret", "randomForest"), .combine = rbind) %dopar% {
    trainIndex <- caret::createDataPartition(iris$Species, p = 0.5, list = FALSE)
    irisTrain <- iris[ trainIndex,]
    irisTest  <- iris[-trainIndex,]

    model <- randomForest(x = irisTrain[,c(1:4)], y = irisTrain[,5], importance = TRUE,
                          replace = TRUE, mtry = 4, ntree = 500, na.action=na.omit,
                          do.trace = 100, type = "classification")

    pred_test <- predict(model, irisTest[,c(1:4)])
    con.mat_test <- confusionMatrix(pred_test, irisTest[,5], mode ="everything")

    # Store the row names in their own columns so they survive unnesting.
    # byClass row names look like "Class: setosa"; strip the "Class: " prefix.
    # overall row names are metric names ("Accuracy", "Kappa", ...); keep them whole.
    class <- data.frame(con.mat_test[["byClass"]])
    overall <- data.frame(con.mat_test[["overall"]])
    class$class <- sub("^Class: ", "", rownames(class))
    overall$metric <- rownames(overall)

    # Save output dataframe in tibble as list column
    return(tibble::tibble(iteration = i, 
                          class = list(class), 
                          overall = list(overall)))
}

# Stop the cluster
stopCluster(cl)
registerDoSEQ()

The output is then as follows:

> print(res)
# A tibble: 50 x 3
   iteration class              overall         
       <int> <list>             <list>          
 1         1 <df[,12] [3 x 12]> <df[,2] [7 x 2]>
 2         2 <df[,12] [3 x 12]> <df[,2] [7 x 2]>
 3         3 <df[,12] [3 x 12]> <df[,2] [7 x 2]>
 4         4 <df[,12] [3 x 12]> <df[,2] [7 x 2]>
 5         5 <df[,12] [3 x 12]> <df[,2] [7 x 2]>
 6         6 <df[,12] [3 x 12]> <df[,2] [7 x 2]>
 7         7 <df[,12] [3 x 12]> <df[,2] [7 x 2]>
 8         8 <df[,12] [3 x 12]> <df[,2] [7 x 2]>
 9         9 <df[,12] [3 x 12]> <df[,2] [7 x 2]>
10        10 <df[,12] [3 x 12]> <df[,2] [7 x 2]>
# ... with 40 more rows
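
To get back two flat tables like the `results_class` and `results_overall` from the question, the nested data frames can be stacked with the iteration number attached. The following is a minimal sketch in base R, not the only way to do it. The helper name `flatten` is hypothetical, and the small `res` built here is a stand-in with made-up numbers and illustrative column names (a plain data frame with list columns instead of the tibble), used only so the snippet runs on its own.

```r
# Stand-in for `res`: two iterations with made-up values (illustrative only).
make_class <- function() {
  data.frame(Sensitivity = c(1, 0.9, 0.95),
             class = c("setosa", "versicolor", "virginica"))
}
make_overall <- function() {
  data.frame(value = c(0.95, 0.92),
             metric = c("Accuracy", "Kappa"))
}
res <- data.frame(iteration = 1:2)
res$class   <- list(make_class(), make_class())
res$overall <- list(make_overall(), make_overall())

# Hypothetical helper: prepend the iteration number to each nested
# data frame, then stack all of them into one table.
flatten <- function(iteration, tables) {
  with_id <- Map(function(i, df) cbind(iteration = i, df), iteration, tables)
  stacked <- do.call(rbind, with_id)
  rownames(stacked) <- NULL
  stacked
}

results_class   <- flatten(res$iteration, res$class)
results_overall <- flatten(res$iteration, res$overall)
```

With the real `res` from above, the same two `flatten()` calls should apply unchanged, since the helper does not depend on the column names inside the nested data frames.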
koenniem
  • Thank you. How can I convert these results into the tables that I originally wanted? – Nicolas Apr 01 '20 at 14:24
  • I don't know what result you're looking for exactly, but you can get a list of the tables by using a simple subset such as `res$class`. To get one table, you can use `tidyr::unnest(res, class)` or `dplyr::bind_rows(res$class)`. – koenniem Apr 01 '20 at 14:31
  • I checked it more closely. After running `unnest_dataset <- tidyr::unnest(res, class)`, each iteration should have 3 results, but counting them with `plyr::count(unnest_dataset$iteration)` shows that after 4 iterations everything doubles. It turns out that there are 50 iterations but 1014 rows. The results start to double: the same iteration appears twice, then 4 times ... – Nicolas May 05 '20 at 23:08
  • Yes, you're right. That's because (for some reason) I added the result of each iteration to `results_overall` and `results_class`. Since the loop is spread over multiple workers, each result got stored together with the results of the previous iterations on that worker. The solution is simply not to combine them inside the loop. I have updated my answer to reflect this. – koenniem May 08 '20 at 11:55
  • Is it possible to keep the row names, so that I know which accuracy belongs to which class? – Nicolas May 10 '20 at 15:02
  • The row names are still there. Try `res$class[[1]]` and you'll see them. The problem is in the way `unnest` binds the new rows to the existing data: it leaves out the row names because they would otherwise be duplicated. One solution is to save the row names of the `class` and `overall` data frames in a separate column. I've updated my answer to reflect this. – koenniem May 13 '20 at 12:44
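
An alternative to the nested tibble, in the spirit of the `lapply()` comparison made in the comments on the question: when no `.combine` is given, `foreach` returns a plain list with one element per iteration, so each iteration can return a named list of tables that is split apart afterwards. The sketch below is an assumption-laden illustration, not the answer's method. It uses `%do%` so that it runs without a cluster, and it replaces the random forest with simple placeholder summaries of iris (per-species means and an overall mean); the names `res_list`, `by_class`, and `mean_sepal_length` are made up for the example.

```r
library(foreach)

# Each iteration returns a named list holding two tables.
# %do% runs sequentially; with a registered backend, %dopar% should be
# a drop-in replacement (plus .packages for any packages the body needs).
res_list <- foreach(i = 1:5) %do% {
  half <- iris[sample(nrow(iris), 75), ]

  # Placeholder "per-class" table: mean sepal length per species.
  by_class <- aggregate(Sepal.Length ~ Species, data = half, FUN = mean)

  # Placeholder "overall" table: a single summary value.
  overall <- data.frame(metric = "mean_sepal_length",
                        value = mean(half$Sepal.Length))

  list(class   = cbind(iteration = i, by_class),
       overall = cbind(iteration = i, overall))
}

# Pull out each kind of table and stack it across iterations.
results_class   <- do.call(rbind, lapply(res_list, `[[`, "class"))
results_overall <- do.call(rbind, lapply(res_list, `[[`, "overall"))
```

Because nothing is assigned to outside variables inside the loop body, the result does not depend on which worker ran which iteration.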