I am trying to compute the Gower distance between observations of a single dataset. I found a useful function that does this with parallel computation at the following link: https://www.kaggle.com/olivermeyfarth/parallel-computation-of-gower-distance?scriptVersionId=46656

The code is below:

computeDistance <- function(dt1, dt2, nThreads = 4) {
  # Determine chunk-size to be processed by different threads
  s <- floor(nrow(dt1) / nThreads)

  # Setup multi-threading
  modelRunner <- makeCluster(nThreads)
  registerDoParallel(modelRunner)

  # For numeric variables, build ranges (max - min) to be used in the Gower distance.
  # Ensure that the ranges are computed on the overall data and not on
  # the chunks in the parallel threads. Also, note that function 'gower.dist()'
  # seems to be buggy with regard to missing values (NA), which can be fixed by
  # providing ranges for all numeric variables in the function call

  dt <- rbind(dt1, dt2)
  rngs <- rep(NA, ncol(dt))
  for (i in seq_len(ncol(dt))) {
    col <- dt[[i]]
    if (is.numeric(col)) {
      rngs[i] <- max(col, na.rm = TRUE) - min(col, na.rm = TRUE)
    }
  }

  # Compute distance in parallel threads; note that you have to include packages
  # which must be available in the different threads
  distanceMatrix <-
    foreach(
      i = 1:nThreads, .packages = c("StatMatch"), .combine = "rbind",
      .export = "computeDistance", .inorder = TRUE
    ) %dopar% {
      # Compute chunks
      from <- (i - 1) * s + 1
      to <- i * s
      if (i == nThreads) {
        to <- nrow(dt1)
      }

      # Compute distance-matrix for each chunk
      # distanceMatrix <- daisy(dt1[from:to,],metric = "gower")
      distanceMatrix <- gower.dist(dt1[from:to,], dt2, rngs = rngs)
    }

  # Clean-up
  stopCluster(modelRunner)
  return(distanceMatrix)
}    

However, when I try to run the code with my dataset (data below is just a simple example) as follows:

Distance_data <- data.frame(cbind(c(4,234,6,1),c(4,1,6,4),c(3,75,23,1)))
distances <- computeDistance(Distance_data,Distance_data)  

I receive the following error in R:

Error in e$fun(obj, substitute(ex), parent.frame(), e$data) : 
worker initialization failed: package ‘clue’ could not be loaded

I have tried adding the clue package to the .packages parameter in the foreach function, but that was unsuccessful. Any help is appreciated!

Note: the following packages need to be installed and loaded in order to run the function:

library(StatMatch)
library(doParallel) 
AyeTown
  • You would gain more by rewriting this using Rcpp than by trying to parallelize it. – F. Privé Aug 13 '18 at 19:44
  • @ACE I did not have an issue with your code by the way - it ran successfully for me – CPak Aug 13 '18 at 20:27
  • @CPak It seems to be working fine on my local machine, but I get that error when I run the code on RStudio Server – AyeTown Aug 14 '18 at 14:27
  • Are you certain that `StatMatch` has been installed on RStudio Server? – CPak Aug 14 '18 at 15:12
  • I was able to find the solution at the following link: https://stackoverflow.com/questions/31305858/r-foreach-loop-package-load-fails – AyeTown Aug 14 '18 at 15:33
  • Yes, it was installed on the server. I just had to tell the workers to use the same library paths (`.libPaths()`) as the master. – AyeTown Aug 14 '18 at 15:34
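
For reference, the fix described in the last comment can be sketched roughly as follows. This is only a minimal sketch of the approach from the linked answer, assuming a socket cluster created with `makeCluster()` from the parallel package; the names `cl` and `workerPaths` are illustrative and do not appear in the original code.

```r
library(parallel)

# Start a small cluster (2 workers here, purely for illustration).
cl <- makeCluster(2)

# Send the master's library search path to every worker. The wrapper
# function is needed so that each worker calls its own .libPaths().
clusterCall(cl, function(x) .libPaths(x), .libPaths())

# Optional check: ask each worker which library paths it now uses.
workerPaths <- clusterEvalQ(cl, .libPaths())

stopCluster(cl)
```

In `computeDistance()` the equivalent call would presumably go right after `makeCluster(nThreads)`, with `modelRunner` in place of `cl`, so that the workers can find `StatMatch` and its dependencies (such as `clue`) in the same locations as the master session.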
