
I would like to speed up a distance calculation. I have already put effort into parallelizing it. Unfortunately it still takes longer than an hour.

Basically, the distance between vectors i and j is computed via the Manhattan distance. The distances between the possible values of the vector entries are given in the matrix Vardist: Vardist[i[1],j[1]] is the distance between the two values i[1] and j[1] (the matrix is indexed by the characters in i[1] and j[1], respectively).

There is one more important addition to the distance computation: the distance between vectors i and j is the minimum over the Manhattan distances between vector i and every possible permutation of vector j. This is what makes it computationally heavy the way it is programmed.

I have 1000 objects to compare with one another, and each object is a vector of length 5, so there are 120 permutations for each vector.
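To illustrate the definition on toy data (made-up values, not my real Vardist or recodedData):

    library(combinat)
    # toy value-distance matrix, indexed by the character values "a", "b", "c"
    Vardist <- matrix(c(0, 1, 2,
                        1, 0, 1,
                        2, 1, 0),
                      nrow = 3,
                      dimnames = list(c("a", "b", "c"), c("a", "b", "c")))
    vi <- c("a", "b", "c")
    vj <- c("c", "a", "b")
    # distance(i, j) = minimum over all permutations p of vj
    #                  of the summed value distances Vardist[vi[k], p[k]]
    min(sapply(permn(vj), function(p) sum(Vardist[cbind(vi, p)])))

My current implementation: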

    distMatrix <- foreach(i = 1:samplesize,
      .combine = cbind,
      .options.snow=opts,
      .packages = c("combinat"))  %dopar%
      {
      # initializing the distance vector for customer i
      dist <- rep(0,samplesize)
      # get values of customer i
      ValuesCi <- as.matrix(recodedData[i,])
      # remove unnecessary entries in the value distance matrix
      mVardist <- Vardist[ValuesCi,]

      for(j in i:samplesize){
        # distance between vector i and all permutations of vector j is computed
        # minimum of above all distances is taken as distance between vector i and j
        dist[j] <- min(unlist(permn(recodedData[j,],
                       function(x){
                         pdist <- 0
                         # nvariables is length of each vector
                         for(i in 1:nvariables){
                           pdist <- pdist + mVardist[i, as.matrix(x)[i]]
                         }
                         return(pdist)
                       })))


      }
      dist
      }

Any tips or suggestions are greatly appreciated!

  • How long are your vectors? Probably the next step is to use `Rcpp` for finding the minimum distance among all permutations of `j`... – Gregor Thomas Mar 19 '17 at 16:21
  • Profiling will help you direct your optimisation efforts. For instance using the `profvis` package (embedded in RStudio if you're using it) – Aurèle Mar 19 '17 at 16:27
  • I notice that you use i both in your outer `foreach` loop and also in your anonymous function. This is probably not a good idea. – G5W Mar 19 '17 at 16:32

1 Answer


Oh yes, this code is going to take a while. The basic reason is that you use explicit indexing. Even parallelizing will not help.

Okay, there are several options you can use.

(1) Use dist (from the stats package, which is loaded by default); give it a matrix and it will compute the distances between the rows of the matrix.
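For example, a minimal sketch on random data:

    m <- matrix(rnorm(50), nrow = 10)    # 10 objects, 5 variables
    d <- dist(m, method = "manhattan")   # all pairwise row distances, as a "dist" object
    as.matrix(d)[1:3, 1:3]               # corner of the full symmetric distance matrix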

(2) Use one of the clustering packages, e.g. flexclust, which offers some further options.
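For cross-distances between the rows of two different matrices, flexclust has dist2(); a sketch using its default (euclidean) method:

    library(flexclust)
    x <- matrix(rnorm(20), nrow = 4)
    y <- matrix(rnorm(15), nrow = 3)
    dist2(x, y)   # 4 x 3 matrix of euclidean cross-distances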

(3) If you need to compute distances between the rows of one matrix and the rows of another matrix, you can vectorize the code, e.g. for the euclidean distance:

    # distances between the rows of xmat and the rows of ymat
    function(xmat, ymat) {
      t(apply(xmat, 1, function(x) {
        sqrt(colSums((t(ymat) - x)^2))
      }))
    }
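The same pattern adapts to the Manhattan case in the question (a sketch; manhattan_cross is just my own name for it):

    # Manhattan cross-distances between the rows of xmat and the rows of ymat
    manhattan_cross <- function(xmat, ymat) {
      t(apply(xmat, 1, function(x) {
        colSums(abs(t(ymat) - x))
      }))
    }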

(4) Use C++ via Rcpp to make use of the BLAS functionality, and you may even consider parallelizing the code with RcppParallel (it has a distance matrix example).
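A minimal sketch of that route, needing only Rcpp and a working compiler (manhattan_cross_cpp is my own name, and this does not yet handle the permutation part):

    library(Rcpp)
    cppFunction('
    NumericMatrix manhattan_cross_cpp(NumericMatrix x, NumericMatrix y) {
      // Manhattan distances between every row of x and every row of y
      int n = x.nrow(), m = y.nrow(), p = x.ncol();
      NumericMatrix out(n, m);
      for (int i = 0; i < n; ++i)
        for (int j = 0; j < m; ++j) {
          double s = 0.0;
          for (int k = 0; k < p; ++k) s += std::fabs(x(i, k) - y(j, k));
          out(i, j) = s;
        }
      return out;
    }')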

Once you have fast routines for medium-sized data, you can then look into distributing the computation to a cluster for large data.

Drey