18

I'm currently using the built-in function dist to calculate my distance matrix in R.

dist(featureVector,method="manhattan")

This is currently the bottleneck of the application, so the idea was to parallelize this task (conceptually, this should be possible).

Searching Google and this forum turned up nothing.

Does anybody have an idea?

Konrad
Vespasian
  • could you provide an example `featureVector`? – Arun Jun 16 '13 at 22:24
  • featureVector is simply a data.frame with 100 columns and about 2000 rows. The columns hold the values of the individual dimensions – Vespasian Jun 16 '13 at 22:33
  • takes ~0.05 sec on my machine, how about you? I'm thinking anything parallel might have a longer overhead. And just to make sure, your expected output is a 100-by-100 matrix, right? – flodel Jun 16 '13 at 22:43
  • ok, I now realize you are looking for a 2000-by-2000 matrix. That one is taking 0.6 sec on my machine, so parallelization might be an option. – flodel Jun 16 '13 at 22:51
  • My expected output is a 2000x2000 matrix, so it will need to do 2000*2000*100 = 400,000,000 operations, right? If there is such a discrepancy in speed, there is probably an error in my code, although I cannot see it at the moment – Vespasian Jun 16 '13 at 23:05
  • You should provide complete R code to create your input data and the output of a system.time call to make it easier to reproduce and understand your problem. – Karl Forner Jun 28 '13 at 11:10
  • reproducible example with benchmark please... – Karl Forner Jul 03 '13 at 09:12
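
On the operation count raised in the comments: dist only computes each unordered pair of rows once, so the work is closer to half of the 2000*2000*100 estimate. A quick check of the arithmetic, as a sketch (the variable names are just for illustration):

```r
n <- 2000  # rows (observations)
p <- 100   # columns (dimensions)

n_pairs   <- choose(n, 2)   # unordered row pairs: 1999000
ops_dist  <- n_pairs * p    # absolute differences actually needed: 199900000
ops_naive <- n * n * p      # full-matrix estimate from the comment: 400000000

# A "dist" object stores one double (8 bytes) per pair
size_mb <- n_pairs * 8 / 1e6

print(c(pairs = n_pairs, ops = ops_dist, naive = ops_naive))
print(size_mb)
```

So the result for this input is small (about 16 MB), and the timings reported by flodel are plausible for that amount of work.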

6 Answers

21

The R package amap provides robust and parallelized functions for clustering and principal component analysis. Among these, the Dist function offers what you are looking for: it computes and returns the distance matrix in parallel.

Dist(x, method = "euclidean", nbproc = 8)

The code above computes the Euclidean distance using 8 threads. For the distance used in the question, pass method = "manhattan" instead.
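
Before swapping Dist in for dist, it may be worth checking on a small input that both give the same numbers. A minimal sketch, assuming amap is installed (the comparison is skipped otherwise); the matrix size and nbproc = 2 are arbitrary choices for illustration:

```r
set.seed(1)
x <- matrix(rnorm(200 * 10), nrow = 200, ncol = 10)

# Reference result from the built-in function
d_ref <- dist(x, method = "manhattan")

if (requireNamespace("amap", quietly = TRUE)) {
  # Same computation with amap, using 2 threads
  d_par <- amap::Dist(x, method = "manhattan", nbproc = 2)
  # Compare the raw values only, ignoring attributes of the "dist" objects
  print(all.equal(as.vector(d_ref), as.vector(d_par)))
} else {
  message("amap is not installed; skipping the comparison")
}
```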

noleto
Zhilong Jia
  • amap::Dist is the multi-threaded (parallelized) version of dist. I believe it's the best answer! Ref: http://www.inside-r.org/packages/cran/amap/docs/Dist – Zhilong Jia Nov 23 '14 at 14:25
  • I totally agree, this is the best answer! – WAF Apr 07 '15 at 11:50
  • Thank you for this answer. I can't, however, figure out whether amap's `hcluster()` will work with a distance matrix, or does it absolutely require raw data? – Maxim.K Jun 02 '15 at 08:53
  • @Maxim.K According to the manual, it only accepts raw data, not a dist object; the distance matrix is computed internally in parallel once you set the nbproc and method parameters. Ref: http://www.inside-r.org/packages/cran/amap/docs/hcluster – Zhilong Jia Jun 02 '15 at 13:13
  • Note that per the documentation, this package doesn't parallelize on Windows – Omri374 Aug 24 '15 at 12:45
4

Here's the structure for one route you could take. It is not faster than just using the dist() function; in fact, it takes many times longer. It does process in parallel, but even if the computation time were reduced to zero, the time to start up the function and export the variables to the cluster would probably be longer than just using dist().

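library(parallel)

# Example data: 2000 observations with 100 features each
vec.array <- matrix(rnorm(2000 * 100), nrow = 2000, ncol = 100)

# Manhattan (taxicab) distance from one row vector to every row of the matrix
TaxiDistFun <- function(one.vec, whole.matrix) {
    diff.matrix <- t(t(whole.matrix) - one.vec)
    this.row <- apply(diff.matrix, 1, function(x) sum(abs(x)))
    return(this.row)
}

# Start one worker per core and copy the data and the function to each worker
cl <- makeCluster(detectCores())
clusterExport(cl, list("vec.array", "TaxiDistFun"))

system.time(dist.array <- parRapply(cl, vec.array,
                        function(x) TaxiDistFun(x, vec.array)))

stopCluster(cl)

# parRapply returns a flat vector; reshape it into the 2000 x 2000 matrix
dim(dist.array) <- c(2000, 2000)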
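
One variation on the structure above, as a rough sketch rather than a tuned implementation: hand each worker a whole block of rows instead of one row at a time, and let it use vectorized rowSums. The helper name block_manhattan, the 300 x 20 test matrix and the choice of 2 workers are assumptions made for illustration. Whether this beats dist() will depend on the input size and the cluster start-up cost.

```r
library(parallel)

set.seed(42)
x <- matrix(rnorm(300 * 20), nrow = 300, ncol = 20)

# Manhattan distances from each row in `rows` to every row of `m`.
# Returns a length(rows) x nrow(m) matrix.
block_manhattan <- function(rows, m) {
  t(vapply(rows,
           function(i) rowSums(abs(sweep(m, 2, m[i, ]))),
           numeric(nrow(m))))
}

n_workers <- 2
cl <- makeCluster(n_workers)

# Split the row indices into one contiguous chunk per worker
chunks <- splitIndices(nrow(x), n_workers)

# Each worker receives its chunk plus a copy of the matrix
blocks <- parLapply(cl, chunks, block_manhattan, m = x)
stopCluster(cl)

# Stack the blocks back into the full n x n distance matrix
d_par <- do.call(rbind, blocks)

# Sanity check against the built-in implementation
d_ref <- as.matrix(dist(x, method = "manhattan"))
print(all.equal(d_par, d_ref, check.attributes = FALSE))
```

Note that this computes the full square matrix, so it does roughly twice the arithmetic of dist(), which only fills one triangle.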
Cyril Durand
Will Beason
4

You can also use the parDist function of the parallelDist package, which is specifically built for parallelized distance matrix computations. Advantages are that the package is available on Mac OS, Windows and Linux and already supports 39 different distance measures (see parDist).

Performance comparison for Manhattan distance (system specs: Mac OS; Intel Core i7 with 4 cores @ 2.5 GHz and hyperthreading enabled):

library(parallelDist)
library(amap)
library(wordspace)
library(microbenchmark)

set.seed(123)
x <- matrix(rnorm(2000 * 100), nrow = 2000, ncol = 100)

microbenchmark(parDist(x, method = "manhattan"),
               Dist(x, method = "manhattan", nbproc = 8),
               dist.matrix(x, method = "manhattan"),
               times = 10)

Unit: milliseconds
                                      expr      min       lq     mean   median       uq      max neval
          parDist(x, method = "manhattan") 210.9478 214.3557 225.5894 221.3705 237.9829 247.0844    10
 Dist(x, method = "manhattan", nbproc = 8) 749.9397 755.7351 797.6349 812.6109 824.4075 844.1090    10
      dist.matrix(x, method = "manhattan") 256.0831 263.3273 279.0864 275.1882 296.3256 311.3821    10

With a larger matrix:

x <- matrix(rnorm(10000 * 100), nrow = 10000, ncol = 100)
microbenchmark(parDist(x, method = "manhattan"),
               Dist(x, method = "manhattan", nbproc = 8),
               dist.matrix(x, method = "manhattan"),
               times = 10)
Unit: seconds
                                      expr       min        lq      mean    median        uq       max neval
          parDist(x, method = "manhattan")  6.298234  6.388501  6.737168  6.894203  6.947981  7.221661    10
 Dist(x, method = "manhattan", nbproc = 8) 22.722947 24.113681 24.326157 24.477034 24.658145 25.301353    10
      dist.matrix(x, method = "manhattan")  7.156861  7.505229  7.544352  7.567980  7.655624  7.800530    10

Further performance comparisons can be found in parallelDist's vignette.

Aeck
2

I am a Windows user looking for an efficient way to compute a distance matrix for use in hierarchical clustering (for example, with the function hclust from the "stats" package). The function Dist doesn't work in parallel on Windows, so I had to look for something different, and I found Stefan Evert's "wordspace" package, which contains the dist.matrix function. You can try this code:

X <- data.frame(replicate(1000,sample(0:1,5000,rep=TRUE)))
system.time(d <- dist(X, method = "manhattan"))
system.time(d2 <- as.dist( dist.matrix(as.matrix(X), method="manhattan") ))

As you can see, computing the distance matrix for a data frame with 1000 binary features and 5000 instances is much faster with dist.matrix.

These are the results on my laptop (i7-6500U):

> system.time(d <- dist(X, method = "manhattan"))
   user  system elapsed 
 151.79    0.04  152.59 
> system.time(d2 <- as.dist( dist.matrix(as.matrix(X), method="manhattan") ))
   user  system elapsed 
  19.19    0.22   19.56 

This solved my problem. Here you can check the original thread where I found it: http://r.789695.n4.nabble.com/Efficient-distance-calculation-on-big-matrix-td4633598.html

It doesn't compute the distances in parallel, but it is fast enough on many occasions.
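
For the hierarchical clustering step mentioned above, the as.dist() conversion is what lets a full square matrix be handed to hclust. A small base-R sketch of that hand-off; here as.matrix(dist(...)) is only a stand-in for whatever function produced the full matrix (dist.matrix in the code above), and the data size, linkage method and k = 3 are arbitrary:

```r
set.seed(7)
X <- matrix(sample(0:1, 50 * 8, replace = TRUE), nrow = 50, ncol = 8)

# Stand-in for a full, symmetric n x n distance matrix
full <- as.matrix(dist(X, method = "manhattan"))

# Convert to the compact "dist" class that hclust expects
d <- as.dist(full)

# Cluster and cut the tree into 3 groups
hc <- hclust(d, method = "average")
groups <- cutree(hc, k = 3)
print(table(groups))
```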

  • dist.matrix is indeed very fast, but does not work with NA's. Is there a solution which also handles NA's? – Omry Atia May 02 '18 at 08:20
1

I am also working with somewhat large distance matrices and trying to speed up the computation. Will Beason above is likely to be correct when he says that "the time to start up the function and export the variables to the cluster would probably be longer than just using dist()".

However, I think this applies only to distance matrices of small to moderate size. See the example below, which uses the function Dist from the package amap with 10 processors, dist from the package stats, and rdist from the package fields, which calls a Fortran function. The first example creates a 400 x 400 distance matrix. The second creates a 3103 x 3103 distance matrix.

require(sp)
require(fields)
require(amap)
data(meuse.grid)
meuse.gridA <- meuse.grid[1:400, 1:2]
meuse.gridB <- meuse.grid[, 1:2]

# small distance matrix
a <- Sys.time()
invisible(dist(meuse.gridA, diag = TRUE, upper = TRUE))
Sys.time() - a
Time difference of 0.002138376 secs
a <- Sys.time()
invisible(Dist(meuse.gridA, nbproc = 10, diag = TRUE, upper = TRUE))
Sys.time() - a
Time difference of 0.005409241 secs
a <- Sys.time()
invisible(rdist(meuse.gridA))
Sys.time() - a
Time difference of 0.02312016 secs

# large distance matrix
a <- Sys.time()
invisible(dist(meuse.gridB, diag = TRUE, upper = TRUE))
Sys.time() - a
Time difference of 0.09845328 secs
a <- Sys.time()
invisible(Dist(meuse.gridB, nbproc = 10, diag = TRUE, upper = TRUE))
Sys.time() - a
Time difference of 0.05900002 secs
a <- Sys.time()
invisible(rdist(meuse.gridB))
Sys.time() - a
Time difference of 0.8928168 secs

Note how the computation time dropped from 0.09845328 secs with dist to 0.05900002 secs with Dist when the distance matrix was large (3103 x 3103). As such, I would suggest that you use the function Dist from the amap package, provided you have several processors available.
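
Single Sys.time() differences in the millisecond range can be noisy, so repeating each measurement may give a steadier picture. A minimal base-R sketch; median_elapsed is a hypothetical helper written for this example, and the random points stand in for the meuse.grid coordinates:

```r
# Run a zero-argument function several times and return the median
# elapsed wall-clock time in seconds.
median_elapsed <- function(f, times = 5) {
  elapsed <- vapply(seq_len(times),
                    function(i) system.time(f())[["elapsed"]],
                    numeric(1))
  median(elapsed)
}

set.seed(3)
pts <- matrix(runif(1000 * 2), ncol = 2)

t_dist <- median_elapsed(function() dist(pts))
print(t_dist)
```

The same helper could wrap the Dist and rdist calls above to compare all three on equal terms.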

Samuel-Rosa
0

I've found parallelDist to be orders of magnitude faster than dist, while using much less virtual memory in the process, on my Mac under Microsoft R Open 3.4.0. A word of warning, though: I've had no luck compiling it on R 3.3.3. The package doesn't list a minimum R version as a dependency, but I suspect there is one.

Nick