8

I have two huge matrices with equal dimensions. I want to calculate Euclidean distance between them. I know this is the function:

euclidean_distance <- function(p,q){
  sqrt(sum((p - q)^2))
}

and if these are two matrices:


set.seed(123)
    mat1 <- data.frame(x=sample(1:10000,3), 
                       y=sample(1:10000,3), 
                       z=sample(1:10000,3))
    mat2 <- data.frame(x=sample(1:100,3), 
                       y=sample(1:100,3), 
                       z=sample(1:1000,3))

then I need the answer be a new matrix 3*3 showing Euclidean distance between each pair of values of mat1 and mat2.

any suggestion please?

zara
  • 1,048
  • 3
  • 9
  • 19

3 Answers3

11

This is a job for the base function outer:

outer(mat1,mat2,Vectorize(euclidean_distance))
         x         y         z
x  9220.40  9260.736  8866.034
y 12806.35 12820.086 12121.927
z 11630.86 11665.869 11155.823
A. Webb
  • 26,227
  • 1
  • 63
  • 95
  • This is giving error _Error in formals(FUN) : object 'euclidean_distance' not found_ – Sim101011 Sep 28 '17 at 11:33
  • 3
    See the OP's function named euclidean_distance(). It is not a built-in R function. – user5054 Feb 21 '18 at 21:01
  • The method in this answer calculates the distance between columns and not rows. And if it's used with a matrix or a transposed dataframe, then it produces a 4-dimensional array. To calculate the distance between rows, you can convert the rows of each input matrix to a list of vectors: `x=matrix(1:12,4);y=matrix(1:9,3);outer(split(x,row(x)),split(y,row(y)),Vectorize(function(x,y)sqrt(sum((x-y)^2))))`. But in my benchmark, that was about a hundred times slower than `sqrt(outer(rowSums(x^2),rowSums(y^2),"+")-2*x%*%t(y))`. – nisetama Aug 04 '22 at 15:05
9

You can use the package pdist:

library(pdist)
dists <- pdist(t(mat1), t(mat2))
as.matrix(dists)
         [,1]      [,2]      [,3]
[1,]  9220.40  9260.735  8866.033
[2,] 12806.35 12820.086 12121.927
[3,] 11630.86 11665.869 11155.823

this will give you all Euclidean distances of the pairs: (mat1$x,mat2$x), (mat1$x,mat2$y),..., (mat1$z,mat2$z)

J.R.
  • 3,838
  • 1
  • 21
  • 25
  • is this way calculates Euclidean distance among pairs? – zara Jan 30 '16 at 20:39
  • yes, this amounts to applying your function, `euclidean_distance()` to all the pairs. – J.R. Jan 30 '16 at 20:41
  • I tried to install that library by install.packages(pdist) but it gives error:Error in install.packages : object 'pdist' not found. how can I install this library? – zara Jan 30 '16 at 20:52
  • 1
    maybe `install.packages("pdist")`? – J.R. Jan 30 '16 at 20:52
  • hmm.. so this is the distance between each row/column of the matrices? Is it possible to get a single value as the distance between two matrices? – mindlessgreen Mar 29 '17 at 11:01
  • depends on what you mean by the "distance" between two matrices, could be as simple as `dist(A-B)` for L2 (Euclidean) – J.R. Mar 29 '17 at 12:34
  • @J.R. I tried the solution here, it's more like the diffence among an element of Mat1 to Mat2. and when I alter the matrix(or data.frame) to vector, it gives a certain value of distance. (just feed back to your solution), thx. – Grec001 Dec 17 '18 at 10:23
  • 1
    @Grec001 yes, it is the pairwise distance (L2) between each pair of "obervations". Or as stated in the documentation: "Compute distance matrix between two matrices of observations, or two subsets of one matrix" – J.R. Dec 17 '18 at 16:25
1
library(Rcpp)
library(microbenchmark)

cppFunction('NumericMatrix crossdist(NumericMatrix x,NumericMatrix y){
  int n1=x.nrow(),n2=y.nrow(),ncol=x.ncol(),i,j,k;
  if(ncol!=y.ncol())throw std::runtime_error("Different column number");
  NumericMatrix out(n1,n2);
  for(i=0;i<n1;i++)
    for(j=0;j<n2;j++){
      double sum=0;
      for(k=0;k<ncol;k++)sum+=pow(x(i,k)-y(j,k),2);
      out(i,j)=sqrt(sum);
    }
  return out;
}')

cppFunction('NumericMatrix crossdist2(NumericMatrix x,NumericMatrix y){
  int n1=x.nrow(),n2=y.nrow(),ncol=x.ncol(),i,j,k;
  if(ncol!=y.ncol())throw std::runtime_error("Different column number");
  NumericMatrix out(n1,n2);
  double rs1[n1],rs2[n2],sum;
  for(i=0;i<n1;i++){sum=0;for(j=0;j<ncol;j++)sum+=pow(x(i,j),2);rs1[i]=sum;}
  for(i=0;i<n2;i++){sum=0;for(j=0;j<ncol;j++)sum+=pow(y(i,j),2);rs2[i]=sum;}
  for(i=0;i<n1;i++)for(j=0;j<n2;j++){
    sum=0;
    for(k=0;k<ncol;k++)sum+=x(i,k)*y(j,k);
    out(i,j)=sqrt(rs1[i]+rs2[j]-2*sum);
  }
  return out;
}')

x=matrix(rnorm(2e4),,10)
y=matrix(rnorm(1e4),,10)

b=microbenchmark(times=100,
  crossdist(x,y),
  crossdist2(x,y),
  Rfast::dista(x,y),
  proxy::dist(x,y),
  pracma::distmat(x,y),
  as.matrix(pdist::pdist(x,y)),
  sqrt(outer(rowSums(x^2),rowSums(y^2),"+")-2*tcrossprod(x,y)),
  sqrt(outer(rowSums(x^2),rowSums(y^2),"+")-2*x%*%t(y)),
  sqrt(Rfast::Outer(Rfast::rowsums(y^2),Rfast::rowsums(x^2),"+")-2*x%*%t(y)),
  sqrt(Rfast::Outer(Rfast::rowsums(y^2),Rfast::rowsums(x^2),"+")-2*Rfast::Tcrossprod(x,y)),
  outer(split(x,row(x)),split(y,row(y)),Vectorize(function(x,y)sqrt(sum((x-y)^2))))
)

a=aggregate(b$time,list(b$expr),median)
a=a[order(a[,2]),]
writeLines(paste(sprintf("%.3f",a[,2]/min(a[,2])),gsub(" ","",a[,1])))

Result:

1.000 crossdist(x,y)
1.054 crossdist2(x,y)
1.217 sqrt(Rfast::Outer(Rfast::rowsums(y^2),Rfast::rowsums(x^2),"+")-2*Rfast::Tcrossprod(x,y))
1.227 sqrt(Rfast::Outer(Rfast::rowsums(y^2),Rfast::rowsums(x^2),"+")-2*x%*%t(y))
1.897 Rfast::dista(x,y)
1.946 sqrt(outer(rowSums(x^2),rowSums(y^2),"+")-2*tcrossprod(x,y))
1.950 sqrt(outer(rowSums(x^2),rowSums(y^2),"+")-2*x%*%t(y))
2.004 proxy::dist(x,y)
2.402 as.matrix(pdist::pdist(x,y))
3.674 pracma::distmat(x,y)
177.474 outer(split(x,row(x)),split(y,row(y)),Vectorize(function(x,y)sqrt(sum((x-y)^2))))

tcrossprod(m1,m2) is supposed to be a slightly faster alternative to m1%*%t(m2), even though it was about as fast in this benchmark:

> m1=matrix(rnorm(2e4),,10);m2=matrix(rnorm(1e4),,10)
> microbenchmark(times=1000,tcrossprod(m1,m2),m1%*%t(m2),Rfast::Tcrossprod(m1,m2))
                      expr      min       lq     mean   median       uq
        tcrossprod(m1, m2) 12.28305 13.06046 17.58402 17.60379 17.74104
              m1 %*% t(m2) 12.79996 17.30764 17.52570 17.59473 17.70758
 Rfast::Tcrossprod(m1, m2) 11.48939 13.81658 17.68059 17.23675 17.37447

This is a fast way to calculate the distance of row 1 in m1 to row 1 in m2, row 2 in m1 to row 2 in m2, and so on:

sqrt(rowSums((m1-m2)^2))

This is a fast way to calculate the distance of a vector v to each row of matrix m:

sqrt(rowSums(m^2)+sum(v^2)-2*(m%*%as.matrix(v))[,1])
nisetama
  • 7,764
  • 1
  • 34
  • 21
  • 1
    ```Rfast::Tcrossprod``` needs more than 500*500 size to play fast. – Manos Papadakis Aug 08 '22 at 17:02
  • @ManosPapadakis Is it a feature or a bug that the matrix returned by `Rfast::Outer` is a transposed version of the regular `outer` function? – nisetama Aug 08 '22 at 17:17
  • No it is not a bug. My partner want it to be called ```Rfast::Outer(y,x)```. So if you want the same result, just flip the arguments and it will play correct. – Manos Papadakis Aug 11 '22 at 13:49