Fast distance calculation in R

Question

I'm trying to calculate the

1) Euclidean distance, and

2) Mahalanobis distance

for a set of matricies in r. I've been doing it as such:

v1 <- structure(c(0.508, 0.454, 0, 2.156, 0.468, 0.488, 0.682, 1, 1.832, 
            0.44, 0.928, 0.358, 1, 1.624, 0.484, 0.516, 0.378, 1, 1.512, 
            0.514, 0.492, 0.344, 0, 1.424, 0.508, 0.56, 0.36, 1, 1.384, 0.776, 
            1.888, 0.388, 0, 1.464, 0.952, 0.252, 0.498, 1, 1.484, 0.594, 
            0.256, 0.54, 2, 2.144, 0.402, 0.656, 2.202, 1, 1.696, 0.252), 
          .Dim = c(5L, 10L), 
          .Dimnames = list(NULL, c("KW_1", "KW_2", "KW_3", "KW_4", "KW_5", "KW_6", "KW_7", "KW_8", "KW_9", "KW_10")))

v2 <- structure(c(1.864, 1.864, 1.864, 1.864, 1.864, 1.6, 1.6, 1.6, 
            1.6, 1.6, 1.536, 1.536, 1.536, 1.536, 1.536, 1.384, 1.384, 1.384, 
            1.384, 1.384, 6.368, 6.368, 6.368, 6.368, 6.368, 2.792, 2.792, 
            2.792, 2.792, 2.792, 2.352, 2.352, 2.352, 2.352, 2.352, 2.624, 
            2.624, 2.624, 2.624, 2.624, 1.256, 1.256, 1.256, 1.256, 1.256, 
            1.224, 1.224, 1.224, 1.224, 1.224), 
          .Dim = c(5L, 10L), 
          .Dimnames = list(NULL, c("KW_1", "KW_2", "KW_3", "KW_4", "KW_5", "KW_6", "KW_7", "KW_8", "KW_9", "KW_10")))

L2 <- sqrt(rowSums((v1-v2)^2))  # Euclidean distance for each row

which provides:

[1] 7.132452 7.568359 7.536904 5.448696 7.163580

That's perfect! But I've heard you can also compute Euclidean/L2 distance using the following form:

I'd like to calculate my distance this way because the Mahalanobis distance is simply this and the covariance matrix. See this.

I haven't figured out how to code this in r, however. I've tried:

sqrt(crossprod((t(v1)-t(v2))))

and

sqrt((v1-v2) %*% t(v1-v2))

But they just don't give me what I want. Suggestions?

Note - I'm looking to do this as a single operation, not in a loop of any kind. It has to be very fast because I'm doing it over millions of rows multiple times. Maybe it's not possible. I'm open to changing the format of v1 and v2.

Nope, I need a distance for each row. Results must match `[1] 7.132452 7.568359 7.536904 5.448696 7.163580` — Zafar, May 23 '17 at 22:54

Marius · Accepted Answer · 2017-05-23T23:57:30.017

1

You need to apply the formula to each row individually, so something like:

> sapply(1:nrow(v1), function(i) {
+     q = v1[i, ] - v2[i, ]
+     d = sqrt(t(q) %*% q)
+     d
+ })
[1] 7.132452 7.568359 7.536904 5.448696 7.163580

If you need something faster you can always try the same thing in C++ (code adapted from here):

#include <Rcpp.h>

using namespace Rcpp;

double dist2(NumericVector x, NumericVector y){
    double d = sqrt( sum( pow(x - y, 2) ) );
    return d;
}

// [[Rcpp::export]]
NumericVector calc_l2 (NumericMatrix x, NumericMatrix y){
    int out_length = x.nrow();
    NumericVector out(out_length);

    for (int i = 0 ; i < out_length; i++){
        NumericVector v1 = x.row(i);
        NumericVector v2 = y.row(i);
        double d = dist2(v1, v2);
        out(i) = d;
    }
    return (out) ;
}

Running in R:

library(Rcpp)

sourceCpp("calc_L2.cpp")
calc_l2(v1, v2)

edited May 23 '17 at 23:57

answered May 23 '17 at 23:04

Marius

58,213
16
107
105

I have millions of rows in my dataset so this won't be fast enough. – Zafar May 23 '17 at 23:06
2

If you have constraints and requirements it would probably be a good idea to tell us about them in your question. – takendarkk May 23 '17 at 23:09
Dan, look up the data.table package. This will help you when dealing with such a large dataset. You will find a lot of information on using the package on this site. – Evan Friedland May 23 '17 at 23:18
@Dan: you can apply basically the same code in C++ for a speedup. – Marius May 23 '17 at 23:58
Very cool with the `Rcpp` action. I've implemented to see if I get a speed-up with L1 and unfortunately the R vectorization of `sqrt(rowSums((v1-v2)^2))` is faster than the `for` loop in C++. That leaves two points: 1) Is there a way to translate `sqrt(rowSums((v1-v2)^2))` to C++? so as to avoid the `for` loop? and 2) How will I implement the `Mahalanobis` distance using `C++`? I'll post my best shot at this in `C++` above. – Zafar May 24 '17 at 05:42
OK I couldn't figure out how to implement it so as just to take args v1 and v2 – Zafar May 24 '17 at 17:12
I ended up figuring it out, but R was actually faster. Still haven't done Mahalanobis though. That will necessarily be in RCPP – Zafar Jun 04 '17 at 20:35

Fast distance calculation in R

1 Answers1