2

I'm trying to calculate the weighted euclidean distance (squared) between twoo data frames that have the same number of columns (variables) and different number of rows (observations).

The calculation follows the formula:

DIST[m,i] <- sum(((DATA1[m,] - DATA2[i,]) ^ 2) * lambda[1,])

I specifically need to multiply each parcel of the somatory by a specific weight (lambda).

The code provided bellow runs correctly, but if I use it in hundreds of iterations it takes a lot of processing time. Yesterday it took me 18 hours to create a graphic using multiple iterations of a function that contains this calculation. Using library(profvis) profvis({ my code }) I saw that this specific part of the code is taking up like 80% of the processing time.

I read a lot about how to reduce the processing time using parallel and vectorized operations, but I don't know how to implement them in this particular case, because of the weight lamb#.

Can some one help me reduce my processing time with this code?

More information about the code and the structure of the data can be found in the code provided bellow as comments.

# Data frames used to calculate the euclidean distances between each observation 
#   from DATA1 and each observation from DATA2.
# The euclidean distance is between a [600x50] and a [8X50] dataframes, resulting 
#   in a [600X8] dataframe.
DATA1 <- matrix(rexp(30000, rate=.1), ncol=50) #[600x50]
DATA2 <- matrix(rexp(400, rate=.1), ncol=50) #[8X50]

 

# Weights used for each of the 50 variables to calculate the weighted 
#   euclidean distance.
# Can be a vector of different weights or a scalar of the same weight 
#   for all variables.
lambda <- runif(n=50, min=0, max=10)   ## length(lambda) > 1
# lambda=1   ## length(lambda) == 1

if (length(lambda) > 1) {
  as.numeric(unlist(lambda))
  lambda <- as.matrix(lambda)
  lambda <- t(lambda)
}

nrows1 <- nrow(DATA1)
nrows2 <- nrow(DATA2) 

 

# Euclidean Distance calculation
DIST <- matrix(NA, nrow=nrows1, ncol=nrows2 )  
for (m in 1:nrows1) {
  for (i in 1:nrows2) {
    if (length(lambda) == 1) { 
      DIST[m, i] <- sum((DATA1[m, ] - DATA2[i, ])^2) 
    }
    if (length(lambda) > 1){ 
      DIST[m, i] <- sum(((DATA1[m, ] - DATA2[i, ])^2) * lambda[1, ])
    }
    next
  }
  next
}

After all the sugestions, combining the answers from @MDWITT (for length(lambda > 1) and @F. Privé (for length(lambda == 1) the final solution took only one minute to run, whilst the original one took me an hour and a half to run, in a bigger code that has that calculation. The final code for this problem, for those interested, is:

#Data frames used to calculate the euclidean distances between each observation from DATA1 and each observation from DATA2.
#The euclidean distance is between a [600x50] and a [8X50] dataframes, resulting in a [600X8] dataframe.
DATA1 <- matrix(rexp(30000, rate=.1), ncol=50) #[600x50]
DATA2 <- matrix(rexp(400, rate=.1), ncol=50) #[8X50]

#Weights used for each of the 50 variables to calculate the weighted euclidean distance.
#Can be a vector of different weights or a scalar of the same weight for all variables.
#lambda <- runif(n = 50, min = 0, max = 10)   ##length(lambda) > 1
lambda = 1   ##length(lambda) == 1

nrows1 <- nrow(DATA1)
nrows2 <- nrow(DATA2) 

#Euclidean Distance calculation
DIST <- matrix(NA, nrow = nrows1, ncol = nrows2)  

if (length(lambda) > 1){
  as.numeric(unlist(lambda))
  lambda <- as.matrix(lambda)
  lambda <- t(lambda)

  library(Rcpp)
  cppFunction('NumericMatrix weighted_distance (NumericMatrix x, NumericMatrix y, NumericVector lambda){

              int n_x = x.nrow();
              int n_y = y.nrow();


              NumericMatrix DIST(n_x, n_y);

              //begin the loop

              for (int i = 0 ; i < n_x; i++){
              for (int j = 0  ; j < n_y ; j ++) {
              double d = sum(pow(x.row(i) - y.row(j), 2)*lambda);
              DIST(i,j) = d;
              }
              }
              return (DIST) ;
  }')

    DIST <- weighted_distance(DATA1, DATA2, lambda = lambda)}


  if (length(lambda) == 1) { 
    DIST <- outer(rowSums(DATA1^2), rowSums(DATA2^2), '+') - tcrossprod(DATA1, 2 * DATA2)
  }
Ana F.
  • 43
  • 6
  • Probably won't go much faster, but why storing lambda in a single row matrix. Just store in a vector and avoid subsetting `lambda[1,]`. Also, he `next` statements in your code are useless – digEmAll May 09 '19 at 19:41
  • might be worth checking this out as well it would be a move to Rcpp but you would need to add the lamba – MDEWITT May 09 '19 at 19:42
  • Then, I would completely remove the two `if` statements by setting `lambda = 1` when `length(lambda)==1`, so you can use a one-line formula directly in the 2nd loop – digEmAll May 09 '19 at 19:43
  • Indices and sizes are wrong in your code. Please fix it. – F. Privé May 10 '19 at 06:11
  • @MDEWITT I'll look in to that! Thanks! – Ana F. May 10 '19 at 11:56
  • @f-privé You are right, the size of the resulting matrix is wrong. I'm correcting it now. – Ana F. May 10 '19 at 12:24

2 Answers2

4

Rewrite the problem to use linear algebra and vectorization, which is much faster than loops.

If you don't have lambda, this is just

outer(rowSums(DATA1^2), rowSums(DATA2^2), '+') - tcrossprod(DATA1, 2 * DATA2)

With lambda, it becomes

outer(drop(DATA1^2 %*% lambda), drop(DATA2^2 %*% lambda), '+') -
    tcrossprod(DATA1, sweep(DATA2, 2, 2 * lambda, '*'))
F. Privé
  • 11,423
  • 2
  • 27
  • 78
  • The lambda solution results in ```Error in DATA1^2 %*% lambda : non-conformable arguments```. The non-lambda solution is 50 times faster than OP loop. – Cole May 10 '19 at 11:01
  • @Cole is right. The lambda solution results in Error in DATA1^2 %*% lambda : non-conformable arguments, but the other line works correctly. – Ana F. May 10 '19 at 12:26
  • Using `lambda <- runif(n=50, min=0, max=10)`, I don't get any error. – F. Privé May 10 '19 at 20:05
1

Here an alternate way using Rcpp just to have this concept documents. In a file called euclidean.cpp in it I have

#include <Rcpp.h>
#include <cmath>

using namespace Rcpp;

// [[Rcpp::export]]

NumericMatrix weighted_distance (NumericMatrix x, NumericMatrix y, NumericVector lambda){

  int n_x = x.nrow();
  int n_y = y.nrow();


  NumericMatrix out(n_x, n_y);

  //begin the loop

  for (int i = 0 ; i < n_x; i++){
    for (int j = 0  ; j < n_y ; j ++) {
      double d = sum(pow(x.row(i) - y.row(j), 2)*lambda);
      out(i,j) = d;
    }
  }
  return (out) ;
}

In R, then I have

library(Rcpp)
sourceCpp("libs/euclidean.cpp")

# Generate Data
DATA1 <- matrix(rexp(30000, rate=.1), ncol=50) #[600x50]
DATA2 <- matrix(rexp(400, rate=.1), ncol=50) #[8X50]
lambda <- runif(n=50, min=0, max=10)

# Run the program

out <- weighted_distance(DATA1, DATA2, lambda = lambda)

When I test the speed using:

microbenchmark(
  Rcpp_way = weighted_distance(DATA1, DATA2, lambda = lambda),
other = {DIST <- matrix(NA, nrow=nrows1, ncol=ncols)  
for (m in 1:nrows1) {
  for (i in 1:nrows2) {
    if (length(lambda) == 1) { 
      DIST[m, i] <- sum((DATA1[m, ] - DATA2[i, ])^2) 
    }
    if (length(lambda) > 1){ 
      DIST[m, i] <- sum(((DATA1[m, ] - DATA2[i, ])^2) * lambda[1, ])
    }
    next
  }
  next
}}, times = 100)

You can see that it is a good clip faster:

Unit: microseconds
     expr       min        lq       mean    median         uq        max neval
 Rcpp_way   446.769   492.308   656.9849   562.667   846.9745   1169.231   100
    other 24688.821 30681.641 44153.5264 37511.385 50878.3585 200843.898   100
MDEWITT
  • 2,338
  • 2
  • 12
  • 23
  • thank you so much for your repply. I ran it in R with: library(Rcpp) cppFunction('NumericMatrix weighted_distance ... return (out) ;}') out <- weighted_distance(DATA1, DATA2, lambda = lambda) It ran really fast indeed. That is good!! But the resulting matrix is [600x50], it should be [600x8]. The results aren't right. It should calculate the distance between each observation of each table. Do you know how to fix this? Can you use the results of the original function for comparisson? I'm not familiar with C++. Thank you very much? – Ana F. May 10 '19 at 13:52
  • Your function seems to be alright, but the results aren't, as I said above. Don't know how to fix it. – Ana F. May 10 '19 at 14:01
  • @AnaF.updated. I can run your original function and the updated Rcpp function and get exactly the same answers within a 600 by 8 matrix – MDEWITT May 10 '19 at 15:56
  • Now it works perfectly!! Problem solved!!! It took me only one minute to run a code tha was taking an hour and a half to run. – Ana F. May 10 '19 at 17:48
  • @AnaF. glad to hear it! `Rcpp` is a big boost when you have lots of loops – MDEWITT May 10 '19 at 19:02