2

I am tasked with creating a distance matrix function based on a custom defined distance measure. The distance measure is as follows:

wabs_dist = function(u, v, w){
   return( sum((abs(u-v))*w) )
}

Where u and v are vectors and w is a weight.

The problem I am to solve:

I am to create a distance matrix function create-dm(x,w) that returns a distance matrix for the objects in dataframe x by calling the wabs-dist(a,b,w) for all pairs of objects a and b belonging to x. If x is a data set with 4 attributes then w is a vector e.g w = c(1,1,3,2) assigned to each attribute. Yes there are already standard functions like dist() but I am to create my own here using the wabs_dist.

My solution so far:

create_dm = function(x, w){ #x is a dataframe
distances = matrix(0, nrow = nrow(x), ncol = nrow(x))
for (i in 1:nrow(x)) {
 for(j in 1:(i-1)){
     distances[i, j] = wabs_dist(x[i,], x[j,], w)
     distances[j, i] = distances[i, j]
   }
}
 return(distances)  
}

How do i implement a vector of weights because i wrote this function with the mindset of passing in just one weight but now i have to write it to accept a list. How do i do implement this function using the list of weights?

This function takes A LOT of time to run. In fact it never actually prints out the distance matrix function. I cant figure out why

An example:

Let x be a data frame containing vectors a, b and c where: a: (1, 2) b: (4, 5) c: (9, 12)

w is weight vector: (0.2, 0.3)

wabs-dist(a,b,w) = 1.5 wabs-dist(b,c,w) = 3.1

create-dm(x,w)=

0     1.5   4.6

1.5   0     3.1

4.6   3.1   0
Devee
  • 61
  • 9
  • You don't explain why you put a test for j (whatever it might be) being positive. That's one of the reasons it's so slow. Also, this suggests `w` might be a constant in which case there are much faster ways to do this. PLEASE give a complete description of the problem. – IRTFM Oct 11 '18 at 05:48
  • Please clarify whether you want distance between all pairs of vectors in the matrix or you want every element's distance computed with every other's. If it's the latter then you are looking at `n C 2` number of computations in case of a perfect `n X n` matrix. – Saket Kumar Singh Oct 11 '18 at 06:07
  • @42- what do you mean by constant? Its a vector for each attribute – Devee Oct 11 '18 at 06:12
  • @SaketKumarSingh between elements – Devee Oct 11 '18 at 06:45
  • In the expression `wabs_dist(x[i,], x[j,], w)` you have two single items followed by something you didn't define. It makes little sense to have w be a more than a length-1 vector which I was sloppily referring to as a constant. This question calls out for a simple example where all the objects are defined and the correct answer given. – IRTFM Oct 11 '18 at 15:47
  • @42- added example – Devee Oct 11 '18 at 20:00
  • You don't seem to be using R code. Are you sure you picked the correct tag? – IRTFM Oct 11 '18 at 21:38
  • @42- yes i am using the correct tag. – Devee Oct 11 '18 at 22:02
  • Then you have not paid sufficient attention to [MCVE] – IRTFM Oct 11 '18 at 22:24

1 Answers1

1

I had a similar problem lately. My final solution was to write it in C++ with the Rcpp package. Save this Code as dmat.cpp

#include <Rcpp.h>

using namespace Rcpp;


// [[Rcpp::export]]
NumericMatrix dmat(NumericMatrix x, NumericVector w) {
  int n = x.nrow();
  NumericMatrix d = no_init_matrix(n, n);

  for(int i=0; i<n;i++){
    for(int j=i+1; j<n;j++){
      d(i,j)=sum(w*abs((x(i,_)-x(j,_))));
      d(j,i)=d(i,j);
    }

    d(i,i)=0;

  }
  return d;
}

Then install and load the package "Rcpp" and use sourceCpp() to load the function. After that you can use it like any other R function

library(Rcpp)
sourceCpp("path/to/file/dmat.cpp")

x <- matrix(rnorm(1500),ncol=3)
w <- 1:3

system.time(distR <- create_dm(x,w))
       User      System verstrichen 
   1.81        0.02        1.84 

system.time(distCpp <- dmat(x,w))
       User      System verstrichen 
      0           0           0

identical(round(distR,10), round(distCpp,10))
[1] TRUE

If you just use identical() without rounding it gives FALSE. Don't know why. Maybe this can be answered by someone else.

If you can use the euclidean distance instead of absolute distance you could use the package apcluster. This was my first solution. But the C++ solution was still faster .

To Mate
  • 51
  • 6
  • how do you write the for loop in r? The for (j =i+1; j – Devee Oct 11 '18 at 15:27
  • where did you get the rcpp armadillo from. I tried installing it but its still saying that it cant find the .h file – Devee Oct 11 '18 at 15:38
  • Sorry, just copied my Code and didn't pay attention. You have to install the package `RcppArmadillo` in R if you use this header. – To Mate Oct 11 '18 at 16:13
  • I edited my Code. If you use the new Code the package 'Rcpp' should be enough. With the old code you need the `RcppArmadillo` addtionally. I'm also new to Rcpp, but I was very surprised about the speed differences. – To Mate Oct 11 '18 at 16:20