
I'd like to create a distance matrix with weighted Euclidean distances from a data frame, with the weights defined in a vector. Here's an example:

library("cluster")

a <- c(1,2,3,4,5)
b <- c(5,4,3,2,1)
c <- c(5,4,1,2,3)
df <- data.frame(a,b,c)

weighting <- c(1, 2, 3)

dm <- as.matrix(daisy(df, metric = "euclidean", weights = weighting))

I've searched everywhere and can't find a package or solution for this in R. The 'daisy' function in the 'cluster' package claims to support weighting, but the weights don't seem to be applied; it just spits out regular Euclidean distances.

Any ideas, Stack Overflow?

  • https://stat.ethz.ch/R-manual/R-devel/library/cluster/html/daisy.html I might have been mistaken, actually. The documentation seems to say that weighting only works with the Gower distance. Nonetheless, my question still stands: is there a package that supports weighted Euclidean distances? – h7681 Aug 30 '16 at 20:58
  • I think you need to show the formula for a "weighted distance". – IRTFM Aug 30 '16 at 21:16
  • http://images.slideplayer.com/16/5203007/slides/slide_49.jpg So in the example (which I've corrected) if we wanted the distance between row 1 and 2 it would be calculated as: distance = 1*(1-2)^2 + 2*(5-4)^2 + 3*(5-4)^2 The distance calculation is to be applied over a large data set where the number of variables and weightings will differ on each run. So it's not as simple (or at least above my skill level) of just writing my own function, hence why I'm searching for a package. – h7681 Aug 30 '16 at 21:36
  • It looks like others have written their own function. You can probably try to recreate. – Pierre L Aug 30 '16 at 22:00
  • You could scale vectors by the square root of the weights (multiplying each dimension by its own scale factor, not a common vector operation), then carry on with euclidean distances. Have no idea how to do that in R, though. – Walter Tross Aug 30 '16 at 22:01
  • @PierreLafortune I suspected it could be as simple as that in R! – Walter Tross Aug 30 '16 at 22:09
  • It's actually more like `sweep(df, 1, weighting, function(x, y) x*sqrt(y))` – Pierre L Aug 30 '16 at 22:42
  • OK @PierreLafortune, time to write your answer (optimizing that sqrt() out of the loop, though)... – Walter Tross Aug 30 '16 at 23:12
  • @WalterTross Can you show an example of using the square root of the weight to multiply against a dataset for scaling? – Pierre L Aug 31 '16 at 02:52
  • I know how to code it, I mean the statistical reasoning for it – Pierre L Aug 31 '16 at 02:52
  • @PierreLafortune No statistical reasoning, only geometry, but you are right, see my comment to your answer. – Walter Tross Aug 31 '16 at 08:52

1 Answer


We can use @WalterTross' technique of scaling by multiplying each column by the square root of its respective weight first:

newdf <- sweep(df, 2, weighting, function(x,y) x * sqrt(y))
as.matrix(daisy(newdf, metric="euclidean"))
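As a quick sanity check (my own sketch, reusing the example data from the question), scaling each column by the square root of its weight makes the plain Euclidean distance equal sqrt(sum(w * (x_i - y_i)^2)); for rows 1 and 2 that is sqrt(1*1 + 2*1 + 3*1) = sqrt(6):

```r
library(cluster)

df <- data.frame(a = c(1,2,3,4,5),
                 b = c(5,4,3,2,1),
                 c = c(5,4,1,2,3))
weighting <- c(1, 2, 3)

# scale each column by sqrt of its weight, then take plain Euclidean distances
newdf <- sweep(df, 2, weighting, function(x, y) x * sqrt(y))
dm <- as.matrix(daisy(newdf, metric = "euclidean"))

# manual weighted distance between rows 1 and 2: sqrt(sum(w * d^2))
d12 <- sqrt(sum(weighting * (df[1, ] - df[2, ])^2))  # sqrt(6) ~ 2.449
all.equal(unname(dm[1, 2]), d12)                     # TRUE
```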

But in case you would like more control over, and a better understanding of, what Euclidean distance is, we can write a custom function. Note that I have chosen a different weighting method:

xpand <- function(d) do.call("expand.grid", rep(list(1:nrow(d)), 2))  # all row-index pairs
euc_norm <- function(x) sqrt(sum(x^2))
euc_dist <- function(mat, weights=1) {
  iter <- xpand(mat)
  # note: the weights multiply the deltas *before* squaring,
  # i.e. this computes sqrt(sum((w * d)^2)), not sqrt(sum(w * d^2))
  vec <- mapply(function(i,j) euc_norm(weights*(mat[i,] - mat[j,])), 
                iter[,1], iter[,2])
  matrix(vec, nrow(mat), nrow(mat))
}

We can test the result by checking against the daisy function:

#test1
as.matrix(daisy(df, metric="euclidean"))
#          1        2        3        4        5
# 1 0.000000 1.732051 4.898979 5.196152 6.000000
# 2 1.732051 0.000000 3.316625 3.464102 4.358899
# 3 4.898979 3.316625 0.000000 1.732051 3.464102
# 4 5.196152 3.464102 1.732051 0.000000 1.732051
# 5 6.000000 4.358899 3.464102 1.732051 0.000000

euc_dist(df)
#          [,1]     [,2]     [,3]     [,4]     [,5]
# [1,] 0.000000 1.732051 4.898979 5.196152 6.000000
# [2,] 1.732051 0.000000 3.316625 3.464102 4.358899
# [3,] 4.898979 3.316625 0.000000 1.732051 3.464102
# [4,] 5.196152 3.464102 1.732051 0.000000 1.732051
# [5,] 6.000000 4.358899 3.464102 1.732051 0.000000

The reason I doubt Walter's method is that, first, I've never seen weights applied by their square root; it's usually 1/w. Second, when I apply your weights to my function, I get a different result.

euc_dist(df, weights=weighting) 
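To make the difference between the two conventions concrete (my own sketch, reusing the example data): for rows 1 and 2 the function above squares the weight along with the delta, giving sqrt(sum((w*d)^2)) = sqrt(1 + 4 + 9) = sqrt(14), while the sqrt-of-weight scaling gives sqrt(sum(w*d^2)) = sqrt(6):

```r
df <- data.frame(a = c(1,2,3,4,5),
                 b = c(5,4,3,2,1),
                 c = c(5,4,1,2,3))
w <- c(1, 2, 3)
d <- unlist(df[1, ] - df[2, ])  # per-column deltas: -1, 1, 1

sqrt(sum((w * d)^2))  # euc_dist convention: sqrt(14) ~ 3.742
sqrt(sum(w * d^2))    # sqrt-weight scaling convention: sqrt(6) ~ 2.449
```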
Pierre L
  • 28,203
  • 6
  • 47
  • 69
  • There is more than one way of weighting. I would have scaled every axis by `w` to put `w` times the weight. For Manhattan distance this clearly gives the desired effect. Euclidean takes the square, but who says it's not doing `(w*(x_i-y_i))^2`? To me, this seems to be the least-surprise weighting scheme. – Has QUIT--Anony-Mousse Aug 31 '16 at 08:24
  • @Anony-Mousse you are right, I also would have scaled every axis by its weight instead of the square root thereof. Euclidean distance is the square root of the sum of square deltas, so in fact the OP, in their comment to their question, uses the wrong definition of distance. I stuck to that, which made me introduce square roots of weights, but that is a bad idea. – Walter Tross Aug 31 '16 at 08:51
  • The squared euclidean distance (the sum of squared deltas) is of course useful if only comparisons are needed, because it saves the computationally heavy square root extraction step, but weights should stay defined in standard euclidean metric. BTW euclidean and Manhattan distances are equal when deltas in all dimensions but one are zero. – Walter Tross Aug 31 '16 at 09:24
  • @Anony-Mousse I did not make it clear in my write-up that I chose a different weighting technique. – Pierre L Aug 31 '16 at 11:20
  • Great post, thanks. The scaling method did the trick and I'll experiment with the other techniques at some point. – h7681 Sep 01 '16 at 09:03