
Given the matrix

structure(list(X1 = c(1L, 2L, 3L, 4L, 2L, 5L), X2 = c(2L, 3L, 
4L, 5L, 3L, 6L), X3 = c(3L, 4L, 4L, 5L, 3L, 2L), X4 = c(2L, 4L, 
6L, 5L, 3L, 8L), X5 = c(1L, 3L, 2L, 4L, 6L, 4L)), .Names = c("X1", 
"X2", "X3", "X4", "X5"), class = "data.frame", row.names = c(NA, 
-6L))

I want to create a 5 x 5 distance matrix giving, for every pair of columns, the ratio of matching values to the total number of rows. For instance, the distance between X4 and X3 should be 0.5, given that the two columns match in 3 out of 6 rows.

I have tried dist(test, method="simple matching") from the "proxy" package, but that method only works for binary data.
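For a single pair of columns the quantity is easy to compute directly ("test" being the name of the data frame above):

sum(test$X3 == test$X4) / nrow(test)
# [1] 0.5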

Werner Hertzog

5 Answers


Using outer (again :-)

my.dist <- function(x) {
  n <- nrow(x)
  # proportion of rows in which columns i and j agree, for every pair (i, j)
  d <- outer(seq.int(ncol(x)), seq.int(ncol(x)),
             Vectorize(function(i, j) sum(x[[i]] == x[[j]]) / n))
  rownames(d) <- names(x)
  colnames(d) <- names(x)
  return(d)
}

my.dist(x)
#           X1        X2  X3  X4        X5
# X1 1.0000000 0.0000000 0.0 0.0 0.3333333
# X2 0.0000000 1.0000000 0.5 0.5 0.1666667
# X3 0.0000000 0.5000000 1.0 0.5 0.0000000
# X4 0.0000000 0.5000000 0.5 1.0 0.0000000
# X5 0.3333333 0.1666667 0.0 0.0 1.0000000
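The result is a similarity matrix; if an actual dist object is needed (for hclust, say), it can be flipped, as a minimal sketch:

# 1 - similarity turns the matching ratio into a dissimilarity; as.dist() drops the diagonal
d <- as.dist(1 - my.dist(x))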
flodel

Here's a shot at it (dt is your matrix):

library(reshape)
df = expand.grid(names(dt),names(dt))
df$val=apply(df,1,function(x) mean(dt[x[1]]==dt[x[2]]))
cast(df,Var2~Var1)
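The same long-to-wide step can also be done without reshape, e.g. with xtabs (a sketch, assuming df as built above):

xtabs(val ~ Var2 + Var1, data = df)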
Jesse Anderson

Here's a solution that is faster than the other two, though a bit ugly. I assume the speedup comes from not using mean(), which can be slow compared to sum(), and from only computing half of the output matrix and then filling the lower triangle manually. The function currently leaves NA on the diagonal, but you can easily set those to one to match the other answers exactly with diag(out) <- 1.

FUN <- function(m) {
  #compute all the combinations of columns pairs
  combos <- t(combn(ncol(m),2))
  #compute the similarity index based on the criteria defined
  sim <- apply(combos, 1, function(x) sum(m[, x[1]] - m[, x[2]] == 0) / nrow(m))
  combos <- cbind(combos, sim)
  #dimensions of output matrix
  out <- matrix(NA, ncol = ncol(m), nrow = ncol(m))

  for (i in 1:nrow(combos)){
    #upper tri
    out[combos[i, 1], combos[i, 2]] <- combos[i,3]
    #lower tri
    out[combos[i, 2], combos[i, 1]] <- combos[i,3]
  }
  return(out)
}
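Applied to the question's data (which is a data frame, so it needs converting to a matrix first), something along these lines should reproduce the other answers:

sim <- FUN(as.matrix(test))
diag(sim) <- 1
dimnames(sim) <- list(names(test), names(test))
sim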

I took the other two answers, made them into functions, and did some benchmarking:

library(rbenchmark)
benchmark(chase(m), flodel(m), blindJessie(m), 
          replications = 1000,
          order = "elapsed", 
          columns = c("test", "elapsed", "relative"))
#-----
            test elapsed  relative
1       chase(m)   1.217  1.000000
2      flodel(m)   1.306  1.073131
3 blindJessie(m)  17.691 14.548520
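The benchmark matrix m isn't shown above; for anyone rerunning the comparison, a stand-in along these lines will do (the size here is an assumption, not what was actually timed):

# hypothetical test data: 1000 rows, 50 integer-valued columns
m <- matrix(sample(1:10, 1000 * 50, replace = TRUE), ncol = 50)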
Chase
  • Chase, there is a bug in your code: you cannot use `combos` after you do `transform(combos, ...)` because `...` will be evaluated inside `combos`. I suspect you had another copy of `combos` in your global environment, so it worked for you. It should be an easy fix to make a copy of `combos` before calling `transform`, though. – flodel May 24 '12 at 11:24
  • @flodel - good catch, thanks. Made the appropriate adjustments and redid the timings. Sticking with matrices and cbind sped the function up as well. – Chase May 24 '12 at 13:13
  • Well then you might run them again, as I have also improved the speed of my answer. On my machine, my version is still a little slower than yours, but not by much: the ratio went down to 1.07. – flodel May 24 '12 at 15:34
  • @flodel - nice work, I get equivalent timings. I like your answer since it's more canonical. I think you can eke out a tiny bit more performance (easily) by changing the `outer(names(x), names(x)` bit to `outer(seq.int(ncol(x)), seq.int(ncol(x))`, since it is a primitive. I also think `names()` will fail if the matrix doesn't have names. When I made that change, yours tested within 1.02 of my hack job... that's probably enough micro-optimization for one night :). – Chase May 25 '12 at 03:22
  • Good point Chase, I have made the change you suggested. Thanks! – flodel May 25 '12 at 04:10

Thank you all for your suggestions. Based on your answers I put together a three-line solution ("test" is the name of the dataset).

require(proxy)
ff <- function(x,y) sum(x == y) / NROW(x)
dist(t(test), ff, upper=TRUE)

Here is the output:

          X1        X2        X3        X4        X5
X1           0.0000000 0.0000000 0.0000000 0.3333333
X2 0.0000000           0.5000000 0.5000000 0.1666667
X3 0.0000000 0.5000000           0.5000000 0.0000000
X4 0.0000000 0.5000000 0.5000000           0.0000000
X5 0.3333333 0.1666667 0.0000000 0.0000000          
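Strictly speaking these values are similarities rather than distances; if a true dissimilarity is wanted, the helper can simply be inverted, e.g.:

ff <- function(x, y) 1 - sum(x == y) / NROW(x)
dist(t(test), ff, upper=TRUE)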
Werner Hertzog
  • I can't get this to work, `ff` is not defined... even when I changed it to `f`, it failed with `Error in as.character(x) : cannot coerce type 'closure' to vector of type 'character'` – Chase May 25 '12 at 03:24
  • I think it's because the "dist" function I'm using is the one from package proxy. I'll add "require(proxy)" to the code. – Werner Hertzog May 25 '12 at 15:52

I got the answer as follows. First, I made some modifications to the raw data:

X1 = c(1L, 2L, 3L, 4L, 2L, 5L)
X2 = c(2L, 3L, 4L, 5L, 3L, 6L)
X3 = c(3L, 4L, 4L, 5L, 3L, 2L)
X4 = c(2L, 4L, 6L, 5L, 3L, 8L)
X5 = c(1L, 3L, 2L, 4L, 6L, 4L)
matrix_cor = rbind(X1, X2, X3, X4, X5)
matrix_cor

   [,1] [,2] [,3] [,4] [,5] [,6]
X1    1    2    3    4    2    5
X2    2    3    4    5    3    6
X3    3    4    4    5    3    2
X4    2    4    6    5    3    8
X5    1    3    2    4    6    4

then:

dist(matrix_cor)

     X1       X2       X3       X4
X2 2.449490                           
X3 4.472136 4.242641                  
X4 5.000000 3.000000 6.403124         
X5 4.358899 4.358899 4.795832 6.633250
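Note that dist() with its default settings gives Euclidean distances between the rows of matrix_cor rather than the matching ratio from the question; if the full symmetric matrix is wanted, it can be obtained with, for example:

as.matrix(dist(matrix_cor))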
lbusett
  • Hi. Thanks for the answer: I edited it so that code was readable. In the future, please format your answers to facilitate reading (http://stackoverflow.com/editing-help) – lbusett Feb 18 '17 at 15:03