Use of function over all row-pairs of two matrices

Question

If I want to calculate the n-dimensional (Euclidean) distance between two vectors, I can use a function such as:

a = c(1:10)
b = seq(20, 23, length.out = length(a))

test_fun = 
  function(x,y) {
    return(
      sqrt(
        sum(
          (x - y) ^ 2
        )
      )
    )
  }

n_distance = test_fun(a,b)

Now, I want to expand this to a matrix setting: I want to calculate the n-dimensional distance for each pair of rows of two matrices.

set.seed(123)
a_mtx = matrix(1:30, ncol = 5)
b_mtx = matrix(sample(1:15,15), ncol = 5)

n_distance_mtx = 
  matrix(
    NA,
    nrow = nrow(b_mtx), 
    ncol = nrow(a_mtx)
  )
for (i in seq_len(nrow(b_mtx))) {
  for (j in seq_len(nrow(a_mtx))) {
    n_distance_mtx[i, j] = 
      test_fun(a_mtx[j, ], b_mtx[i, ])
  }
}

Here each column of n_distance_mtx contains the distances between one row of a_mtx and every row of b_mtx (so n_distance_mtx[,1] holds the distances between a_mtx[1,] and b_mtx[1:3,]).

If I calculate column means on n_distance_mtx I can obtain the mean distance between each row in a_mtx and all rows of b_mtx.

colMeans(n_distance_mtx)
#[1] 23.79094 24.90281 26.15618 27.53303 29.01668 30.59220

So 23.79094 is the mean distance between a_mtx[1,] and b_mtx[1:3,], and 24.90281 is the mean distance between a_mtx[2,] and b_mtx[1:3,], and so on.

Question: How can I arrive at the same solution without using for-loops?

I want to apply this method to much larger matrices (on the order of hundreds of thousands of rows). Looking at this and this, it seems there must be a way to accomplish this by combining Vectorize with outer, but I have been unable to write such a function. My attempt below returns a plain vector rather than the distance matrix:

test_fun_vec = 
 Vectorize(
   function(x,y) {
     outer(
       x,
       y,
       test_fun
       )
   }
 )
test_fun_vec(a_mtx,b_mtx)
#[1]  4  0  2  7  4  6  3  5  1  5  7  5 10  0  9 11 15 17  8 11  9 12 10 16
#[25] 10 22 20 25 15 24

akrun · Accepted Answer · 2018-09-19T16:29:21.893


We can use Vectorize with outer, vectorizing over the row indices rather than over the matrices themselves:

f1 <- Vectorize(function(i, j) test_fun(a_mtx[j, ], b_mtx[i, ]))
out <- outer(seq_len(nrow(b_mtx)), seq_len(nrow(a_mtx)), FUN = f1)
out
#         [,1]     [,2]     [,3]     [,4]     [,5]     [,6]
#[1,] 20.88061 21.84033 22.97825 24.26932 25.69047 27.22132
#[2,] 24.87971 25.57342 26.43861 27.45906 28.61818 29.89983
#[3,] 25.61250 27.29469 29.05168 30.87070 32.74141 34.65545

colMeans(out)
#[1] 23.79094 24.90281 26.15618 27.53303 29.01668 30.59220

identical(n_distance_mtx, out)
#[1] TRUE
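
Note that Vectorize is a wrapper around mapply, so test_fun is still called once per row pair. For much larger inputs, one possible alternative (a swapped-in technique, not part of the approach above) is to expand the squared distance as sum(a^2) + sum(b^2) - 2 * sum(a * b) and compute all pairs with matrix operations. A minimal sketch, where pairwise_dist is a hypothetical helper name introduced only for illustration:

```r
# Data and distance function as defined in the question
set.seed(123)
a_mtx <- matrix(1:30, ncol = 5)
b_mtx <- matrix(sample(1:15, 15), ncol = 5)
test_fun <- function(x, y) sqrt(sum((x - y)^2))

# Hypothetical helper (not from the answer above): Euclidean distance
# between every row of b (result rows) and every row of a (result columns)
pairwise_dist <- function(a, b) {
  # squared distance = |b_i|^2 + |a_j|^2 - 2 * (b_i . a_j)
  sq <- outer(rowSums(b^2), rowSums(a^2), "+") - 2 * tcrossprod(b, a)
  # clamp tiny negative values caused by floating-point rounding
  sqrt(pmax(sq, 0))
}

dist_mtx <- pairwise_dist(a_mtx, b_mtx)
colMeans(dist_mtx)
```

Because of floating-point rounding, this should be compared with the loop result using all.equal rather than identical. The full nrow(b_mtx) by nrow(a_mtx) matrix still has to fit in memory.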

edited Sep 19 '18 at 16:29

answered Sep 19 '18 at 16:18


I was wondering why my results are different than yours and the OP, and then I found `sample`. – R. Schifini Sep 19 '18 at 16:25
@R.Schifini yes, it is the `set.seed` which the OP didn't specify – akrun Sep 19 '18 at 16:26
Apologies about that. – Nigel Stackhouse Sep 19 '18 at 16:26
@NigelStackhouse It's okay. I was trying your `for` loop output as reference – akrun Sep 19 '18 at 16:27
I added the set.seed I used, though, for future users, these numbers will still be different since @akrun did his before I added that. His answer is still right. – Nigel Stackhouse Sep 19 '18 at 16:29
@NigelStackhouse Thanks for adding that. I updated the answer to avoid any confusion – akrun Sep 19 '18 at 16:29

In terms of time, this answer kicks butt. Thanks so much! Relative time elapsed: n_distance_mtx_apply = 10.2; n_distance_mtx_forloop = 26.4; n_distance_mtx_vec = 1. – Nigel Stackhouse Sep 19 '18 at 16:31
@NigelStackhouse `outer` is very fast. Only thing to keep on checking is the memory availability – akrun Sep 19 '18 at 16:32
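
On the memory point: if only the column means are needed, the full distance matrix does not have to exist at once. A rough sketch, assuming it is acceptable to loop over blocks of rows of a_mtx (col_means_chunked and chunk_size are hypothetical names, not from the thread):

```r
# Data and distance function as defined in the question
set.seed(123)
a_mtx <- matrix(1:30, ncol = 5)
b_mtx <- matrix(sample(1:15, 15), ncol = 5)
test_fun <- function(x, y) sqrt(sum((x - y)^2))

# Hypothetical helper: mean distance from each row of a to all rows of b.
# Only an nrow(b) x chunk_size block of distances is held at any time.
col_means_chunked <- function(a, b, chunk_size = 1000L) {
  # same Vectorize + outer idea as in the answer, applied per block
  f1 <- Vectorize(function(i, j) test_fun(a[j, ], b[i, ]))
  out <- numeric(nrow(a))
  for (s in seq(1L, nrow(a), by = chunk_size)) {
    idx <- s:min(s + chunk_size - 1L, nrow(a))  # rows of a in this block
    out[idx] <- colMeans(outer(seq_len(nrow(b)), idx, FUN = f1))
  }
  out
}

mean_dist <- col_means_chunked(a_mtx, b_mtx, chunk_size = 4L)
```

This trades one large allocation for a short loop over blocks; chunk_size would need tuning to the available memory.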

R. Schifini · Answer 2 · 2018-09-19T16:52:08.087

If I understood your question correctly, you want the Euclidean distance between each vector (row) in a_mtx and every vector (row) in b_mtx.

If so, you could use apply twice like this:

result = apply(a_mtx, 1, function(x){ apply(b_mtx, 1, function(y){ test_fun(x,y) })})

This gives a distance matrix:

         [,1]     [,2]     [,3]     [,4]     [,5]     [,6]
[1,] 20.88061 21.84033 22.97825 24.26932 25.69047 27.22132
[2,] 24.87971 25.57342 26.43861 27.45906 28.61818 29.89983
[3,] 25.61250 27.29469 29.05168 30.87070 32.74141 34.65545

where the row index is the corresponding vector (row) from b_mtx and the column index is the corresponding vector (row) from a_mtx.

Finally, obtain the mean distance using:

colMeans(result)
[1] 23.79094 24.90281 26.15618 27.53303 29.01668 30.59220
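
As a possible variation on the nested apply (a sketch only, not part of the original answer), the inner apply can be replaced by a vectorized step: sweep subtracts a row of a_mtx from every row of b_mtx at once, and rowSums collapses the squared differences.

```r
# Data and distance function as defined in the question
set.seed(123)
a_mtx <- matrix(1:30, ncol = 5)
b_mtx <- matrix(sample(1:15, 15), ncol = 5)
test_fun <- function(x, y) sqrt(sum((x - y)^2))

# For each row x of a_mtx: subtract x from every row of b_mtx (sweep over
# columns), square, sum across each row, and take the square root.
result_sweep <- apply(a_mtx, 1, function(x) {
  sqrt(rowSums(sweep(b_mtx, 2, x)^2))
})

# Same orientation as above: rows index b_mtx, columns index a_mtx
colMeans(result_sweep)
```

This still calls the anonymous function once per row of a_mtx, but each call handles all rows of b_mtx in one step.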

It would be better to update the output, since the OP has now added the `set.seed` – akrun Sep 19 '18 at 16:30
