Fast computation of average proximity in proximity matrix

Question

I've got a similarity matrix between all cases and, in a separate data frame, the classes of these cases. I want to compute the average similarity between cases from the same class. Here is the equation for an example n from class j:

Average proximity between cases: $\bar{P}(n) = \sum_{cl(k) = j} \mathrm{prox}^2(n, k)$

We have to compute a sum of all squared proximities between n and all cases k that come from the same class as n. Link: http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#outliers

I implemented that with two for loops, but it is really slow. Is there a faster way to do such a thing in R?

Thanks.

//DATA (dput)

Data frame with classes:

structure(list(class = structure(c(1L, 2L, 2L, 1L, 3L, 3L, 1L, 
                            1L, 2L, 3L), .Label = c("1", "2", "3", "5", "6", "7"), class = "factor")), .Names = "class", row.names = c(NA, 
            -10L), class = "data.frame")

Proximity matrix (row m and column m correspond to class in row m of data frame above):

structure(c(1, 0.60996875, 0.51775, 0.70571875, 0.581375, 0.42578125, 
0.6595, 0.7134375, 0.645375, 0.468875, 0.60996875, 1, 0.77021875, 
0.55171875, 0.540375, 0.53084375, 0.4943125, 0.462625, 0.7910625, 
0.56321875, 0.51775, 0.77021875, 1, 0.451375, 0.60353125, 0.62353125, 
0.5203125, 0.43934375, 0.6909375, 0.57159375, 0.70571875, 0.55171875, 
0.451375, 1, 0.69196875, 0.59390625, 0.660375, 0.76834375, 0.606875, 
0.65834375, 0.581375, 0.540375, 0.60353125, 0.69196875, 1, 0.7194375, 
0.684, 0.68090625, 0.50553125, 0.60234375, 0.42578125, 0.53084375, 
0.62353125, 0.59390625, 0.7194375, 1, 0.53665625, 0.553125, 0.513, 
0.801625, 0.6595, 0.4943125, 0.5203125, 0.660375, 0.684, 0.53665625, 
1, 0.8456875, 0.52878125, 0.65303125, 0.7134375, 0.462625, 0.43934375, 
0.76834375, 0.68090625, 0.553125, 0.8456875, 1, 0.503, 0.6215, 
0.645375, 0.7910625, 0.6909375, 0.606875, 0.50553125, 0.513, 
0.52878125, 0.503, 1, 0.60653125, 0.468875, 0.56321875, 0.57159375, 
0.65834375, 0.60234375, 0.801625, 0.65303125, 0.6215, 0.60653125, 
1), .Dim = c(10L, 10L))

Correct result:

c(2.44197227050781, 2.21901680175781, 2.07063155175781, 2.52448621289062, 
1.88040830957031, 2.16019295703125, 2.58622273828125, 2.81453253222656, 
2.1031745078125, 2.00542063378906)

IRTFM · Accepted Answer · 2012-10-04T00:53:39.833


Should be possible. Your notation does not make clear whether we will find members of like classes in the rows or columns, so this answer presumes in the columns but the obvious modifications would work as well if they were in rows.

colSums(mat^2)  # in R this is element-wise application of ^2 rather than matrix multiplication.

Since both operations are vectorized it would be expected to be much faster than for-loops.
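
To make the element-wise point concrete, here is a small sketch on a made-up 2x2 matrix (the name `m` and its values are hypothetical, not taken from the question):

```r
# Hypothetical 2x2 matrix, for illustration only
m <- matrix(c(1, 0.5, 0.5, 1), nrow = 2)

sq   <- m^2       # element-wise: every entry is squared individually
prod <- m %*% m   # true matrix product, a different operation

# Per-column sums of the squared entries
col_totals <- colSums(sq)
print(col_totals)
```

Here `sq` and `prod` differ, which is why `^` and `%*%` should not be confused when squaring proximities.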

With the modification and assuming the matrix is named 'mat' and the class-dataframe named 'cldf':

sapply( 1:nrow(mat) , 
           function(r) sum(mat[r, cldf[['class']][r] ==  cldf[['class']] ]^2)  )
[1] 2.441972 2.219017 2.070632 2.524486 1.880408 2.160193 2.586223 2.814533 2.103175 2.005421
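
A loop-free variant may also be worth trying. The following is only a sketch, not part of the answer as originally given: it builds a logical same-class mask with `outer()` and uses it to zero out cross-class entries before summing. The small `mat` and `cldf` below are made-up stand-ins with hypothetical values, and `same`, `res_mask` and `res_loop` are names introduced here for illustration.

```r
# Made-up stand-ins for 'mat' and 'cldf' (hypothetical values)
cldf <- data.frame(class = factor(c("1", "2", "1", "2")))
mat <- matrix(c(1,   0.2, 0.5, 0.3,
                0.2, 1,   0.4, 0.6,
                0.5, 0.4, 1,   0.1,
                0.3, 0.6, 0.1, 1), nrow = 4)

# Compare class labels as character strings to avoid factor-level surprises
cls <- as.character(cldf[['class']])

# same[r, k] is TRUE when cases r and k belong to the same class
same <- outer(cls, cls, "==")

# Multiplying by the mask zeroes cross-class entries; sum the rest per row
res_mask <- rowSums(mat^2 * same)

# The sapply version from above, for comparison
res_loop <- sapply(1:nrow(mat),
                   function(r) sum(mat[r, cls[r] == cls]^2))

print(all.equal(res_mask, res_loop))
```

The mask needs an n-by-n logical matrix, so for very large n the `sapply` version may be the more memory-friendly choice.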

edited Oct 04 '12 at 00:53

answered Oct 03 '12 at 21:59

IRTFM


The problem is that we can't tell which examples are from the same class just from the matrix. We have to look it up from the data frame. – Uros K Oct 03 '12 at 22:04
In that case your posting of a representative example and a "correct answer" are way overdue. – IRTFM Oct 03 '12 at 22:06
I think I can modify the matrix (or for each class use some subset of the matrix) and use your answer. I'll try and see how fast it is. Thanks for your help. – Uros K Oct 03 '12 at 22:16
I have some further approaches in mind that would require an example on which they could be tested. – IRTFM Oct 03 '12 at 22:54
I provided some data: classes, proximity matrix and correct result. – Uros K Oct 03 '12 at 23:32
It is much faster. Thank you. – Uros K Oct 04 '12 at 11:00
