
The data frame I have is as follows:

Binning_data[1:4,]
  person_id  V1  V2  V3  V4    V5  V6  V7  V8    V9 V10 V11 V12 V13 V14 V15 V16
1       312  74  80  NA  87  90.0  85  88  98  96.5  99  94  95  90  90  93 106
2       316  NA  NA 116 106 105.0 110 102 105 105.0 102  98 101  98  92  89  91
3       318  71  61  61  61  60.5  68  62  67  64.0  60  59  60  62  59  63  63
4       319  64  NA  80  80  83.0  84  87  83  85.0  88  87  95  74  70  63  83

I would like to compute the Euclidean distance between a given 'index_person_id' (say 312) and every other person_id, omitting NA bins.

For example: the normalized Euclidean distance between "312" and "316" should omit the first 3 bins (V1, V2, V3), because at least one of the two rows has an NA there. It should compute the Euclidean distance from the 4th bin to the 16th bin only, and divide by 13 (the number of bins that are non-empty in both rows).

The dimension of binning_data is 10000 x 17.

The output file should be of size 10000 x 2, with the first column being the person_id and the second column being the 'normalized Euclidean distance'.
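To make the target concrete, the 312-vs-316 pair can be checked by hand; a minimal sketch, with the two vectors copied from the sample table above:

```r
# Bin values for persons 312 and 316, copied from the sample table
v312 <- c(74, 80, NA, 87, 90.0, 85, 88, 98, 96.5, 99, 94, 95, 90, 90, 93, 106)
v316 <- c(NA, NA, 116, 106, 105.0, 110, 102, 105, 105.0, 102, 98, 101, 98, 92, 89, 91)

ok <- !is.na(v312) & !is.na(v316)  # bins non-empty in BOTH rows (V4..V16)
sum(ok)                            # 13 shared bins
mean((v312[ok] - v316[ok])^2)      # sum of squared differences / 13
```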

I am currently using sapply for this purpose:

# Row for the index person and the positions of its non-NA bins
index_person <- binning_data[binning_data$person_id == index_person_id, ]
non_empty_index_person <- which(!is.na(index_person[2:ncol(index_person)]))

distance <- data.frame(person_id = binning_data$person_id, dist = NA_real_)
distance[, 2] <- sapply(seq_along(binning_data$person_id), function(j) {
  compare_person <- binning_data[j, ]
  non_empty_compare_person <- which(!is.na(compare_person[2:ncol(compare_person)]))
  # Only bins that are non-NA in both rows contribute (+1 skips person_id)
  non_empty <- intersect(non_empty_index_person, non_empty_compare_person)
  distance_temp <- (index_person[non_empty + 1] - compare_person[non_empty + 1])^2
  as.numeric(mean(as.numeric(distance_temp)))
})

This seems to take a considerable amount of time. Is there a better way to do this?

David Arenburg

1 Answer


If I run your code I get:

 0.0000 146.0192 890.9000 200.8750

If you convert your data frame to a matrix and transpose it, you can subtract one column from the whole matrix, then use na.rm=TRUE in the mean to get the distances you want; colMeans does this over all columns at once. Here for row II of your sample data:

> II = 1
> m = t(as.matrix(binning_data[,-1]))
> colMeans((m - m[,II])^2, na.rm=TRUE)
       1        2        3        4 
  0.0000 146.0192 890.9000 200.8750 
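The key step is that subtracting the vector m[,II] from the matrix m recycles it down every column, so each person's difference against the index person is computed in one vectorized operation; a toy sketch of the recycling rule:

```r
# Recycling: a vector of length nrow(a) subtracted from a matrix is
# subtracted from each column in turn
a <- matrix(c(1, 2, 3, 4), nrow = 2)  # columns (1,2) and (3,4)
a - a[, 1]                            # column 1 becomes all zeros
```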

Your 10000x2 matrix is then (where here 10000==4):

> cbind(II,colMeans((m - m[,II])^2, na.rm=TRUE))
  II         
1  1   0.0000
2  1 146.0192
3  1 890.9000
4  1 200.8750

If you want to compute this for a given list of indexes, loop over them, for example with an lapply and an rbind to put the results back together as a data frame:

II = c(1,2,1,4,4)
do.call(rbind,lapply(II, function(i){data.frame(i,d=colMeans((m-m[,i])^2,na.rm=TRUE))}))
   i         d
1  1    0.0000
2  1  146.0192
3  1  890.9000
4  1  200.8750
11 2  146.0192
21 2    0.0000
31 2 1595.0179
41 2  456.7143
12 1    0.0000
22 1  146.0192
32 1  890.9000
42 1  200.8750
13 4  200.8750
23 4  456.7143
33 4  420.8833
43 4    0.0000
14 4  200.8750
24 4  456.7143
34 4  420.8833
44 4    0.0000

That's a data frame with 4 * length(II) rows (here 20), one block of four distances per requested index.
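To get the 10000 x 2 output the question asks for, with person_id rather than a column index in the first column, the id column can be bound back on. A self-contained sketch on the sample data, taking person 312 as the index person (m is built directly here so the snippet runs on its own):

```r
# Columns of m are persons, rows are bins (i.e. t(as.matrix(binning_data[,-1])))
m <- cbind(c(74, 80, NA, 87, 90.0, 85, 88, 98, 96.5, 99, 94, 95, 90, 90, 93, 106),
           c(NA, NA, 116, 106, 105.0, 110, 102, 105, 105.0, 102, 98, 101, 98, 92, 89, 91),
           c(71, 61, 61, 61, 60.5, 68, 62, 67, 64.0, 60, 59, 60, 62, 59, 63, 63),
           c(64, NA, 80, 80, 83.0, 84, 87, 83, 85.0, 88, 87, 95, 74, 70, 63, 83))
person_id <- c(312, 316, 318, 319)

i <- which(person_id == 312)  # column of the index person
out <- data.frame(person_id = person_id,
                  distance  = colMeans((m - m[, i])^2, na.rm = TRUE))
out
```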

Spacedman