The data frame I have is as follows:
Binning_data[1:4,]
person_id V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16
1 312 74 80 NA 87 90.0 85 88 98 96.5 99 94 95 90 90 93 106
2 316 NA NA 116 106 105.0 110 102 105 105.0 102 98 101 98 92 89 91
3 318 71 61 61 61 60.5 68 62 67 64.0 60 59 60 62 59 63 63
4 319 64 NA 80 80 83.0 84 87 83 85.0 88 87 95 74 70 63 83
I would like to compute the Euclidean distance of a given 'index_person_id' (say 312) with all the other person_id while omitting all NAs.
For example: Normalized Euclidean distance between "312" and "316" should omit the first 3 bins (V1,V2,V3) because atleast one of the two rows has NAs. It should just compute the Euclidean distance from 4th bin to 16th bin and divide by 13 (number of non empty bins)
Dimension of Binning_Data is 10000*17.
The output file should be of size 10000*2 with the first column being the person_id and the second column being the 'normalized Euclidean distance'.
I am currently using sapply for this purpose:
index_person<-binning_data[which(binning_data$person_id==index_person_id),]
non_empty_index_person=which(is.na(index_person[2:ncol(index_person)])==FALSE)
distance[,2]<-sapply(seq_along(binning_data$person_id),function(j) {
compare_person<-binning_data[j,]
non_empty_compare_person=which(is.na(compare_person[2:ncol(compare_person)])==FALSE)
non_empty=intersect(non_empty_index_person,non_empty_compare_person)
distance_temp=(index_person[non_empty+1]-compare_person[non_empty+1])^2
as.numeric(mean(distance_temp))
})
This seems to take a considerable amount of time. Is there a better way to do this?