
I apologize; this is my attempt at redeeming myself after a disastrous earlier attempt. Now I have a bit more clarity, so here I go again.

My goal is to find rows that are similar. So first I am interested in calculating the distance between rows. This is a test dataset below.

 Row            Blood   x1    x2    x3    x4
 1              A       0.01  0.16  0.31  0.46
 2              A       0.02  0.17  0.32  0.47
 3              A       0.03  0.18  0.33  0.48

 4              B       0.05  0.20  0.35  0.49
 5              B       0.06  0.21  0.36  0.50
 6              B       0.07  0.22  0.37  0.51

 7              AB      0.09  0.24  0.39  0.52
 8              AB      0.1   0.25  0.4   0.53
 9              AB      0.11  0.26  0.41  0.54

 10             O       0.13  0.28  0.43  0.55
 11             O       0.14  0.29  0.44  0.56
 12             O       0.15  0.3   0.45  0.57

There are two things here: 1) distance and 2) rows.

Consider this row combination.

For Row(1-4-7-10), the distance D = (d1,4 + d1,7 + d1,10 + d4,7 + d4,10 + d7,10)/6

{ Row1 (Blood A), Row4 (Blood B), Row7 (Blood AB), Row10 (Blood O) }

The distance for Row{1,4,7,10} is calculated from these pairwise distances:

d1,4  = distance between Row1 (Blood A) and Row4 (Blood B)
d1,7  = distance between Row1 (Blood A) and Row7 (Blood AB)
d1,10 = distance between Row1 (Blood A) and Row10 (Blood O)
d4,7  = distance between Row4 (Blood B) and Row7 (Blood AB)
d4,10 = distance between Row4 (Blood B) and Row10 (Blood O)
d7,10 = distance between Row7 (Blood AB) and Row10 (Blood O)

d1,4  = (0.01-0.05)^2 + (0.16-0.20)^2 + (0.31-0.35)^2 + (0.46-0.49)^2
d1,7  = (0.01-0.09)^2 + (0.16-0.24)^2 + (0.31-0.39)^2 + (0.46-0.52)^2
d1,10 = (0.01-0.13)^2 + (0.16-0.28)^2 + (0.31-0.43)^2 + (0.46-0.55)^2
d4,7  = (0.05-0.09)^2 + (0.20-0.24)^2 + (0.35-0.39)^2 + (0.49-0.52)^2
d4,10 = (0.05-0.13)^2 + (0.20-0.28)^2 + (0.35-0.43)^2 + (0.49-0.55)^2
d7,10 = (0.09-0.13)^2 + (0.24-0.28)^2 + (0.39-0.43)^2 + (0.52-0.55)^2
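Plugging in the numbers, the six squared distances sum to 0.114, so D = 0.114/6 = 0.019. A quick sanity check of this arithmetic in base R:

```r
# the four rows of the combination 1-4-7-10 (x1..x4 only)
x <- rbind(c(0.01, 0.16, 0.31, 0.46),   # Row 1  (A)
           c(0.05, 0.20, 0.35, 0.49),   # Row 4  (B)
           c(0.09, 0.24, 0.39, 0.52),   # Row 7  (AB)
           c(0.13, 0.28, 0.43, 0.55))   # Row 10 (O)
d <- as.matrix(dist(x))^2               # squared pairwise distances
D <- sum(d[upper.tri(d)]) / 6           # average of the 6 pairwise distances
D
# 0.019
```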

Similarly, I am interested in calculating the distances for all 81 row combinations (3 * 3 * 3 * 3 = 81, one row from each blood group).

The final expected dataset should look like this below.

 Row         Distance
 1-4-7-10
 1-4-7-11     
 1-4-7-12

 1-4-8-10          
 1-4-8-11          
 1-4-8-12

 1-4-9-10                    
 1-4-9-11                    
 1-4-9-12

 1-5-7-10
 1-5-7-11     
 1-5-7-12

 1-5-8-10          
 1-5-8-11          
 1-5-8-12

 1-5-9-10                    
 1-5-9-11                    
 1-5-9-12  

 1-6-7-10
 .
 .
 .
 3-6-9-12

I know I can do this with 4 nested loops and lists. I am wondering if there is a more efficient way to accomplish this.
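The four-nested-loop version I have in mind looks roughly like this (a sketch; `sqdist` is just a helper I made up for the squared Euclidean distance):

```r
# test data from the table above
dat <- data.frame(
  Row   = 1:12,
  Blood = rep(c("A", "B", "AB", "O"), each = 3),
  x1    = c(0.01, 0.02, 0.03, 0.05, 0.06, 0.07, 0.09, 0.10, 0.11, 0.13, 0.14, 0.15),
  x2    = c(0.16, 0.17, 0.18, 0.20, 0.21, 0.22, 0.24, 0.25, 0.26, 0.28, 0.29, 0.30),
  x3    = c(0.31, 0.32, 0.33, 0.35, 0.36, 0.37, 0.39, 0.40, 0.41, 0.43, 0.44, 0.45),
  x4    = c(0.46, 0.47, 0.48, 0.49, 0.50, 0.51, 0.52, 0.53, 0.54, 0.55, 0.56, 0.57)
)

sqdist <- function(a, b) sum((a - b)^2)   # squared Euclidean distance

grp <- split(dat$Row, dat$Blood)          # row ids per blood group
res <- list()
for (i in grp$A) for (j in grp$B) for (k in grp$AB) for (l in grp$O) {
  idx <- c(i, j, k, l)
  prs <- combn(idx, 2)                    # the 6 unordered pairs
  D   <- mean(apply(prs, 2, function(p)
           sqdist(unlist(dat[p[1], -(1:2)]), unlist(dat[p[2], -(1:2)]))))
  res[[paste(idx, collapse = "-")]] <- D
}
out <- data.frame(Row = names(res), Distance = unlist(res), row.names = NULL)
# first row: 1-4-7-10 with Distance 0.019, last row: 3-6-9-12
```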

Emily Fassbender

2 Answers


Similar to the other solution, but I think you could do some matrix indexing inside the function applied to each combination to select the correct cells to add up:
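The question's data isn't shown in a ready-to-paste form, so here is one way to enter it as a data frame `dat` (the name assumed by the code below):

```r
# the question's test data as a data frame
dat <- data.frame(
  Row   = 1:12,
  Blood = rep(c("A", "B", "AB", "O"), each = 3),
  x1    = c(0.01, 0.02, 0.03, 0.05, 0.06, 0.07, 0.09, 0.10, 0.11, 0.13, 0.14, 0.15),
  x2    = c(0.16, 0.17, 0.18, 0.20, 0.21, 0.22, 0.24, 0.25, 0.26, 0.28, 0.29, 0.30),
  x3    = c(0.31, 0.32, 0.33, 0.35, 0.36, 0.37, 0.39, 0.40, 0.41, 0.43, 0.44, 0.45),
  x4    = c(0.46, 0.47, 0.48, 0.49, 0.50, 0.51, 0.52, 0.53, 0.54, 0.55, 0.56, 0.57)
)
```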

Be aware that the default ?dist calculation is:

sqrt(sum((x_i - y_i)^2))

...while you are using:

sum((x_i - y_i)^2)

...so I square the result below:

dd <- as.matrix(dist(dat[-(1:2)]))^2       # squared pairwise distances, all rows

apply(
  expand.grid(split(dat$Row, dat$Blood)),  # one row id from each blood group
  1,
  function(x) sum(dd[t(combn(x, 2))])      # sum the 6 pairwise entries for this combo
)
#  [1] 0.1140 0.0972 0.0828 0.1212 0.1036 0.0884 0.1308 ...

Checks out versus manual calculation for the first desired result:

L <- c(
d1_4   = (0.01-0.05)^2 + (0.16-0.20)^2 + (0.31-0.35)^2 +  (0.46-0.49)^2,           
d1_7   = (0.01-0.09)^2 + (0.16-0.24)^2 + (0.31-0.39)^2 +  (0.46-0.52)^2,             
d1_10  = (0.01-0.13)^2 + (0.16-0.28)^2 + (0.31-0.43)^2 +  (0.46-0.55)^2,             
d4_7   = (0.05-0.09)^2 + (0.20-0.24)^2 + (0.35-0.39)^2 +  (0.49-0.52)^2,   
d4_10  = (0.05-0.13)^2 + (0.20-0.28)^2 + (0.35-0.43)^2 +  (0.49-0.55)^2,     
d7_10  = (0.09-0.13)^2 + (0.24-0.28)^2 + (0.39-0.43)^2 +  (0.52-0.55)^2
)
sum(L)
# 0.114
thelatemail
  • thanks latemail, this solution worked for a small dataset. However, I am getting an error `Error: cannot allocate vector of size 680603.5 Gb` when I try this on a larger dataset, which has about 863 cases of blood group A, 8676 cases of blood group AB, 7951 cases of blood group B and 6984 cases of blood group O. Would there be an alternate solution that could scale to a larger dataset with a few thousand cases in each group? – Emily Fassbender May 03 '19 at 04:50
  • 1
    @EmilyFassbender - I'm not surprised. What are you going to do with 415,000,000,000,000 comparisons? – thelatemail May 03 '19 at 05:02
  • good question latemail, I only want to "retain" rows where the distance is smaller than or equal to a fixed threshold value, let us say `0.1`, i.e. `dist <= 0.1` – Emily Fassbender May 03 '19 at 05:04
  • @EmilyFassbender - Even with very efficient code, you'll be waiting hours if not days to run that many calculations using something straightforward. The whole problem needs to be reconceptualised to something much more efficient, and I'm not sure how to do that really. – thelatemail May 03 '19 at 05:13
  • matter of fact, the `as.matrix(dist...` part worked fine without any issues. It is the part transforming the results to match the row combinations, `apply( expand.grid( `, that is causing this problem. Knowing which row combination led to which result is very important here, because without that I wouldn't know which rows the distance measure is pointing to. – Emily Fassbender May 03 '19 at 05:17

"My goal is to find rows that are similar."

Two possible approaches:

1) k-means, to separate all the data into k clusters, assigning each row to the cluster whose centroid it is closest to.

blood_fake$cluster_assignment <- kmeans(blood_fake[, -c(1:2)], centers = 10)$cluster

library(ggplot2)  
ggplot(blood_fake, aes(x1, x2, color = as.factor(cluster_assignment))) + 
  geom_point(size = 0.3) + 
  theme_minimal() + 
  theme(legend.position = "bottom")

(scatterplot of x1 vs x2, points coloured by cluster assignment)

2) fuzzyjoin::distance_left_join can be used to find matches that are within a distance threshold. It worked ok for me with 10,000 rows on an old computer with 4 GB RAM if I ran separately on subsets, but froze up when I tried with all at once.

library(tidyverse); library(fuzzyjoin)
blood_fake %>%
  filter(type == "A") %>%
  distance_left_join(blood_fake,  by = c("x1", "x2", "x3", "x4"), distance_col = "dist", max_dist = 0.05) %>%
  filter(dist > 0) %>%
  arrange(dist)

#   row.x type.x        x1.x        x2.x        x3.x        x4.x row.y type.y       x1.y        x2.y        x3.y        x4.y        dist
#1   8362      A 0.055618062 0.008783874 0.001162073 0.145280936  4786      B 0.05807814 0.009353543 0.002046247 0.146829206 0.003091180
#2   4284      A 0.163417186 0.032845642 0.114224202 0.339505310  2060     AB 0.16676132 0.031621044 0.115635984 0.339447690 0.003831363
#3   8338      A 0.194389332 0.070951537 0.132582667 0.004634504  4839     AB 0.19793256 0.067944898 0.130012004 0.005525959 0.005384918
#4   6849      A 0.277700944 0.027618307 0.034390833 0.158798952  7698      A 0.27344845 0.025502562 0.033016888 0.160972663 0.005401185
#5   7698      A 0.273448453 0.025502562 0.033016888 0.160972663  6849      A 0.27770094 0.027618307 0.034390833 0.158798952 0.005401185
#6   4281      A 0.281189896 0.323468620 0.107589336 0.096526579  6251      A 0.27891482 0.321343667 0.109619143 0.100667052 0.005563725

test data

n <- 10000
set.seed(42)
blood_fake <- data.frame(row = 1:n,
                         type = sample(c("A","B","AB","O"), n, replace = T),
                         x1 = runif(n, min = 0, max = 0.5),
                         x2 = runif(n, min = 0, max = 0.5),
                         x3 = runif(n, min = 0, max = 0.5),
                         x4 = runif(n, min = 0, max = 0.5)
                         )
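Regarding the memory blow-up mentioned in the comments: one base-R alternative to a full join is to scan the table in chunks, keeping only pairs within the threshold, so the full n-by-n distance matrix is never held in memory at once. This is not fuzzyjoin, just a plain chunked distance computation, sketched here on a smaller sample of the same kind of fake data:

```r
# chunked scan: compare a block of rows at a time against the whole table,
# keeping only pairs closer than a threshold
set.seed(42)
n <- 500                                  # smaller sample for illustration
blood_fake <- data.frame(row  = 1:n,
                         type = sample(c("A", "B", "AB", "O"), n, replace = TRUE),
                         x1 = runif(n, 0, 0.5), x2 = runif(n, 0, 0.5),
                         x3 = runif(n, 0, 0.5), x4 = runif(n, 0, 0.5))

m         <- as.matrix(blood_fake[, c("x1", "x2", "x3", "x4")])
threshold <- 0.05
chunk     <- 100
hits      <- list()
for (start in seq(1, n, by = chunk)) {
  idx <- start:min(start + chunk - 1, n)
  # Euclidean distances from this chunk to every row, via the identity
  # |a - b|^2 = |a|^2 + |b|^2 - 2 a.b
  d2 <- outer(rowSums(m[idx, , drop = FALSE]^2), rowSums(m^2), "+") -
        2 * m[idx, , drop = FALSE] %*% t(m)
  d  <- sqrt(pmax(d2, 0))                 # clamp tiny negative rounding errors
  w  <- which(d > 0 & d <= threshold, arr.ind = TRUE)
  hits[[length(hits) + 1]] <-
    data.frame(row.x = idx[w[, "row"]], row.y = w[, "col"], dist = d[w])
}
matches <- do.call(rbind, hits)           # both directions of each pair appear
```

Like the distance_left_join output above, each qualifying pair appears in both directions; the chunk size trades memory for the number of passes.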
Jon Spring