
I apologize; this is my attempt at redeeming myself after a disastrous earlier attempt. Now I have a bit more clarity, so here I go again.

My goal is to find rows that are similar. So first I am interested in calculating the distance between rows. This is a test dataset below.

 Row            Blood   x1    x2    x3    x4
 1              A       0.01  0.16  0.31  0.46
 2              A       0.02  0.17  0.32  0.47
 3              A       0.03  0.18  0.33  0.48

 4              B       0.05  0.20  0.35  0.49
 5              B       0.06  0.21  0.36  0.50
 6              B       0.07  0.22  0.37  0.51

 7              AB      0.09  0.24  0.39  0.52
 8              AB      0.1   0.25  0.4   0.53
 9              AB      0.11  0.26  0.41  0.54

 10             O       0.13  0.28  0.43  0.55
 11             O       0.14  0.29  0.44  0.56
 12             O       0.15  0.3   0.45  0.57

There are two things here: 1) distance and 2) rows.

Consider this row combination.

For Row(1-4-7-10), the distance D = (d1,4 + d1,7 + d1,10 + d4,7 + d4,10 + d7,10)/6

{ Row1 (Blood A), Row4 (Blood B), Row7 (Blood AB), Row10 (Blood O) }

The distance for Row{1,4,7,10} is calculated from these pairwise distances:

d1,4  = distance between Row1 (Blood A) and Row4 (Blood B)
d1,7  = distance between Row1 (Blood A) and Row7 (Blood AB)
d1,10 = distance between Row1 (Blood A) and Row10 (Blood O)
d4,7  = distance between Row4 (Blood B) and Row7 (Blood AB)
d4,10 = distance between Row4 (Blood B) and Row10 (Blood O)
d7,10 = distance between Row7 (Blood AB) and Row10 (Blood O)

d1,4  = (0.01-0.05)^2 + (0.16-0.20)^2 + (0.31-0.35)^2 + (0.46-0.49)^2
d1,7  = (0.01-0.09)^2 + (0.16-0.24)^2 + (0.31-0.39)^2 + (0.46-0.52)^2
d1,10 = (0.01-0.13)^2 + (0.16-0.28)^2 + (0.31-0.43)^2 + (0.46-0.55)^2
d4,7  = (0.05-0.09)^2 + (0.20-0.24)^2 + (0.35-0.39)^2 + (0.49-0.52)^2
d4,10 = (0.05-0.13)^2 + (0.20-0.28)^2 + (0.35-0.43)^2 + (0.49-0.55)^2
d7,10 = (0.09-0.13)^2 + (0.24-0.28)^2 + (0.39-0.43)^2 + (0.52-0.55)^2
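Plugging in the numbers, the six squared distances sum to 0.114, so D = 0.114/6 = 0.019. A quick sanity check of this arithmetic in base R:

```r
# the four rows of the combination 1-4-7-10 (x1..x4 only)
x <- rbind(c(0.01, 0.16, 0.31, 0.46),   # Row 1  (A)
           c(0.05, 0.20, 0.35, 0.49),   # Row 4  (B)
           c(0.09, 0.24, 0.39, 0.52),   # Row 7  (AB)
           c(0.13, 0.28, 0.43, 0.55))   # Row 10 (O)
d <- as.matrix(dist(x))^2               # squared pairwise distances
D <- sum(d[upper.tri(d)]) / 6           # average of the 6 pairwise distances
D
# 0.019
```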

Similarly, I am interested in calculating the distances for all 81 row combinations (3 * 3 * 3 * 3 = 81, one row from each blood group).

The final expected dataset should look like this below.

 Row         Distance
 1-4-7-10
 1-4-7-11     
 1-4-7-12

 1-4-8-10          
 1-4-8-11          
 1-4-8-12

 1-4-9-10                    
 1-4-9-11                    
 1-4-9-12

 1-5-7-10
 1-5-7-11     
 1-5-7-12

 1-5-8-10          
 1-5-8-11          
 1-5-8-12

 1-5-9-10                    
 1-5-9-11                    
 1-5-9-12  

 1-6-7-10
 .
 .
 .
 3-6-9-12

I know I can do this with 4 nested loops and lists. I am wondering if there is a more efficient way to accomplish this.
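The four-nested-loop version I have in mind looks roughly like this (a sketch; `sqdist` is just a helper I made up for the squared Euclidean distance):

```r
# test data from the table above
dat <- data.frame(
  Row   = 1:12,
  Blood = rep(c("A", "B", "AB", "O"), each = 3),
  x1    = c(0.01, 0.02, 0.03, 0.05, 0.06, 0.07, 0.09, 0.10, 0.11, 0.13, 0.14, 0.15),
  x2    = c(0.16, 0.17, 0.18, 0.20, 0.21, 0.22, 0.24, 0.25, 0.26, 0.28, 0.29, 0.30),
  x3    = c(0.31, 0.32, 0.33, 0.35, 0.36, 0.37, 0.39, 0.40, 0.41, 0.43, 0.44, 0.45),
  x4    = c(0.46, 0.47, 0.48, 0.49, 0.50, 0.51, 0.52, 0.53, 0.54, 0.55, 0.56, 0.57)
)

sqdist <- function(a, b) sum((a - b)^2)   # squared Euclidean distance

grp <- split(dat$Row, dat$Blood)          # row ids per blood group
res <- list()
for (i in grp$A) for (j in grp$B) for (k in grp$AB) for (l in grp$O) {
  idx <- c(i, j, k, l)
  prs <- combn(idx, 2)                    # the 6 unordered pairs
  D   <- mean(apply(prs, 2, function(p)
           sqdist(unlist(dat[p[1], -(1:2)]), unlist(dat[p[2], -(1:2)]))))
  res[[paste(idx, collapse = "-")]] <- D
}
out <- data.frame(Row = names(res), Distance = unlist(res), row.names = NULL)
# first row: 1-4-7-10 with Distance 0.019, last row: 3-6-9-12
```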

Emily Fassbender

2 Answers


Similar to the other solution, but I think you could do some matrix indexing inside the function applied to each combination to select the correct cells to add up:
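The question's data isn't shown in a ready-to-paste form, so here is one way to enter it as a data frame `dat` (the name assumed by the code below):

```r
# the question's test data as a data frame
dat <- data.frame(
  Row   = 1:12,
  Blood = rep(c("A", "B", "AB", "O"), each = 3),
  x1    = c(0.01, 0.02, 0.03, 0.05, 0.06, 0.07, 0.09, 0.10, 0.11, 0.13, 0.14, 0.15),
  x2    = c(0.16, 0.17, 0.18, 0.20, 0.21, 0.22, 0.24, 0.25, 0.26, 0.28, 0.29, 0.30),
  x3    = c(0.31, 0.32, 0.33, 0.35, 0.36, 0.37, 0.39, 0.40, 0.41, 0.43, 0.44, 0.45),
  x4    = c(0.46, 0.47, 0.48, 0.49, 0.50, 0.51, 0.52, 0.53, 0.54, 0.55, 0.56, 0.57)
)
```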

Be aware that the default ?dist calculation is:

sqrt(sum((x_i - y_i)^2))

...while you are using:

sum((x_i - y_i)^2)

...so I square the result below:

dd <- as.matrix(dist(dat[-(1:2)]))^2       # squared pairwise distances, all rows

apply(
  expand.grid(split(dat$Row, dat$Blood)),  # one row id from each blood group
  1,
  function(x) sum(dd[t(combn(x, 2))])      # sum the 6 pairwise entries for this combo
)
#  [1] 0.1140 0.0972 0.0828 0.1212 0.1036 0.0884 0.1308 ...

Checks out versus manual calculation for the first desired result:

L <- c(
d1_4   = (0.01-0.05)^2 + (0.16-0.20)^2 + (0.31-0.35)^2 +  (0.46-0.49)^2,           
d1_7   = (0.01-0.09)^2 + (0.16-0.24)^2 + (0.31-0.39)^2 +  (0.46-0.52)^2,             
d1_10  = (0.01-0.13)^2 + (0.16-0.28)^2 + (0.31-0.43)^2 +  (0.46-0.55)^2,             
d4_7   = (0.05-0.09)^2 + (0.20-0.24)^2 + (0.35-0.39)^2 +  (0.49-0.52)^2,   
d4_10  = (0.05-0.13)^2 + (0.20-0.28)^2 + (0.35-0.43)^2 +  (0.49-0.55)^2,     
d7_10  = (0.09-0.13)^2 + (0.24-0.28)^2 + (0.39-0.43)^2 +  (0.52-0.55)^2
)
sum(L)
# 0.114
thelatemail
  • thanks latemail, this solution worked for a small dataset. However, I am getting an error `Error: cannot allocate vector of size 680603.5 Gb` when I try this on a larger dataset, which has about 863 cases of blood group A, 8676 cases of blood group AB, 7951 cases of blood group B and 6984 cases of blood group O. Would there be an alternate solution that could scale to a larger dataset with a few thousand cases in each group? – Emily Fassbender May 03 '19 at 04:50
  • 1
    @EmilyFassbender - I'm not surprised. What are you going to do with 415,000,000,000,000 comparisons? – thelatemail May 03 '19 at 05:02
  • good question latemail, I only want to "retain" rows where the distance is smaller than or equal to a fixed threshold value, let us say `0.1`, i.e. `dist <= 0.1` – Emily Fassbender May 03 '19 at 05:04
  • @EmilyFassbender - Even with very efficient code, you'll be waiting hours if not days to run that many calculations using something straightforward. The whole problem needs to be reconceptualised to something much more efficient, and I'm not sure how to do that really. – thelatemail May 03 '19 at 05:13
  • matter of fact, the `as.matrix(dist...` part worked fine without any issues. It is the part transforming the results to match the row combinations, `apply( expand.grid( `, that is causing this problem. Knowing which row combination led to which result is very important here, because without that I wouldn't know which rows the distance measure is pointing to. – Emily Fassbender May 03 '19 at 05:17

"My goal is to find rows that are similar."

Two possible approaches:

1) k-means, to separate all the data into k clusters, assigning each row to the cluster whose centroid it is closest to.

blood_fake$cluster_assignment <- kmeans(blood_fake[, -c(1:2)], centers = 10)$cluster

library(ggplot2)  
ggplot(blood_fake, aes(x1, x2, color = as.factor(cluster_assignment))) + 
  geom_point(size = 0.3) + 
  theme_minimal() + 
  theme(legend.position = "bottom")

(scatterplot of x1 vs x2, points coloured by cluster assignment)

2) fuzzyjoin::distance_left_join can be used to find matches that are within a distance threshold. It worked ok for me with 10,000 rows on an old computer with 4 GB RAM if I ran separately on subsets, but froze up when I tried with all at once.

library(tidyverse); library(fuzzyjoin)
blood_fake %>%
  filter(type == "A") %>%
  distance_left_join(blood_fake,  by = c("x1", "x2", "x3", "x4"), distance_col = "dist", max_dist = 0.05) %>%
  filter(dist > 0) %>%
  arrange(dist)

#   row.x type.x        x1.x        x2.x        x3.x        x4.x row.y type.y       x1.y        x2.y        x3.y        x4.y        dist
#1   8362      A 0.055618062 0.008783874 0.001162073 0.145280936  4786      B 0.05807814 0.009353543 0.002046247 0.146829206 0.003091180
#2   4284      A 0.163417186 0.032845642 0.114224202 0.339505310  2060     AB 0.16676132 0.031621044 0.115635984 0.339447690 0.003831363
#3   8338      A 0.194389332 0.070951537 0.132582667 0.004634504  4839     AB 0.19793256 0.067944898 0.130012004 0.005525959 0.005384918
#4   6849      A 0.277700944 0.027618307 0.034390833 0.158798952  7698      A 0.27344845 0.025502562 0.033016888 0.160972663 0.005401185
#5   7698      A 0.273448453 0.025502562 0.033016888 0.160972663  6849      A 0.27770094 0.027618307 0.034390833 0.158798952 0.005401185
#6   4281      A 0.281189896 0.323468620 0.107589336 0.096526579  6251      A 0.27891482 0.321343667 0.109619143 0.100667052 0.005563725

test data

n <- 10000
set.seed(42)
blood_fake <- data.frame(row = 1:n,
                         type = sample(c("A","B","AB","O"), n, replace = T),
                         x1 = runif(n, min = 0, max = 0.5),
                         x2 = runif(n, min = 0, max = 0.5),
                         x3 = runif(n, min = 0, max = 0.5),
                         x4 = runif(n, min = 0, max = 0.5)
                         )
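Regarding the memory blow-up mentioned in the comments: one base-R alternative to a full join is to scan the table in chunks, keeping only pairs within the threshold, so the full n-by-n distance matrix is never held in memory at once. This is not fuzzyjoin, just a plain chunked distance computation, sketched here on a smaller sample of the same kind of fake data:

```r
# chunked scan: compare a block of rows at a time against the whole table,
# keeping only pairs closer than a threshold
set.seed(42)
n <- 500                                  # smaller sample for illustration
blood_fake <- data.frame(row  = 1:n,
                         type = sample(c("A", "B", "AB", "O"), n, replace = TRUE),
                         x1 = runif(n, 0, 0.5), x2 = runif(n, 0, 0.5),
                         x3 = runif(n, 0, 0.5), x4 = runif(n, 0, 0.5))

m         <- as.matrix(blood_fake[, c("x1", "x2", "x3", "x4")])
threshold <- 0.05
chunk     <- 100
hits      <- list()
for (start in seq(1, n, by = chunk)) {
  idx <- start:min(start + chunk - 1, n)
  # Euclidean distances from this chunk to every row, via the identity
  # |a - b|^2 = |a|^2 + |b|^2 - 2 a.b
  d2 <- outer(rowSums(m[idx, , drop = FALSE]^2), rowSums(m^2), "+") -
        2 * m[idx, , drop = FALSE] %*% t(m)
  d  <- sqrt(pmax(d2, 0))                 # clamp tiny negative rounding errors
  w  <- which(d > 0 & d <= threshold, arr.ind = TRUE)
  hits[[length(hits) + 1]] <-
    data.frame(row.x = idx[w[, "row"]], row.y = w[, "col"], dist = d[w])
}
matches <- do.call(rbind, hits)           # both directions of each pair appear
```

Like the distance_left_join output above, each qualifying pair appears in both directions; the chunk size trades memory for the number of passes.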
Jon Spring