r match rows that that are similar

Question

I am dealing with few observations like this below. My goal is to identify rows that match/similar to each other based on a Euclidean distance concept, considering vector {x1,x2,x3,x4} and threshold 0.2. Any distance between rows that are less than 0.2 are considered similar.

 Observation    Blood   x1   x2    x3     x4
 1              A      0.01  0.16  0.31  0.46
 2              A      0.02  0.17  0.32  0.47
 3              A      0.03  0.18  0.33  0.48

 4              B      0.05  0.20  0.35  0.49
 5              B      0.06  0.21  0.36  0.50
 6              B      0.07  0.22  0.37  0.51

 7              AB     0.09  0.24  0.39  0.52
 8              AB     0.1   0.25  0.4   0.53
 9              AB     0.11  0.26  0.41  0.54

 10             O      0.13  0.28  0.43  0.55
 11             O      0.14  0.29  0.44  0.56
 12             O      0.15  0.3   0.45  0.57

I can do this using a very clunky double forloop. I am wondering if there is an efficient way to accomplish this.

Expected Output

 Observation    Blood   x1   x2    x3     x4    Match
 1              A      0.01  0.16  0.31  0.46   Yes
 2              A      0.02  0.17  0.32  0.47   Yes
 3              A      0.03  0.18  0.33  0.48   No 

 4              B      0.05  0.20  0.35  0.49   Yes
 5              B      0.06  0.21  0.36  0.50   Yes
 6              B      0.07  0.22  0.37  0.51   No

 7              AB     0.09  0.24  0.39  0.52   No
 8              AB     0.1   0.25  0.4   0.53   Yes
 9              AB     0.11  0.26  0.41  0.54   No

 10             O      0.13  0.28  0.43  0.55   No
 11             O      0.14  0.29  0.44  0.56   Yes
 12             O      0.15  0.3   0.45  0.57   Yes

 Match Dataset

 RowToBeMatched      FoundMatches_Bgroup_B  FoundMatches_Bgroup_AB  FoundMatches_Bgroup_O
 1                   4                      8                       11    
 2                   5                      NA                      12

So on...

So the `Match` column just tells you whether or not a match exists in the whole data frame? You don't care which row(s) match, just want to know whether there is a match or not? And the `Blood` column does not matter for the purposes of this question? (Or do you only look for matches within the same blood type?) — Gregor Thomas, Apr 24 '19 at 02:19
This looks like a bit of clustering (`hclust`, whatever), in that if observations within a particular `Blood` group are close-enough to the rest of the group, they match. — r2evans, Apr 24 '19 at 02:27
People usually use `kmeans` clustering in the first instance, which is included in the `stats` package in Base R. The `dbscan` package is excellent too, but is non-parametric so maybe not quite what you want. — Simon Woodward, Apr 24 '19 at 02:36
How do you determine that A1 and A2 are matches but A2 and A3 are not? Or is the output not based on calculations? — Jon Spring, Apr 24 '19 at 03:45
@Gregor, sorry I am a bit late. This is 1:1:1:1 match. For every row in BloodGroup = A, I am interested in finding atleast one match in BloodGroup = B, BloodGroup= AB, BloodGroup=O — Emily Fassbender, Apr 24 '19 at 16:29
Please edit your question to say more about this, as it sounds like an important requirement that is not clear (or present at all?) in the current wording of the question. — Jon Spring, Apr 24 '19 at 16:38
@EmilyFassbender after reading your question, that's a very surprising requirement! As Jon says, please edit your question to make this clear in the text, not just implied by the column names of your new expected output. Can you also explain the meaning of the `Match` column in your expected output? Does a row with `Match = "yes"` mean that it has *at least one match* in another group? Or perhaps that it has at least one matches in *every other group*? Or something else? — Gregor Thomas, Apr 24 '19 at 17:26

Jon Spring · Answer 1 · 2019-04-24T07:06:43.657

Here's an approach using fuzzyjoin::distance_inner_join. The join should be very fast, but we'll need to filter out self-matches and coincidental matches with different Blood values.

df %>% 
  fuzzyjoin::distance_inner_join(df, by = c("x1", "x2", "x3", "x4"), 
                                max_dist = 0.02) %>%
  filter(Observation.x != Observation.y,
         Blood.x == Blood.y)

The output shows all the observations with their similar-enough matches:

   Observation.x Blood.x x1.x x2.x x3.x x4.x Observation.y Blood.y x1.y x2.y x3.y x4.y
1              1       A 0.01 0.16 0.31 0.46             2       A 0.02 0.17 0.32 0.47
2              2       A 0.02 0.17 0.32 0.47             1       A 0.01 0.16 0.31 0.46
3              2       A 0.02 0.17 0.32 0.47             3       A 0.03 0.18 0.33 0.48
4              3       A 0.03 0.18 0.33 0.48             2       A 0.02 0.17 0.32 0.47
5              4       B 0.05 0.20 0.35 0.49             5       B 0.06 0.21 0.36 0.50
6              5       B 0.06 0.21 0.36 0.50             4       B 0.05 0.20 0.35 0.49
7              8      AB 0.10 0.25 0.40 0.53             9      AB 0.11 0.26 0.41 0.54
8              9      AB 0.11 0.26 0.41 0.54             8      AB 0.10 0.25 0.40 0.53
9             10       O 0.13 0.28 0.43 0.55            11       O 0.14 0.29 0.44 0.56
10            11       O 0.14 0.29 0.44 0.56            10       O 0.13 0.28 0.43 0.55
11            11       O 0.14 0.29 0.44 0.56            12       O 0.15 0.30 0.45 0.57
12            12       O 0.15 0.30 0.45 0.57            11       O 0.14 0.29 0.44 0.56

And this output could be brought back in to get output in the requested format:

df %>% 
  fuzzyjoin::distance_inner_join(df, by = c("x1", "x2", "x3", "x4"), 
                                 max_dist = 0.02) %>%
  filter(Observation.x != Observation.y,
         Blood.x == Blood.y) %>%
  select(Observation.x, Blood.x) %>%
  rename(Observation = Observation.x,
         Blood = Blood.x) %>%
  mutate(Match = "Yes") %>%
  right_join(df) %>%
  replace_na(list(Match = "No"))

Joining, by = c("Observation", "Blood")
   Observation Blood Match   x1   x2   x3   x4
1            1     A   Yes 0.01 0.16 0.31 0.46
2            2     A   Yes 0.02 0.17 0.32 0.47
3            2     A   Yes 0.02 0.17 0.32 0.47
4            3     A   Yes 0.03 0.18 0.33 0.48
5            4     B   Yes 0.05 0.20 0.35 0.49
6            5     B   Yes 0.06 0.21 0.36 0.50
7            6     B    No 0.07 0.22 0.37 0.51
8            7    AB    No 0.09 0.24 0.39 0.52
9            8    AB   Yes 0.10 0.25 0.40 0.53
10           9    AB   Yes 0.11 0.26 0.41 0.54
11          10     O   Yes 0.13 0.28 0.43 0.55
12          11     O   Yes 0.14 0.29 0.44 0.56
13          11     O   Yes 0.14 0.29 0.44 0.56
14          12     O   Yes 0.15 0.30 0.45 0.57

will this identify 1:1:1:1 match, meaning, for every row in BloodGroup = A, finding atleast one match in BloodGroup = B, BloodGroup= AB, BloodGroup=O ? — Emily Fassbender, Apr 24 '19 at 16:31
The `max_dist` determines the threshold for a match. (I changed to 0.02 so some would not match.) So it could fine 0, 1, or multiple matches. The initial join finds matches across all blood types, but I included a filter step `Blood.x == Blood.y` on the assumption that one would not want to include matches between two observations for different types. — Jon Spring, Apr 24 '19 at 16:36

r match rows that that are similar

1 Answers1