I am working with a small set of observations like the ones below. My goal is to identify rows that are similar to each other based on Euclidean distance over the vector {x1, x2, x3, x4},
with a threshold of 0.2: any two rows whose distance is less than 0.2 are considered similar.
Observation Blood x1 x2 x3 x4
1 A 0.01 0.16 0.31 0.46
2 A 0.02 0.17 0.32 0.47
3 A 0.03 0.18 0.33 0.48
4 B 0.05 0.20 0.35 0.49
5 B 0.06 0.21 0.36 0.50
6 B 0.07 0.22 0.37 0.51
7 AB 0.09 0.24 0.39 0.52
8 AB 0.1 0.25 0.4 0.53
9 AB 0.11 0.26 0.41 0.54
10 O 0.13 0.28 0.43 0.55
11 O 0.14 0.29 0.44 0.56
12 O 0.15 0.3 0.45 0.57
I can do this with a very clunky double for loop, but I am wondering if there is a more efficient way to accomplish this.
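For reference, here is a minimal sketch of how the double loop could be replaced by a vectorized pairwise distance computation. It assumes Python with NumPy, which is my choice for illustration and not something fixed by the problem. The names X, THRESHOLD, similar and has_match are made up for this sketch. It only computes which pairs fall under the threshold and does not try to reproduce the Yes/No column shown further down.

```python
import numpy as np

# Feature columns x1..x4 for the 12 observations in the table above.
X = np.array([
    [0.01, 0.16, 0.31, 0.46],
    [0.02, 0.17, 0.32, 0.47],
    [0.03, 0.18, 0.33, 0.48],
    [0.05, 0.20, 0.35, 0.49],
    [0.06, 0.21, 0.36, 0.50],
    [0.07, 0.22, 0.37, 0.51],
    [0.09, 0.24, 0.39, 0.52],
    [0.10, 0.25, 0.40, 0.53],
    [0.11, 0.26, 0.41, 0.54],
    [0.13, 0.28, 0.43, 0.55],
    [0.14, 0.29, 0.44, 0.56],
    [0.15, 0.30, 0.45, 0.57],
])
THRESHOLD = 0.2

# Broadcasting builds every pairwise difference at once:
# diff[i, j, :] = X[i] - X[j], shape (12, 12, 4).
diff = X[:, None, :] - X[None, :, :]

# Euclidean distance between row i and row j, shape (12, 12).
dist = np.sqrt((diff ** 2).sum(axis=-1))

# A pair counts as similar when its distance is below the threshold.
similar = dist < THRESHOLD
np.fill_diagonal(similar, False)  # a row is not compared with itself

# True for each row that has at least one similar row.
has_match = similar.any(axis=1)
```

This keeps the whole (n, n, 4) difference array in memory, which is fine for a few thousand rows but would need chunking or a spatial index for much larger data.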
Expected Output
Observation Blood x1 x2 x3 x4 Match
1 A 0.01 0.16 0.31 0.46 Yes
2 A 0.02 0.17 0.32 0.47 Yes
3 A 0.03 0.18 0.33 0.48 No
4 B 0.05 0.20 0.35 0.49 Yes
5 B 0.06 0.21 0.36 0.50 Yes
6 B 0.07 0.22 0.37 0.51 No
7 AB 0.09 0.24 0.39 0.52 No
8 AB 0.1 0.25 0.4 0.53 Yes
9 AB 0.11 0.26 0.41 0.54 No
10 O 0.13 0.28 0.43 0.55 No
11 O 0.14 0.29 0.44 0.56 Yes
12 O 0.15 0.3 0.45 0.57 Yes
Match Dataset
RowToBeMatched FoundMatches_Bgroup_B FoundMatches_Bgroup_AB FoundMatches_Bgroup_O
1 4 8 11
2 5 NA 12
And so on...
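To make the match dataset idea concrete, here is a rough sketch of how the matches for one row could be collected per blood group. It assumes Python with NumPy, and it assumes that a row is only compared against rows from other blood groups, which is my reading of the column names above. The function name matches_by_group and its parameters are hypothetical. With the sample values and a threshold of 0.2 taken literally, the result will not necessarily agree with the illustrative entries in the table above.

```python
import numpy as np

# Blood group and x1..x4 for the 12 observations, in table order.
blood = np.array(["A"] * 3 + ["B"] * 3 + ["AB"] * 3 + ["O"] * 3)
X = np.array([
    [0.01, 0.16, 0.31, 0.46],
    [0.02, 0.17, 0.32, 0.47],
    [0.03, 0.18, 0.33, 0.48],
    [0.05, 0.20, 0.35, 0.49],
    [0.06, 0.21, 0.36, 0.50],
    [0.07, 0.22, 0.37, 0.51],
    [0.09, 0.24, 0.39, 0.52],
    [0.10, 0.25, 0.40, 0.53],
    [0.11, 0.26, 0.41, 0.54],
    [0.13, 0.28, 0.43, 0.55],
    [0.14, 0.29, 0.44, 0.56],
    [0.15, 0.30, 0.45, 0.57],
])


def matches_by_group(row, X, blood, threshold=0.2):
    """Map each other blood group to the 1-based observation numbers
    whose Euclidean distance to `row` (a 0-based index) is below threshold."""
    # Distance from the chosen row to every row, computed in one call.
    dist = np.linalg.norm(X - X[row], axis=1)
    result = {}
    for group in np.unique(blood):
        if group == blood[row]:
            continue  # assumption: skip the row's own blood group
        mask = (blood == group) & (dist < threshold)
        result[str(group)] = (np.flatnonzero(mask) + 1).tolist()
    return result


print(matches_by_group(0, X, blood))
# {'AB': [7, 8, 9], 'B': [4, 5, 6], 'O': []}
```

An empty list plays the role of NA here. Calling the function for every row and stacking the results would give one line of the match dataset per observation.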