
I have these 10 numeric vectors. For simplicity, each contains 5 elements:

a <- c(1,2,3,4,5)
b <- c(1,2,3,4,6)
c <- c(1,2,3,4,6)
d <- c(1,2,3,4,6)
e <- c(6,2,9,7,3)
f <- c(7,3,5,7,6)
g <- c(7,9,3,4,0)
h <- c(4,6,4,6,9)
i <- c(8,8,5,3,8)
j <- c(2,1,1,2,3)

I want to find the 3 most related/similar vectors. That should be vectors b, c, and d.

Additionally, I'm also hoping to get the other vector combinations besides the "most related" one (b, c, d). In this case, those could be: (a, b, c), (a, b, d), (a, c, d). Each combination should have a similarity score so I can find the most similar, the second most similar, etc.

Expected output is like this, more or less

similarity_rank   vectors   similarity_score (example)
1                 b, c, d   0.99
2                 a, b, c   0.8
etc.

My trial so far: I'm using pairwise correlation. It can find the relation between vectors, but only between 2 of them. I want a "similarity score" for those 3 vectors (or, for general purposes, n vectors).

Rules:

  • n: Number of desired vectors
  • N: Number of all vectors
  • N > n
  • All vectors are numeric

Question: What is the best method to do this? (R code would be amazing, an R package would be great, or even just the method name is enough so I can learn about it.)

isaid-hi
2 Answers


Since you said you can find the correlation between 2 vectors, you can store the correlation of every pair of your input vectors. That is an O(N^2) operation. Now that you have a score for every pair, you can create every set of three vectors, compute the average over the three pairs in each set, and output the results ranked by that average. For example, take the set (a, b, c): it has three possible pairs, (a, b), (a, c), and (b, c). Use the correlation scores of those pairs, take their average, and sort the sets by that score in descending order. That will be your result.
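A minimal sketch of the approach described above in base R, using the vectors from the question (the variable names `vecs`, `cors`, `sets`, and `res` are my own, not from the question):

```r
# Vectors from the question, collected in a named list.
vecs <- list(a = c(1,2,3,4,5), b = c(1,2,3,4,6), c = c(1,2,3,4,6),
             d = c(1,2,3,4,6), e = c(6,2,9,7,3), f = c(7,3,5,7,6),
             g = c(7,9,3,4,0), h = c(4,6,4,6,9), i = c(8,8,5,3,8),
             j = c(2,1,1,2,3))
m <- sapply(vecs, identity)     # 5 x 10 matrix, one vector per column
cors <- cor(m)                  # all pairwise correlations: O(N^2)
sets <- combn(colnames(m), 3)   # every set of n = 3 vectors
# For each set, average the correlations of its three pairs.
score <- apply(sets, 2, function(s) mean(cors[t(combn(s, 2))]))
res <- data.frame(vectors = apply(sets, 2, paste, collapse = ", "),
                  similarity_score = score)
res <- res[order(-res$similarity_score), ]  # most similar first
head(res)
```

Since b, c, and d are identical, their three pairwise correlations are all 1, so the set (b, c, d) should rank first with an average score of 1.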

goodwin
  • Oh I see the logic. For the time being, this will do. Thanks. Edit: I'm still looking for a more efficient way though – isaid-hi Oct 24 '22 at 06:02

Put each vector as columns in a matrix, then calculate the cosine similarity of each column pair using crossprod as in this answer. Then you could find the maximum n values in each column.

v <- mapply(get, letters[1:10], mode = "numeric")
crossprod(v)/(sqrt(tcrossprod(colSums(v^2))))*(1 - diag(ncol(v)))
#>           a         b         c         d         e         f         g         h         i         j
#> a 0.0000000 0.9958592 0.9958592 0.9958592 0.8062730 0.8946692 0.5415304 0.9616223 0.8162174 0.9280323
#> b 0.9958592 0.0000000 1.0000000 1.0000000 0.7636241 0.8736978 0.4943473 0.9592858 0.8106045 0.9318911
#> c 0.9958592 1.0000000 0.0000000 1.0000000 0.7636241 0.8736978 0.4943473 0.9592858 0.8106045 0.9318911
#> d 0.9958592 1.0000000 1.0000000 0.0000000 0.7636241 0.8736978 0.4943473 0.9592858 0.8106045 0.9318911
#> e 0.8062730 0.7636241 0.7636241 0.7636241 0.0000000 0.9226539 0.6904075 0.7748305 0.7656671 0.7887775
#> f 0.8946692 0.8736978 0.8736978 0.8736978 0.9226539 0.0000000 0.7374396 0.9189132 0.8929772 0.9557896
#> g 0.5415304 0.4943473 0.4943473 0.4943473 0.6904075 0.7374396 0.0000000 0.6968355 0.8281550 0.6265219
#> h 0.9616223 0.9592858 0.9592858 0.9592858 0.7748305 0.9189132 0.6968355 0.0000000 0.9292092 0.9614179
#> i 0.8162174 0.8106045 0.8106045 0.8106045 0.7656671 0.8929772 0.8281550 0.9292092 0.0000000 0.9003699
#> j 0.9280323 0.9318911 0.9318911 0.9318911 0.7887775 0.9557896 0.6265219 0.9614179 0.9003699 0.0000000

Comparing to correlation using cor:

cor(v)*(1 - 2*diag(ncol(v)))
#>             a          b          c          d           e           f          g           h          i          j
#> a -1.00000000  0.9863939  0.9863939  0.9863939 -0.05488213  0.18898224 -0.8565862  0.77151675 -0.3434014  0.5669467
#> b  0.98639392 -1.0000000  1.0000000  1.0000000 -0.15338363  0.18641093 -0.8745781  0.83712138 -0.1919465  0.6524383
#> c  0.98639392  1.0000000 -1.0000000  1.0000000 -0.15338363  0.18641093 -0.8745781  0.83712138 -0.1919465  0.6524383
#> d  0.98639392  1.0000000  1.0000000 -1.0000000 -0.15338363  0.18641093 -0.8745781  0.83712138 -0.1919465  0.6524383
#> e -0.05488213 -0.1533836 -0.1533836 -0.1533836 -1.00000000  0.45635690 -0.2276335 -0.66054273 -0.7086322 -0.2696654
#> f  0.18898224  0.1864109  0.1864109  0.1864109  0.45635690 -1.00000000 -0.4174789 -0.02916059 -0.3374632  0.6428571
#> g -0.85658615 -0.8745781 -0.8745781 -0.8745781 -0.22763353 -0.41747888 -1.0000000 -0.53565298  0.2415150 -0.6304783
#> h  0.77151675  0.8371214  0.8371214  0.8371214 -0.66054273 -0.02916059 -0.5356530 -1.00000000  0.2331471  0.6998542
#> i -0.34340141 -0.1919465 -0.1919465 -0.1919465 -0.70863219 -0.33746319  0.2415150  0.23314715 -1.0000000  0.1817109
#> j  0.56694671  0.6524383  0.6524383  0.6524383 -0.26966544  0.64285714 -0.6304783  0.69985421  0.1817109 -1.0000000
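To get ranked scores for sets of three (or n) vectors from the cosine similarity matrix, the same averaging idea from the first answer can be applied. A self-contained sketch (the names `sim`, `sets`, `score`, and `out` are my own):

```r
# Vectors from the question as columns of a matrix.
v <- cbind(a = c(1,2,3,4,5), b = c(1,2,3,4,6), c = c(1,2,3,4,6),
           d = c(1,2,3,4,6), e = c(6,2,9,7,3), f = c(7,3,5,7,6),
           g = c(7,9,3,4,0), h = c(4,6,4,6,9), i = c(8,8,5,3,8),
           j = c(2,1,1,2,3))
sim <- crossprod(v)/sqrt(tcrossprod(colSums(v^2)))  # cosine similarity
sets <- combn(colnames(v), 3)                       # every set of 3 vectors
# Average the cosine similarity over each set's three pairs.
score <- apply(sets, 2, function(x) mean(sim[t(combn(x, 2))]))
ord <- order(-score)
out <- data.frame(similarity_rank = seq_along(ord),
                  vectors = apply(sets, 2, paste, collapse = ", ")[ord],
                  similarity_score = score[ord])
head(out, 3)
```

The set (b, c, d) comes out on top with a score of 1, matching the expected output in the question.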
jblood94
  • Basically, the logic is the same as in the previous answer, right? – isaid-hi Oct 25 '22 at 04:43
  • As far as getting paired scores, yes; however, instead of correlation, this answer uses cosine similarity, which may work better for whatever it is you're doing. – jblood94 Oct 25 '22 at 14:49
  • For example, consider what would happen if `d` were changed to `c(2,3,4,5,7)`. The correlation between `b` and `d` and `c` and `d` would both still be `1`, but the cosine similarity scores would be less than 1. – jblood94 Oct 25 '22 at 15:08
  • Aah I see. I read about cosine similarity and kinda hard to spot the difference before. This is the best way so far. Thank you for the answer – isaid-hi Oct 26 '22 at 08:43