I have a data frame with many rows. Each row contains a sample ID and two values, which I am treating as x and y coordinates. I want to calculate the Euclidean distance between every pair of rows to generate a distance matrix comparing all samples. I'm having trouble using dist because it seems like I should be subdividing my data frame or comparing two separate ones. I'm not looking for comparisons of x against y; I just want the distance between each pair of samples in my data frame.

Here is an example dataframe:

sample <- c("s1","s2","s3")
x <- c(12,10,5)
y <- c(8,6,15)
df <- data.frame(sample, x, y)

From this I would like to produce a 3x3 matrix of distances. This seems like it should be easy to do; I might just be missing a keyword.

2 Answers

dist(df[,-1]) returns a matrix-like object of class "dist":

dist(df[,-1])
#           1         2
# 2  2.828427          
# 3  9.899495 10.295630
class(dist(df[,-1]))
# [1] "dist"

which prints only the lower triangle of the pairwise distances, without the diagonal.
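As a sanity check, one entry can be reproduced by hand. The sketch below rebuilds the example data from the question and computes the Euclidean distance between rows 1 and 2 directly; the name d12 is only illustrative.

```r
# Same example data as in the question
df <- data.frame(sample = c("s1", "s2", "s3"),
                 x = c(12, 10, 5),
                 y = c(8, 6, 15))

# Euclidean distance between rows 1 and 2, computed by hand
d12 <- sqrt((df$x[1] - df$x[2])^2 + (df$y[1] - df$y[2])^2)
d12
# [1] 2.828427

# It agrees with the corresponding entry returned by dist()
all.equal(d12, as.matrix(dist(df[, -1]))[2, 1])
# [1] TRUE
```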

If you want a full 3x3 matrix, including the zero diagonal and the mirrored upper triangle, convert the result with as.matrix:

as.matrix(dist(df[,-1]))
#          1         2         3
# 1 0.000000  2.828427  9.899495
# 2 2.828427  0.000000 10.295630
# 3 9.899495 10.295630  0.000000

class(as.matrix(dist(df[,-1])))
# [1] "matrix" "array" 

The result is an ordinary matrix that can be indexed like any other.
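If you would rather see the sample IDs than row numbers, one option (a sketch, not the only way) is to set them as row names before calling dist; the labels are then carried through to the matrix. The names coords and m are only illustrative.

```r
# Same example data as in the question
df <- data.frame(sample = c("s1", "s2", "s3"),
                 x = c(12, 10, 5),
                 y = c(8, 6, 15))

# Keep only the coordinate columns and label the rows with the sample IDs
coords <- df[, c("x", "y")]
rownames(coords) <- df$sample

# dist() keeps the row names as labels; as.matrix() turns them into dimnames
m <- as.matrix(dist(coords))

# Look up a distance by sample ID
m["s1", "s3"]
# [1] 9.899495
```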

r2evans

You can also call dist on a matrix whose row names are the sample IDs, with diag = TRUE and upper = TRUE:

with(
    df,
    dist(`row.names<-`(cbind(x, y), sample), diag = TRUE, upper = TRUE)
)

which gives

          s1        s2        s3
s1  0.000000  2.828427  9.899495
s2  2.828427  0.000000 10.295630
s3  9.899495 10.295630  0.000000
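Note that diag and upper only change how the "dist" object is printed; the object itself still stores just the three pairwise distances. If you need to index it like a matrix, as.matrix is still required. A minimal sketch using the same data (the names xy, d and m are illustrative):

```r
# Same example data as in the question
df <- data.frame(sample = c("s1", "s2", "s3"),
                 x = c(12, 10, 5),
                 y = c(8, 6, 15))

# Coordinate matrix labeled by sample ID
xy <- as.matrix(df[, c("x", "y")])
rownames(xy) <- df$sample

d <- dist(xy, diag = TRUE, upper = TRUE)

class(d)
# [1] "dist"
length(d)   # only the 3 lower-triangle distances are stored
# [1] 3

# Convert to get a real 3x3 matrix with s1..s3 as row and column names
m <- as.matrix(d)
dim(m)
# [1] 3 3
```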
ThomasIsCoding