
I have a data frame called mydf with three principal components (PCA.1, PCA.2, PCA.3). I want to compute the Euclidean distance matrix in this 3D space and find, for each sample, the sample at the shortest Euclidean distance. In another data frame called myref, some samples have a known identity and others are unknown. Using the shortest Euclidean distances computed from mydf, I want to assign a known identity to each unknown sample. Can someone please help me get this done?

mydf

mydf <- structure(list(Sample = c("1", "2", "4", "5", "6", "7", "8", 
"9", "10", "12"), PCA.1 = c(0.00338, -0.020373, -0.019842, -0.019161, 
-0.019594, -0.019728, -0.020356, 0.043339, -0.017559, -0.020657
), PCA.2 = c(0.00047, -0.010116, -0.011532, -0.011582, -0.013245, 
-0.011751, -0.010299, -0.005801, -0.01, -0.011334), PCA.3 = c(-0.008787, 
0.001412, 0.003751, 0.00371, 0.004242, 0.003738, 0.000592, -0.037229, 
0.004307, 0.00339)), .Names = c("Sample", "PCA.1", "PCA.2", "PCA.3"
), row.names = c(NA, 10L), class = "data.frame")

myref

myref<- structure(list(Sample = c("1", "2", "4", "5", "6", "7", "8", 
"9", "10", "12"), Identity = c("apple", "unknown", "ball", "unknown", 
"unknown", "car", "unknown", "cat", "unknown", "dog")), .Names = c("Sample", 
"Identity"), row.names = c(NA, 10L), class = "data.frame")
MAPK

1 Answer

uIX = which(myref$Identity == "unknown") # Row indices of the unknown samples
dMat = as.matrix(dist(mydf[, -1])) # Calculate the Euclidean distance matrix
nn = apply(dMat, 1, order)[2, ] # For each row of dMat, order the distances in increasing order.
                                # Select the nearest neighbor (row 2 of the result, because row 1 is the sample itself)
myref$Identity[uIX] = myref$Identity[nn[uIX]] # Copy the neighbor's identity to each unknown sample

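Note that the above code will leave some identities set to unknown, because the nearest neighbor of an unknown sample can itself be unknown. If instead you want to match to the nearest neighbor with a known identity, mask the unknown-to-unknown distances right after computing dMat. This also sets the self-distance of each unknown sample to Inf, so the sample no longer sorts first and you need to take row 1 of the ordering instead of row 2:

dMat[uIX, uIX] = Inf
nn = apply(dMat, 1, order)[1, ]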
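
For reference, here is a minimal self-contained sketch of the "nearest neighbor with a known identity" variant. It is one possible way to write it, not the only one. The helper name assign_nearest_known and its arguments are made up for illustration, and it assumes that the rows of mydf and myref are in the same order (as they are in the question).

```r
# Data copied from the question
mydf <- data.frame(
  Sample = c("1", "2", "4", "5", "6", "7", "8", "9", "10", "12"),
  PCA.1 = c(0.00338, -0.020373, -0.019842, -0.019161, -0.019594,
            -0.019728, -0.020356, 0.043339, -0.017559, -0.020657),
  PCA.2 = c(0.00047, -0.010116, -0.011532, -0.011582, -0.013245,
            -0.011751, -0.010299, -0.005801, -0.01, -0.011334),
  PCA.3 = c(-0.008787, 0.001412, 0.003751, 0.00371, 0.004242,
            0.003738, 0.000592, -0.037229, 0.004307, 0.00339),
  stringsAsFactors = FALSE
)
myref <- data.frame(
  Sample = mydf$Sample,
  Identity = c("apple", "unknown", "ball", "unknown", "unknown",
               "car", "unknown", "cat", "unknown", "dog"),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not part of the original answer).
# Assumes row i of coords and element i of identity describe the same sample.
assign_nearest_known <- function(coords, identity, unknown_label = "unknown") {
  is_unknown <- identity == unknown_label
  if (!any(is_unknown)) return(identity)   # nothing to assign

  d <- as.matrix(dist(coords))             # full Euclidean distance matrix

  # Keep only distances from unknown samples (rows) to known samples (columns)
  d_uk <- d[is_unknown, !is_unknown, drop = FALSE]

  # For each unknown sample, position of the closest known sample
  nearest <- apply(d_uk, 1, which.min)

  identity[is_unknown] <- identity[!is_unknown][nearest]
  identity
}

original_identity <- myref$Identity
myref$Identity <- assign_nearest_known(
  mydf[, c("PCA.1", "PCA.2", "PCA.3")],
  myref$Identity
)
print(myref)
```

Restricting the matrix to unknown rows and known columns avoids the Inf masking entirely, so which.min can be used directly and no self-distance has to be skipped.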
jMathew
  • Why does it set some to unknown? Could you explain your code please? – MAPK Mar 08 '16 at 11:29
  • I have added some comments. Hope they explain the code. – jMathew Mar 09 '16 at 05:57
  • If you calculate the distances between the rows of `mydf`, you will see that some of the nearest neighbors are `unknown`. For example, the nearest neighbor of Sample 2 is Sample 8, which is `unknown`. – jMathew Mar 09 '16 at 06:02