I want to calculate the Euclidean distance between 12 populations. Each population has 20 samples, and each sample is measured for 100 genes (these are microarray data; the numbers here are just examples).

The equation I found is:

distance = sqrt( sum_{i=1..n} (mean(xi) - mean(yi))^2 / n )

where xi and yi are the expression values of gene i in two populations with p and q samples respectively, (x1, x2, …, xp) and (y1, y2, …, yq), and n is the number of genes.

Part of the data is pasted below:

row.names  pop1.1   pop1.2   pop1.3   pop1.4   pop2.1   pop2.2   pop2.3   pop2.4
7A5        5.38194  4.06191  4.88044  5.60383  6.23101  6.53738  4.80336  5.86136
A1BG       5.15155  4.29441  4.59131  4.90026  4.62908  4.48712  4.73039  4.46208
A1CF       4.22396  4.14451  4.41465  3.93179  4.89638  4.66109  4.20918  4.48107
A26C3      12.1969  12.4179  10.9786  11.7659  11.405   11.7594  11.1757  11.8128

How might one calculate these distances in R with this data structure?

Hack-R
  • http://stackoverflow.com/questions/5559384/how-to-find-the-euclidean-distance-of-two-vectors-in-r – Marco Aurelio Sep 04 '14 at 18:44
  • We don't do favours here; we suggest answers to precise problems. You could, for example, try googling. After all, you 'found' the distance equation. – user3791372 Sep 04 '14 at 18:44
  • The paste of the data is nice but use of a directly reproducible dataset would be much nicer – Hack-R Sep 04 '14 at 19:16
  • Re-open note: neither of the proposed duplicates (the linked one, which had been closed as a duplicate, or the link from that one) addresses the question of how to create all the pairwise distances. – IRTFM Sep 04 '14 at 21:55

1 Answer

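A Euclidean distance is the square root of the sum of the squared differences over each dimension (NOT of the differences in averages). Assuming dat is a data frame with the gene names in its first column, something like this gives the distance for every pair of sample columns: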

 # every pair of sample columns (column 1 holds the gene names)
 pairs <- combn(2:length(dat), 2)
 apply(pairs, 2, function(x) {
   sqrt(sum((dat[[x[1]]] - dat[[x[2]]])^2))   # distance for one pair
 })
#------------------
 [1] 1.5913270 1.4442952 0.6192787 1.4398434 1.4693528 1.2470760
 [7] 0.9585670 1.7037315 1.7930213 2.5314568 2.6202225 1.5123061
[13] 1.9353429 1.2131522 1.4964447 1.8511313 0.3261149 1.2958417
[19] 1.2359515 1.2546398 1.0463696 0.7498210 0.5431169 1.6041417
[25] 0.7094458 1.9002458 0.7020604 1.2927393
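
As a cross-check, base R's dist() should give the same numbers once the data are transposed so that samples are rows. This is only a minimal sketch: it assumes the four genes pasted in the question are stored in a data frame dat with the gene names in the first column (the column name gene is my own label, not something from the question).

```r
# Four genes copied from the question; 'gene' is an assumed column name.
dat <- data.frame(
  gene   = c("7A5", "A1BG", "A1CF", "A26C3"),
  pop1.1 = c(5.38194, 5.15155, 4.22396, 12.1969),
  pop1.2 = c(4.06191, 4.29441, 4.14451, 12.4179),
  pop1.3 = c(4.88044, 4.59131, 4.41465, 10.9786),
  pop1.4 = c(5.60383, 4.90026, 3.93179, 11.7659),
  pop2.1 = c(6.23101, 4.62908, 4.89638, 11.405),
  pop2.2 = c(6.53738, 4.48712, 4.66109, 11.7594),
  pop2.3 = c(4.80336, 4.73039, 4.20918, 11.1757),
  pop2.4 = c(5.86136, 4.46208, 4.48107, 11.8128)
)

# dist() works on rows, so transpose: rows = samples, columns = genes.
m <- t(as.matrix(dat[-1]))
d <- dist(m, method = "euclidean")
print(d)   # labelled lower triangle of pairwise distances

# Same pairs computed by hand with combn(), for comparison.
manual <- apply(combn(2:length(dat), 2), 2, function(x) {
  sqrt(sum((dat[[x[1]]] - dat[[x[2]]])^2))
})

# dist() stores pairs in the same order as combn(), so the two should agree.
all.equal(as.vector(d), manual)
```

The printed dist object carries the sample names, so it also shows which columns each distance belongs to; as.matrix(d) turns it into a full symmetric matrix if that is easier to index.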

If you want to know which columns are being compared:

 pairs <- combn(2:length(dat), 2)   # column-index pairs, one per column
 # first two columns: indices of the compared columns; third: their distance
 cbind(t(pairs),
       apply(pairs, 2, function(x) {
         sqrt(sum((dat[[x[1]]] - dat[[x[2]]])^2))
       }))
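
If what you really want is one distance per pair of populations, using the formula from the question (the root-mean-square difference between per-gene population means, which is not the same as the sample-level Euclidean distance above), a rough sketch could look like the following. It assumes the column names follow the pop1.1, pop1.2, ... pattern from the pasted data, so the population label is whatever precedes the last dot; expr, pop, pop_means and pop_dist are names I made up for illustration.

```r
# Four genes copied from the question, as a genes x samples matrix.
expr <- matrix(
  c(5.38194, 4.06191, 4.88044, 5.60383, 6.23101, 6.53738, 4.80336, 5.86136,
    5.15155, 4.29441, 4.59131, 4.90026, 4.62908, 4.48712, 4.73039, 4.46208,
    4.22396, 4.14451, 4.41465, 3.93179, 4.89638, 4.66109, 4.20918, 4.48107,
    12.1969, 12.4179, 10.9786, 11.7659, 11.405,  11.7594, 11.1757, 11.8128),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    c("7A5", "A1BG", "A1CF", "A26C3"),
    c(paste0("pop1.", 1:4), paste0("pop2.", 1:4))
  )
)

# Assumption: population label = column name minus the trailing ".<replicate>".
pop <- sub("\\.[0-9]+$", "", colnames(expr))

# Per-gene mean within each population: genes x populations.
pop_means <- sapply(split(seq_along(pop), pop), function(idx) {
  rowMeans(expr[, idx, drop = FALSE])
})

# Question's formula: sqrt( sum((mean(xi) - mean(yi))^2) / n ), n = number of genes.
# dist() gives sqrt(sum(...)), so dividing by sqrt(n) applies the 1/n inside the root.
pop_dist <- dist(t(pop_means)) / sqrt(nrow(pop_means))
print(pop_dist)
```

With 12 populations this would return all 66 pairwise values in one dist object. Populations with different numbers of samples (p and q in the question) are handled the same way, since only the per-gene means enter the formula.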
IRTFM