
I have a vector of distances, produced by some other procedure, and I want to convert it to a dist object in R.

Below is an example of what such a vector looks like: distVector is computed in the same way as that other procedure computes its distance vector. Ideally, I would like to turn this vector into a distance matrix (dist object) without wasting resources.

I think I could just turn it into a full matrix by copying it into the upper and lower triangles, setting the diagonal to 0, and dealing with the fact that it looks sort of upside down compared to the dist object structure (compare the outputs below). Then again, first creating a full matrix and then (probably?) reducing it to a vector again inside the dist object seems wasteful to me. Is there a better way?

Example code (note: I cannot change how distVector is computed):

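rawData <- matrix(c(1,1,1, 1.1,1,1, 1,1,1.2, 2,2,2, 2.2,2,2, 2,2.2,2.2, 3,3,3, 3.4,3,3), ncol=3, byrow=TRUE)

n <- nrow(rawData)
distVector <- numeric(0)
# Stop the outer loop at n-1: with i = n, (i+1):n counts down to c(n+1, n)
# and rawData[n+1,] fails with "subscript out of bounds".
for (i in seq_len(n - 1)) {
  for (j in (i + 1):n) {
    a <- rawData[i,] - rawData[j,]                # difference between rows i and j
    distVector <- c(distVector, sqrt(a %*% a))    # append the Euclidean distance
  }
}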

print(distVector)
print(dist(rawData))

Output (compare distVector to the output of the dist function; it looks upside down):

> print(distVector)
 [1] 0.1000000 0.2000000 1.7320508 1.8547237 1.9697716 3.4641016 3.7094474 0.2236068 1.6763055
[10] 1.7916473 1.9209373 3.4073450 3.6455452 1.6248077 1.7549929 1.8547237 3.3526109 3.6055513
[19] 0.2000000 0.2828427 1.7320508 1.9899749 0.3464102 1.6248077 1.8547237 1.5099669 1.8000000
[28] 0.4000000

> print(dist(rawData))
          1         2         3         4         5         6         7
2 0.1000000                                                            
3 0.2000000 0.2236068                                                  
4 1.7320508 1.6763055 1.6248077                                        
5 1.8547237 1.7916473 1.7549929 0.2000000                              
6 1.9697716 1.9209373 1.8547237 0.2828427 0.3464102                    
7 3.4641016 3.4073450 3.3526109 1.7320508 1.6248077 1.5099669          
8 3.7094474 3.6455452 3.6055513 1.9899749 1.8547237 1.8000000 0.4000000

Many thanks, Thomas.

Thomas Weise
  • Creating the matrix seems reasonable to me ... `mat <- matrix(NA, ncol=dim(rawData)[1], nrow=dim(rawData)[1]) ; mat[lower.tri(mat)] <- distVector` – user20650 Dec 16 '15 at 02:46
  • That's a short and nice answer. I am not yet familiar with `R`, so I wonder what the memory consumption of this would be. Will this create a full m*m matrix, fill its lower triangle with my data, and then extract this lower triangle again in my subsequent `as.dist` call? Or would it just somehow create an empty container for an m*m matrix, not consuming much memory? Either way, your method seems feasible to me. Thanks ^_^ – Thomas Weise Dec 16 '15 at 02:50
  • Sorry, I don't know about R's memory usage / copying of objects etc. There are quite a few questions on SO that have looked at such things though; [this](http://stackoverflow.com/questions/23898969/is-data-really-copied-four-times-in-rs-replacement-functions) was my first search hit. https://stat.ethz.ch/R-manual/R-devel/library/base/html/tracemem.html might be useful – user20650 Dec 16 '15 at 02:57
  • You could hide some of the looping code with `combn(1:nrow(rawData),2, function(x) {o <- rawData[x[1],] - rawData[x[2],]; c(sqrt(o %*% o))} )` – thelatemail Dec 16 '15 at 03:03
  • @user20650: ugh, that's much to read ^_^. Well, unless there is a simpler, non-matrix-creating solution, I will take your suggestion, as it is very compact code-wise. In case no non-matrix-creating solution shows up until tomorrow, if you want you can copy-paste your comment as an answer and I will accept it. (Basically, it is an answer.) – Thomas Weise Dec 16 '15 at 03:08
  • @thelatemail: I get `distVector` from elsewhere; the code here is just an example so people can see what it looks like and what its elements represent. Thus, this part of the code does not need to be optimized. Still, being new to `R`, I learned something useful and new to me from your comment. Thanks for showing the `combn` function and this inline-function-definition method :-) – Thomas Weise Dec 16 '15 at 03:10
  • ha.. lots to read but it's good stuff ;p. It might be worth emphasising in your question title that you are looking for an efficient way. Some people on the SO r tag are knowledgeable about such things – user20650 Dec 16 '15 at 03:14
  • @user20650: noted, title changed. Thanks again. – Thomas Weise Dec 16 '15 at 03:19
  • Another [thread](http://stackoverflow.com/questions/34281593/large-distance-matrix-in-clustering/34281930#34281930) may be helpful; in particular, the `dist` computation can easily be run multithreaded. – Patric Dec 16 '15 at 04:51
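
A possible non-matrix route, offered only as a sketch: a dist object is an ordinary numeric vector holding the lower triangle column by column, plus a few attributes (`Size`, `Diag`, `Upper`) and the class `"dist"`. The loop in the question emits the pairs (1,2), (1,3), ..., (n-1,n), which is that same order, so the vector should not need reordering; the printed triangle only makes it look different. The helper name `vector_to_dist` below is made up for illustration and is not part of base R, and the sketch assumes the incoming vector always uses this pair order.

```r
# Example data from the question.
rawData <- matrix(c(1, 1, 1,  1.1, 1, 1,  1, 1, 1.2,  2, 2, 2,
                    2.2, 2, 2,  2, 2.2, 2.2,  3, 3, 3,  3.4, 3, 3),
                  ncol = 3, byrow = TRUE)
n <- nrow(rawData)

# Rebuild distVector in the question's pair order: (1,2), (1,3), ..., (n-1,n).
distVector <- numeric(0)
for (i in seq_len(n - 1)) {
  for (j in (i + 1):n) {
    a <- rawData[i, ] - rawData[j, ]
    distVector <- c(distVector, sqrt(sum(a * a)))
  }
}

# Hypothetical helper (not a base R function): wrap a distance vector as a
# "dist" object by attaching the attributes the dist methods look for.
# If size is not given, it is solved from length(v) = size * (size - 1) / 2.
vector_to_dist <- function(v, size = (1 + sqrt(1 + 8 * length(v))) / 2) {
  stopifnot(size == round(size), length(v) == size * (size - 1) / 2)
  structure(v, Size = as.integer(size), Diag = FALSE, Upper = FALSE,
            class = "dist")
}

d <- vector_to_dist(distVector)
print(d)

# Sanity checks against dist() on the same data.
ref <- dist(rawData)
stopifnot(isTRUE(all.equal(as.vector(d), as.vector(ref))))
stopifnot(isTRUE(all.equal(as.matrix(d), as.matrix(ref),
                           check.attributes = FALSE)))
```

Compared with filling an n-by-n matrix and calling `as.dist`, this keeps only the n(n-1)/2 values and never allocates the n*n intermediate. R may still copy the vector once when the attributes are attached, so `tracemem` is worth a try if that matters for large inputs.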
