If you are concerned about memory, then Matrix may not be the answer, for two reasons:
1. Distance matrices are not sparse. There are n*(n-1)/2 nonredundant elements in an n-by-n distance matrix, and all of them can be nonzero. Asymptotically, that is half of the elements! It is inefficient to store these data in a sparseMatrix object, because, in addition to the nonzero elements, you need to store their positions in the matrix. Two integer vectors i and j of length n*(n-1)/2 will occupy at least 4*n*(n-1) bytes in memory (~10 gigabytes when n = 5e+04).
2. Matrix implements class dspMatrix for efficient storage of dense symmetric matrices, including distance matrices. But whereas dist objects store the n*(n-1)/2 elements below the diagonal, dspMatrix objects store those elements and the diagonal elements. So you can't coerce from dist to dspMatrix without allocating 4*n*(n+1) bytes (again, ~10 gigabytes when n = 5e+04) for a new n*(n+1)/2-length double vector.
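The byte counts quoted above can be reproduced with a little arithmetic. This is only a back-of-the-envelope sketch: it assumes 4-byte integers and 8-byte doubles (as in R), ignores object headers, and the variable names are made up for illustration.

```r
n <- 5e4                     # number of observations (a double, so products cannot overflow)
p <- n * (n - 1) / 2         # number of elements stored by a 'dist' object

bytes <- c(
  dist         = 8 * p,               # one double per stored distance
  sparse_index = 2 * 4 * p,           # two integer vectors i and j (positions only)
  dspMatrix    = 8 * n * (n + 1) / 2  # lower triangle plus diagonal, as doubles
)
round(bytes / 1e9, 2)        # sizes in gigabytes
```

Note that the sparse_index entry counts only the positions; a sparse representation would need the values on top of that.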
The most efficient approach is to preserve the dist object and index it directly, as needed for whatever computation you are doing.
You can take advantage of the fact that element [i, j] in the lower triangle (i > j) of an n-by-n distance matrix is stored in element [k] of the corresponding dist object, where k = (i - 1) + (2 * (n - 1) - j) * (j - 1) / 2. As a check, [2, 1] maps to k = 1 and [n, n - 1] maps to k = n * (n - 1) / 2.
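To make the index mapping concrete, here is a small sketch. The helper name distIndex is hypothetical (it is not part of base R or Matrix). It uses the mapping in the form given in ?dist, namely n*(j-1) - j*(j-1)/2 + i - j for i > j, and checks it against as.matrix on toy data.

```r
# Hypothetical helper: position in a 'dist' vector of element [i, j], i != j
distIndex <- function(i, j, n) {
  lo <- pmin(i, j)   # column index within the lower triangle
  hi <- pmax(i, j)   # row index within the lower triangle
  n * (lo - 1) - lo * (lo - 1) / 2 + hi - lo
}

set.seed(1)
d <- dist(matrix(rnorm(12), nrow = 6))  # toy 'dist' object with n = 6
n <- attr(d, "Size")
m <- as.matrix(d)                       # full matrix, built here only for checking

# all (row, col) pairs strictly below the diagonal, in column-major order
idx <- which(lower.tri(m), arr.ind = TRUE)
k <- distIndex(idx[, "row"], idx[, "col"], n)

all.equal(d[k], m[idx], check.attributes = FALSE)
```

Because pmin and pmax sort the pair first, the same helper works for elements above the diagonal; only the diagonal itself (i == j) has no position in the dist object.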
For example, to get column (or row) j of the distance matrix specified by a dist object x without constructing the entire matrix, you could use this function:
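getDistCol <- function(x, j) {
    p <- length(x)
    n <- as.integer(round(0.5 * (1 + sqrt(1 + 8 * p)))) # p = n * (n - 1) / 2
    if (j == 1L) {
        ## First column: zero diagonal, then the first n - 1 stored elements
        return(c(0, x[seq_len(n - 1L)]))
    }
    ## Elements [j, 1], ..., [j, j - 1]: by symmetry, column j above the diagonal
    ii <- rep.int(j - 1L, j - 1L) # row index minus 1
    jj <- 1L:(j - 1L)             # column index
    if (j < n) {
        ## Elements [j + 1, j], ..., [n, j]: column j below the diagonal
        ii <- c(ii, j:(n - 1L))
        jj <- c(jj, rep.int(j, n - j))
    }
    ## Double arithmetic: the product overflows integer range once n exceeds ~46000
    kk <- ii + ((2 * (n - 1) - jj) * (jj - 1)) %/% 2
    res <- double(n)
    res[-j] <- x[kk]
    res
}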
fruits <- c("apple", "banana", "ananas", "apple", "ananas", "apple", "ananas")
x <- stringdist::stringdistmatrix(fruits)
x
##   1 2 3 4 5 6
## 2 5
## 3 5 2
## 4 0 5 5
## 5 5 2 0 5
## 6 0 5 5 0 5
## 7 5 2 0 5 0 5
getDistCol(x, 1L)
## [1] 0 5 5 0 5 0 5
lapply(1:7, getDistCol, x = x)
## [[1]]
## [1] 0 5 5 0 5 0 5
##
## [[2]]
## [1] 5 0 2 5 2 5 2
##
## [[3]]
## [1] 5 2 0 5 0 5 0
##
## [[4]]
## [1] 0 5 5 0 5 0 5
##
## [[5]]
## [1] 5 2 0 5 0 5 0
##
## [[6]]
## [1] 0 5 5 0 5 0 5
##
## [[7]]
## [1] 5 2 0 5 0 5 0
If you insist on a dspMatrix object, then you can use this method to coerce from dist:
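library("Matrix")
asDspMatrix <- function(x) {
    n <- attr(x, "Size")
    ## Positions of the diagonal elements in packed lower-triangular storage
    i <- 1L + c(0L, cumsum(n:2L))
    ## n * (n + 1) / 2 elements in total; the diagonal stays zero
    xx <- double(length(x) + n)
    xx[-i] <- x
    new("dspMatrix", uplo = "L", x = xx, Dim = c(n, n))
}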
asDspMatrix(x)
## 7 x 7 Matrix of class "dspMatrix"
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,]    0    5    5    0    5    0    5
## [2,]    5    0    2    5    2    5    2
## [3,]    5    2    0    5    0    5    0
## [4,]    0    5    5    0    5    0    5
## [5,]    5    2    0    5    0    5    0
## [6,]    0    5    5    0    5    0    5
## [7,]    5    2    0    5    0    5    0
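To see why the diagonal positions are 1 + c(0, cumsum(n:2)), it may help to rebuild a small full matrix from the packed vector by hand. This is a base R sketch only (Matrix is not needed). The toy data and variable names are illustrative, and it assumes the column-major packed lower-triangle layout that uplo = "L" implies.

```r
d <- dist(matrix(1:8, nrow = 4))   # toy 'dist' object, n = 4
n <- attr(d, "Size")

# positions of the diagonal entries in the packed lower triangle (stored by column)
i <- 1L + c(0L, cumsum(n:2L))      # 1, 5, 8, 10 when n = 4

packed <- double(length(d) + n)    # n * (n + 1) / 2 entries, diagonal left at 0
packed[-i] <- d                    # off-diagonal entries keep their 'dist' order

# unpack into a full symmetric matrix, for checking only
full <- matrix(0, n, n)
full[lower.tri(full, diag = TRUE)] <- packed
full[upper.tri(full)] <- t(full)[upper.tri(full)]

all.equal(full, as.matrix(d), check.attributes = FALSE)
```

Each column of the packed lower triangle starts with its diagonal entry, so column 1 contributes n entries, column 2 contributes n - 1, and so on. The cumulative sums of those lengths give the diagonal positions.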