I have a huge dataset that looks like this. To save memory, I want to calculate the pairwise distances but leave the upper triangle of the matrix as NULL.

library(tidyverse)
library(stringdist)
#> 
#> Attaching package: 'stringdist'
#> The following object is masked from 'package:tidyr':
#> 
#>     extract

df3 <- tibble(fruits=c("apple","banana","ananas","apple","ananas","apple","ananas"),
              position=c("135","135","135","136","137","138","138"), 
              counts = c(100,200,100,30,40,50,100))

stringdistmatrix(df3$fruits, method=c("osa"), nthread = 4) %>% 
  as.matrix()
#>   1 2 3 4 5 6 7
#> 1 0 5 5 0 5 0 5
#> 2 5 0 2 5 2 5 2
#> 3 5 2 0 5 0 5 0
#> 4 0 5 5 0 5 0 5
#> 5 5 2 0 5 0 5 0
#> 6 0 5 5 0 5 0 5
#> 7 5 2 0 5 0 5 0

Created on 2022-03-01 by the reprex package (v2.0.1)

However, when I convert my stringdistmatrix result to a matrix (this step is essential for me), the upper triangle gets filled with numbers.

Is there any way to convert to a matrix but keep the upper triangle as NULL and save memory?

I want my data to look like this:

  1 2 3 4 5 6
2 5          
3 5 2        
4 0 5 5      
5 5 2 0 5    
6 0 5 5 0 5  
7 5 2 0 5 0 5
LDT

2 Answers

If you are concerned about memory, then Matrix may not be the answer, for two reasons:

  • Distance matrices are not sparse. There are n*(n-1)/2 nonredundant elements in an n-by-n distance matrix, and all of them can be nonzero. Asymptotically, that is half of the elements! It is inefficient to store these data in a sparseMatrix object, because, in addition to the nonzero elements, you need to store their positions in the matrix. Two integer vectors i and j of length n*(n-1)/2 will occupy at least 4*n*(n-1) bytes in memory (~10 gigabytes when n = 5e+04).

  • Matrix implements class dspMatrix for efficient storage of dense symmetric matrices, including distance matrices. But whereas dist objects store the n*(n-1)/2 elements below the diagonal, dspMatrix objects store those elements and the diagonal elements. So you can't coerce from dist to dspMatrix without allocating 4*n*(n+1) bytes (again, ~10 gigabytes when n = 5e+04) for a new n*(n+1)/2-length double vector.
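
The byte counts in the two points above can be reproduced with a little arithmetic. This is only a rough sketch: it assumes 8 bytes per double and 4 bytes per integer, ignores object headers, and the variable names are made up for illustration.

```r
# Rough storage estimates in bytes; n is the number of strings compared.
n <- 5e4
n_pairs <- n * (n - 1) / 2            # nonredundant distances

bytes_dist  <- 8 * n_pairs            # dist: one double per pair
bytes_index <- 2 * 4 * n_pairs        # two integer index vectors (i and j)
bytes_dsp   <- 8 * n * (n + 1) / 2    # dspMatrix: the pairs plus the diagonal
bytes_full  <- 8 * n^2                # ordinary dense matrix from as.matrix()

# Report everything in gigabytes (1e9 bytes)
round(c(dist = bytes_dist, index = bytes_index,
        dsp = bytes_dsp, full = bytes_full) / 1e9, 1)
```

The index vectors alone cost about as much as the distances themselves, which is why a sparse representation does not pay off for dense data.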

The most efficient approach is to preserve the dist object and index it directly, as needed for whatever computation you are doing. You can take advantage of the fact that element [i, j] (with i > j) in the lower triangle of an n-by-n distance matrix is stored in element [k] of the corresponding dist object, where k = (i - 1) + (2 * (n - 1) - j) * (j - 1) / 2.
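
As a minimal sketch of that index mapping, the helper below converts a (row, column) pair of the full matrix into a position in the dist vector and checks the result against as.matrix() on a small base R example. The name distIndex is hypothetical (it is not part of base R or stringdist), and i and j are taken to be the row and column of the full n-by-n matrix with i > j.

```r
# distIndex() is a hypothetical helper, not part of base R or stringdist.
# i and j are the row and column of the full n-by-n matrix, with i > j.
distIndex <- function(i, j, n) {
    stopifnot(all(i > j), all(j >= 1), all(i <= n))
    # numeric (double) literals keep the product from overflowing for large n
    (i - 1) + (2 * (n - 1) - j) * (j - 1) / 2
}

# Check against the full matrix on a small example
d <- dist(c(1, 4, 9, 16, 25))                # base R dist object, n = 5
m <- as.matrix(d)
n <- attr(d, "Size")
idx <- which(lower.tri(m), arr.ind = TRUE)   # (row, col) pairs, column by column
k <- distIndex(idx[, "row"], idx[, "col"], n)
all.equal(d[k], m[lower.tri(m)])
```

Because dist objects store the lower triangle column by column, the computed positions for this example simply run from 1 to length(d).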

For example, to get column (or row) j of the distance matrix specified by a dist object x without constructing the entire matrix, you could use this function:

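getDistCol <- function(x, j) {
    p <- length(x)
    n <- as.integer(round(0.5 * (1 + sqrt(1 + 8 * p)))) # p = n * (n - 1) / 2
    if (j == 1L) {
        # column 1: the diagonal zero followed by the first n - 1 elements
        return(c(0, x[seq_len(n - 1L)]))
    }
    # ii holds (row index - 1), jj the column index, of each stored element
    ii <- rep.int(j - 1L, j - 1L)   # entries [j, 1], ..., [j, j - 1]
    jj <- 1L:(j - 1L)
    if (j < n) {
        ii <- c(ii, j:(n - 1L))     # entries [j + 1, j], ..., [n, j]
        jj <- c(jj, rep.int(j, n - j))
    }
    # double arithmetic avoids integer overflow when n is large
    kk <- ii + ((2 * (n - 1) - jj) * (jj - 1)) / 2
    res <- double(n)
    res[-j] <- x[kk]                # the diagonal element res[j] stays 0
    res
}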
fruits <- c("apple", "banana", "ananas", "apple", "ananas", "apple", "ananas")
x <- stringdist::stringdistmatrix(fruits)
x
##   1 2 3 4 5 6
## 2 5          
## 3 5 2        
## 4 0 5 5      
## 5 5 2 0 5    
## 6 0 5 5 0 5  
## 7 5 2 0 5 0 5

getDistCol(x, 1L)
## [1] 0 5 5 0 5 0 5

lapply(1:7, getDistCol, x = x)
## [[1]]
## [1] 0 5 5 0 5 0 5
## 
## [[2]]
## [1] 5 0 2 5 2 5 2
## 
## [[3]]
## [1] 5 2 0 5 0 5 0
## 
## [[4]]
## [1] 0 5 5 0 5 0 5
## 
## [[5]]
## [1] 5 2 0 5 0 5 0
## 
## [[6]]
## [1] 0 5 5 0 5 0 5
## 
## [[7]]
## [1] 5 2 0 5 0 5 0

If you insist on a dspMatrix object, then you can use this method to coerce from dist:

library("Matrix")
asDspMatrix <- function(x) {
    n <- attr(x, "Size")
    i <- 1L + c(0L, cumsum(n:2L))
    xx <- double(length(x) + n)
    xx[-i] <- x
    new("dspMatrix", uplo = "L", x = xx, Dim = c(n, n))
}
asDspMatrix(x)
## 7 x 7 Matrix of class "dspMatrix"
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,]    0    5    5    0    5    0    5
## [2,]    5    0    2    5    2    5    2
## [3,]    5    2    0    5    0    5    0
## [4,]    0    5    5    0    5    0    5
## [5,]    5    2    0    5    0    5    0
## [6,]    0    5    5    0    5    0    5
## [7,]    5    2    0    5    0    5    0
Mikael Jagan

I think you may need to use sparse matrices, which the Matrix package provides. You can learn more about sparse matrices at: Sparse matrix

library(Matrix)

m <- sparseMatrix(i = c(1:3, 2:3, 3), j = c(1:3, 1:2, 1), x = 1, triangular = TRUE)

m

#> 3 x 3 sparse Matrix of class "dtCMatrix"
#>           
#> [1,] 1 . .
#> [2,] 1 1 .
#> [3,] 1 1 1

To check the size of the matrices, one can use function object.size.
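
As a small sketch of that check using base R only, the snippet below compares a dist object with the full matrix obtained from it. The random points are illustrative data, not the data from the question.

```r
# Compare a dist object with its full-matrix form (base R only).
set.seed(1)
pts <- matrix(rnorm(2000), ncol = 2)   # 1000 illustrative points in 2 columns
d <- dist(pts)                          # stores n * (n - 1) / 2 doubles
m <- as.matrix(d)                       # stores n * n doubles plus dimnames

object.size(d)
object.size(m)
```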

It seems that for small matrices, using sparse matrices makes no difference, but for larger matrices the memory savings grow (about 24% at n = 300 in the example below):

library(Matrix)

n <- 30
m1 <- matrix(1,n,n)
m2 <- Matrix(m1, sparse = TRUE) 

object.size(m1)
#> 7416 bytes

object.size(m2)
#> 7432 bytes

n <- 300
m1 <- matrix(1,n,n)
m2 <- Matrix(m1, sparse = TRUE) 

object.size(m1)
#> 720216 bytes

object.size(m2)
#> 544728 bytes
PaulS