How to find and name contiguous non-zero entries in a sparse matrix in R?

Question

My problem is conceptually simple. I am looking for a computationally efficient solution of it (my own one I attach at the end).

Suppose we have a potentially very large sparse matrix like the one on the left below and want to 'name' every area of contiguous non-zero elements with a separate code (see matrix on the right)

1 1 1 . . . . .          1 1 1 . . . . .
1 1 1 . 1 1 . .          1 1 1 . 4 4 . .
1 1 1 . 1 1 . .          1 1 1 . 4 4 . .
. . . . 1 1 . .   --->   . . . . 4 4 . .
. . 1 1 . . 1 1          . . 3 3 . . 7 7
1 . 1 1 . . 1 1          2 . 3 3 . . 7 7
1 . . . 1 . . .          2 . . . 5 . . .
1 . . . . 1 1 1          2 . . . . 6 6 6

In my application the contiguous elements would form rectangles, lines or single points and they can only touch each other with the vertexes (i.e. there would be no irregular/non rectangular areas in the matrix).

The solution I imagined is to match the row and column indexes of the sparse matrix representation to a vector with the appropriate values (the 'name' codes). My solution uses several for loops and works fine for small to medium matrices but will fast get stuck in the loops as the dimensions of the matrix become large (>1000). It probably depends from the fact I'm not so advanced in R programming - I couldn't find any computational trick/function to solve it better.

Can anybody suggest a computationally more efficient way to do that in R?

My solution:

mySolution <- function(X){

  if (class(X) != "ngCMatrix") {stop("Input must be a Sparse Matrix")}
  ind <- which(X == TRUE, arr.ind = TRUE)
  r <- ind[,1]
  c <- ind[,2]

  lr <- nrow(ind)
  for (i in 1:lr) {
    if(i == 1) {bk <- 1}
    else {
      if (r[i]-r[i-1] == 1){bk <- c(bk, bk[i-1])}
      else {bk <- c(bk, bk[i-1]+1)}
    }
  }

  for (LOOP in 1:(lr-1)) {
    tr <- r[LOOP]
    tc <- c[LOOP]
    for (j in (LOOP+1):lr){
      if (r[j] == tr) {
        if(c[j] == tc + 1) {bk[j] <- bk[LOOP]} 
      }
    }
  }

  val <- unique(bk)
  for (k in 1:lr){
    bk[k] <- which(val==bk[k])
  }

  return(sparseMatrix(i = r, j = c, x = bk))
}

Thanks in advance for any help or pointer.

Might be a good use case for `Rcpp`. Your code seems straightforward enough to translate — Aurèle, Feb 03 '17 at 14:52
@apom you might be right but I'm skimming through `Rcpp` vignettes right now, and if I need to know `C++` language as it seems, that's not for me unfortunately ... — GiuGe, Feb 03 '17 at 15:20
Assuming `m` is your "ngCMatrix", you could try finding a way to make something like `sm = summary(m); sparseMatrix(i = sm$i, j = sm$j, x = cutree(hclust(dist(sm, "maximum")), h = 2))` work appropriately — alexis_laz, Feb 03 '17 at 15:30
@alexis_laz, I'm impressed - so elegant. Chapeau, I would have never found such a solution. I'm running a few simulations but it seems to work. Thank you for the enlightenment. :) — GiuGe, Feb 04 '17 at 11:58
... well, actually I talked a bit too early. I will still try to make it work appropriately but the clustering of areas joined by the vertex, or even larger than distance = 2 from border to border, simply misbehave... — GiuGe, Feb 05 '17 at 12:57
@GiuGe : I know; couldn' find a way to make it work. On the same note, taking into account the "only rectangles" condition and, computing a custom distance `sm = as.matrix(summary(m)); d = as.dist(sapply(1:nrow(sm), function(i) rowSums(abs(sm[i, col(sm)] - sm))))`, using `sparseMatrix(i = sm[, "i"], j = sm[, "j"], x = cutree(hclust(d, "single"), h = 1))` seems to work fine for the current example, though I tend to believe that there 'll be many cases that break it. — alexis_laz, Feb 05 '17 at 18:10
@alexis_laz Bingo! This is it! You can change the computation of `d` into `d <- dist(sm, method = "manhattan")` because this is basically what you're doing there. Single linkage aggregation and a cut of the three at height 1 effectively distinguish areas which 'touch' each other at the vertex (and hence are 2 steps away in term of distance): as I said in my question this is the situation I have in my data, so this is fine. A couple of simulations with my function and your method say your method is faster by a factor of 4 with matrices 200x200. Why don't you publish your answer below? Thank you — GiuGe, Feb 06 '17 at 12:10

alexis_laz · Accepted Answer · 2017-02-10T14:29:06.820

Relying heavily on the fact that all neighbouring elements to be grouped form only rectangles/lines/points, we see that elements of the matrix can be aggregated based on their [row, col] indices on the matrix by the relation (abs(row1 - row2) + abs(col1 - col2)) < 2.

So, starting with the [row, col] indices:

sm = as.matrix(summary(m))

We compute their distance, which -as noted by GiuGe- is actually the "manhattan" method:

d = dist(sm, "manhattan")

Single-linkage's property in clustering elements on their nearest neighbour is useful here. Also, we can get a grouping of the elements by cutreeing on "h = 1" (where the distance of indices is "< 2"):

gr = cutree(hclust(d, "single"), h = 1)

Finally, we can wrap the above in a new sparse matrix:

sparseMatrix(i = sm[, "i"], j = sm[, "j"], x = gr)
#8 x 8 sparse Matrix of class "dgCMatrix"
#                    
#[1,] 1 1 1 . . . . .
#[2,] 1 1 1 . 4 4 . .
#[3,] 1 1 1 . 4 4 . .
#[4,] . . . . 4 4 . .
#[5,] . . 3 3 . . 7 7
#[6,] 2 . 3 3 . . 7 7
#[7,] 2 . . . 5 . . .
#[8,] 2 . . . . 6 6 6

"m" used is:

library(Matrix)
m = new("ngCMatrix"
    , i = c(0L, 1L, 2L, 5L, 6L, 7L, 0L, 1L, 2L, 0L, 1L, 2L, 4L, 5L, 4L, 
5L, 1L, 2L, 3L, 6L, 1L, 2L, 3L, 7L, 4L, 5L, 7L, 4L, 5L, 7L)
    , p = c(0L, 6L, 9L, 14L, 16L, 20L, 24L, 27L, 30L)
    , Dim = c(8L, 8L)
    , Dimnames = list(NULL, NULL)
    , factors = list()
)

EDIT Feb 10 '17

Another idea (and, again, considering the fact that neighbouring elements form only rectangles/lines/points) is to iterate -in ascending columns- through the [row, col] indices and, in each step, find the distance of each element of its nearest neighbour in current column and row. If a "< 2" distance is found then the element is grouped with its neighbour, else a new group is started. Wrapped into a function:

ff = function(x) 
{
    sm = as.matrix(summary(x))

    gr = integer(nrow(sm)); ngr = 0L ; gr[1] = ngr 

    lastSeenRow = integer(nrow(x))
    lastSeenCol = integer(ncol(x))

    for(k in 1:nrow(sm)) {
        kr = sm[k, 1]; kc = sm[k, 2]
        i = lastSeenRow[kr]
        j = lastSeenCol[kc]

        if(i && (abs(kc - sm[i, 2]) == 1)) gr[k] = gr[i]
        else if(j && (abs(kr - sm[j, 1]) == 1)) gr[k] = gr[j]  
             else { ngr = ngr + 1L; gr[k] = ngr } 

        lastSeenRow[kr] = k
        lastSeenCol[kc] = k        
    }

    sparseMatrix(i = sm[, "i"], j = sm[, "j"], x = gr)                 
}

And applied on "m":

ff(m)
#8 x 8 sparse Matrix of class "dgCMatrix"
#                    
#[1,] 1 1 1 . . . . .
#[2,] 1 1 1 . 4 4 . .
#[3,] 1 1 1 . 4 4 . .
#[4,] . . . . 4 4 . .
#[5,] . . 3 3 . . 7 7
#[6,] 2 . 3 3 . . 7 7
#[7,] 2 . . . 5 . . .
#[8,] 2 . . . . 6 6 6

Also, it's convenient that both functions return groups in the same order, as we can check:

identical(mySolution(m), ff(m))
#[1] TRUE

On a, seemingly, more complex example:

mm = new("ngCMatrix"
    , i = c(25L, 26L, 27L, 25L, 29L, 25L, 25L, 17L, 18L, 26L, 3L, 4L, 5L, 
14L, 17L, 18L, 25L, 27L, 3L, 4L, 5L, 17L, 18L, 23L, 26L, 3L, 
4L, 5L, 10L, 17L, 18L, 9L, 11L, 17L, 18L, 10L, 17L, 18L, 3L, 
17L, 18L, 21L, 17L, 18L, 17L, 18L, 1L, 2L, 3L, 4L, 16L, 8L, 17L, 
18L, 19L, 20L, 21L, 22L, 23L, 24L, 25L, 7L, 9L, 10L, 11L, 26L, 
8L, 27L, 1L, 2L, 28L, 1L, 2L, 15L, 27L, 1L, 2L, 21L, 22L, 1L, 
2L, 7L, 21L, 22L, 1L, 2L, 6L, 24L, 1L, 2L, 5L, 11L, 16L, 25L, 
26L, 27L, 4L, 15L, 17L, 19L, 25L, 26L, 27L, 3L, 16L, 25L, 26L, 
27L, 2L, 28L, 1L)
    , p = c(0L, 0L, 3L, 3L, 5L, 6L, 7L, 7L, 10L, 18L, 25L, 31L, 35L, 38L, 
42L, 44L, 46L, 51L, 61L, 66L, 68L, 71L, 75L, 79L, 84L, 88L, 96L, 
103L, 108L, 110L, 111L)
    , Dim = c(30L, 30L)
    , Dimnames = list(NULL, NULL)
    , factors = list()
)
identical(mySolution(mm), ff(mm))
#[1] TRUE

And a simple benchmark on a larger matrix:

times = 30 # times `dim(mm)`
MM2 = do.call(cbind, rep_len(list(do.call(rbind, rep_len(list(mm), times))), times))
dim(MM2)
#[1] 900 900

system.time({ ans1 = mySolution(MM2) })
#   user  system elapsed 
# 449.50    0.53  463.26

system.time({ ans2 = ff(MM2) })
#   user  system elapsed 
#   0.51    0.00    0.52

identical(ans1, ans2)
#[1] TRUE

Thank you @alexis_laz. I've been trying this code in the past days. The results are correct, and it is faster than my solution. Yet, with veeeery big matrices (>1000x1000) it can still choke the RAM when building the distance matrix. So new approaches are also welcome. — GiuGe, Feb 10 '17 at 07:26
@GiuGe : Yeah, although "dist" does not save the whole distance matrix but its `lower.tri`, it -still- demands significant memory. I 've added another approach which, testing it on some cases, seems to be correct; hope i 've not missed something in the process. — alexis_laz, Feb 10 '17 at 12:14

How to find and name contiguous non-zero entries in a sparse matrix in R?

1 Answers1

Linked