r - Binding sparse matrices of different sizes on rows

Question

I am attempting to use the Matrix package to bind two sparse matrices of different sizes together. The binding is on rows, using the column names for matching.

Table A:

ID     | AAAA   | BBBB   |
------ | ------ | ------ |
XXXX   | 1      | 2      |

Table B:

ID     | BBBB   | CCCC   |
------ | ------ | ------ |
YYYY   | 3      | 4      |

Binding table A and B:

ID     | AAAA   | BBBB   | CCCC   |
------ | ------ | ------ | ------ |
XXXX   | 1      | 2      |        |
YYYY   |        | 3      | 4      |

The intention is to insert a large number of small matrices into a single large matrix, to enable continuous querying and update/inserts.

I find that neither the Matrix nor the slam package has functionality to handle this.

Similar questions have been asked in the past, but it seems no solution has been found:

Post 1: in-r-when-using-named-rows-can-a-sparse-matrix-column-be-added-concatenated

Post 2: bind-together-sparse-model-matrices-by-row-names

Ideas on how to solve it will be highly appreciated.

Best regards,

Frederik

score 6 · Answer 1 · answered Sep 08 '18 at 14:15

For my purposes (very sparse matrix with millions of rows, and tens of thousands of columns, more than 99.9% of the values empty) this was still much too slow. What worked was the code below - might be helpful to others as well:

merge.sparse = function(listMatrixes) {
  # takes a list of sparse matrixes with different columns and adds them row wise

  allColnames <- sort(unique(unlist(lapply(listMatrixes,colnames))))
  matrixToReturn <- NULL  # local accumulator; exists() could also match a global variable
  for (currentMatrix in listMatrixes) {
    newColLocations <- match(colnames(currentMatrix),allColnames)
    # summary() returns the stored entries as (i, j, x) triplets, so indices and
    # values stay aligned even with negative values (which(currentMatrix>0) skips those)
    indexes <- summary(currentMatrix)
    newColumns <- newColLocations[indexes$j]
    rows <- indexes$i
    # use nrow() rather than max(rows) so that trailing all-zero rows are kept
    newMatrix <- sparseMatrix(i=rows,j=newColumns, x=indexes$x,
                              dims=c(nrow(currentMatrix),length(allColnames)))
    if (is.null(matrixToReturn)) {
      matrixToReturn <- newMatrix
    }
    else {
      matrixToReturn <- rbind2(matrixToReturn,newMatrix)
    }
  }
  colnames(matrixToReturn) <- allColnames
  matrixToReturn  
}

score 4 · IBrum · Accepted Answer · 2017-03-30T14:00:53.987

It looks like it's necessary to add empty columns (columns of 0s) to the matrices to make them compatible for an rbind (matrices with the same column names, in the same order). The following code does it:

library(Matrix)

# dummy data
set.seed(3344)
A = Matrix(matrix(rbinom(16, 2, 0.2), 4))
colnames(A)=letters[1:4]
B = Matrix(matrix(rbinom(9, 2, 0.2), 3))
colnames(B) = letters[3:5]

# finding what's missing
misA = colnames(B)[!colnames(B) %in% colnames(A)]
misB = colnames(A)[!colnames(A) %in% colnames(B)]

misAl = as.vector(numeric(length(misA)), "list")
names(misAl) = misA
misBl = as.vector(numeric(length(misB)), "list")
names(misBl) = misB

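## adding missing columns to initial matrices
## (A and B are wrapped in list() so that c() builds an argument list
## regardless of how c() treats Matrix objects)
An = do.call(cbind, c(list(A), misAl))
Bn = do.call(cbind, c(list(B), misBl))[,colnames(An)]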

# final bind
rbind(An, Bn)

edited Mar 30 '17 at 14:00

answered Mar 30 '17 at 13:52


Thanks, really fast solution. Merging two sparse matrices with dimensions 100.000x5 and 10x5 takes 8 ms. – Frederik Andersen Apr 04 '17 at 14:06

The `colnames(B)[!colnames(B) %in% colnames(A)]` (etc.) is not very readable (nor fast); I would suggest replacing it with `setdiff(colnames(B), colnames(A))` etc. – plijnzaad Oct 21 '20 at 17:22

@plijnzaad has a good point, with a good alternative. – IBrum Apr 05 '21 at 17:07
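
The setdiff() idea from the comments could look roughly like the sketch below. The helper names pad_cols and bind_rows_by_colname are made up for illustration and are not part of the Matrix package; the example matrices reuse the values from the question and are built with sparse = TRUE so that they are stored as sparse matrices.

```r
library(Matrix)

# Hypothetical helper: add all-zero sparse columns for every name in all_cols
# that M lacks, then reorder the columns to match all_cols.
pad_cols <- function(M, all_cols) {
  missing_cols <- setdiff(all_cols, colnames(M))
  if (length(missing_cols) > 0) {
    zeros <- sparseMatrix(i = integer(0), j = integer(0), x = numeric(0),
                          dims = c(nrow(M), length(missing_cols)),
                          dimnames = list(NULL, missing_cols))
    M <- cbind(M, zeros)
  }
  M[, all_cols, drop = FALSE]
}

# Hypothetical wrapper: row-bind two matrices on the union of their column names.
bind_rows_by_colname <- function(A, B) {
  all_cols <- union(colnames(A), colnames(B))
  rbind(pad_cols(A, all_cols), pad_cols(B, all_cols))
}

A <- Matrix(c(1, 2), 1, sparse = TRUE,
            dimnames = list("XXXX", c("AAAA", "BBBB")))
B <- Matrix(c(3, 4), 1, sparse = TRUE,
            dimnames = list("YYYY", c("BBBB", "CCCC")))
C <- bind_rows_by_colname(A, B)
C
```

Because the padding columns are created as empty sparse matrices, no dense intermediate is built.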

score 3 · Answer 3 · edited Jun 20 '20 at 09:12

Starting from Valentin's answer above, I made my own merge.sparse function, to achieve the following:

keep both column and row names (and of course take them into account when merging)
keep the original order of the row and column names, only merging common ones

The code below seems to do that:

if (length(find.package(package="Matrix",quiet=TRUE))==0) install.packages("Matrix")
require(Matrix)

merge.sparse <- function(...) {
  
  cnnew <- character()
  rnnew <- character()
  x <- vector()
  i <- numeric()
  j <- numeric()
  
  for (M in list(...)) {
  
    cnold <- colnames(M)
    rnold <- rownames(M)
  
    cnnew <- union(cnnew,cnold)
    rnnew <- union(rnnew,rnold)
  
    cindnew <- match(cnold,cnnew)
    rindnew <- match(rnold,rnnew)
    ind <- unname(which(M != 0,arr.ind=T))
    i <- c(i,rindnew[ind[,1]])
    j <- c(j,cindnew[ind[,2]])
    x <- c(x,M@x)
  }
  
  sparseMatrix(i=i,j=j,x=x,dims=c(length(rnnew),length(cnnew)),dimnames=list(rnnew,cnnew))
}

I tested it with the following data:

df1 <- data.frame(x=c("N","R","R","S","T","T","U"),y=c("N","N","M","X","X","Z","Z"))
M1 <- xtabs(~y+x,df1,sparse=T)
df2 <- data.frame(x=c("S","S","T","T","U","V","V","W","W","X"),y=c("N","M","M","K","Z","M","N","N","K","Z"))
M2 <- xtabs(~y+x,df2,sparse=T)
df3 <- data.frame(x=c("A","C","C","B"),y=c("N","M","Z","K"))
M3 <- xtabs(~y+x,df3,sparse=T)
df4 <- data.frame(x=c("N","R","R","S","T","T","U"),y=c("F","F","G","G","H","I","L"))
M4 <- xtabs(~y+x,df4,sparse=T)
df5 <- data.frame(x=c("K1","K2","K3","K4"),y=c("J1","J2","J3","J4"))
M5 <- xtabs(~y+x,df5,sparse=T)

Which gave:

Ms <- merge.sparse(M1,M2,M3,M4,M5)
as.matrix(Ms)
#   N R S T U V W X A B C K1 K2 K3 K4
#M  0 1 1 1 0 1 0 0 0 0 1  0  0  0  0
#N  1 1 1 0 0 1 1 0 1 0 0  0  0  0  0
#X  0 0 1 1 0 0 0 0 0 0 0  0  0  0  0
#Z  0 0 0 1 2 0 0 1 0 0 1  0  0  0  0
#K  0 0 0 1 0 0 1 0 0 1 0  0  0  0  0
#F  1 1 0 0 0 0 0 0 0 0 0  0  0  0  0
#G  0 1 1 0 0 0 0 0 0 0 0  0  0  0  0
#H  0 0 0 1 0 0 0 0 0 0 0  0  0  0  0
#I  0 0 0 1 0 0 0 0 0 0 0  0  0  0  0
#L  0 0 0 0 1 0 0 0 0 0 0  0  0  0  0
#J1 0 0 0 0 0 0 0 0 0 0 0  1  0  0  0
#J2 0 0 0 0 0 0 0 0 0 0 0  0  1  0  0
#J3 0 0 0 0 0 0 0 0 0 0 0  0  0  1  0
#J4 0 0 0 0 0 0 0 0 0 0 0  0  0  0  1
Ms
#14 x 15 sparse Matrix of class "dgCMatrix"
#   [[ suppressing 15 column names ‘N’, ‘R’, ‘S’ ... ]]
#                                
#M  . 1 1 1 . 1 . . . . 1 . . . .
#N  1 1 1 . . 1 1 . 1 . . . . . .
#X  . . 1 1 . . . . . . . . . . .
#Z  . . . 1 2 . . 1 . . 1 . . . .
#K  . . . 1 . . 1 . . 1 . . . . .
#F  1 1 . . . . . . . . . . . . .
#G  . 1 1 . . . . . . . . . . . .
#H  . . . 1 . . . . . . . . . . .
#I  . . . 1 . . . . . . . . . . .
#L  . . . . 1 . . . . . . . . . .
#J1 . . . . . . . . . . . 1 . . .
#J2 . . . . . . . . . . . . 1 . .
#J3 . . . . . . . . . . . . . 1 .
#J4 . . . . . . . . . . . . . . 1

I don't know why column names are 'suppressed' when trying to display the merged sparse matrix Ms; converting to a non-sparse matrix does bring them back, so...
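
On the suppressed column names: as far as I know, the print method for sparse matrices in Matrix accepts a col.names argument and by default hides the names of wider matrices to keep the display compact. A small sketch on a made-up 2 x 12 matrix (not the Ms from above):

```r
library(Matrix)

# Made-up wide matrix, used only to illustrate the printing behaviour
M <- sparseMatrix(i = 1:2, j = c(1, 12), x = c(1, 2), dims = c(2, 12),
                  dimnames = list(c("r1", "r2"), paste0("c", 1:12)))

print(M)                    # default display for a wide sparse matrix
print(M, col.names = TRUE)  # explicitly ask for the column names
```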

Also, I noticed that when the same 'coordinates' are included multiple times, the sparse matrix contains the sum of the corresponding values in x (see row "Z", column "U", which is 1 in both M1 and M2). Maybe there is a way to change that, but for my applications this is fine.
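
One possible way to change the summing behaviour is the use.last.ij argument of sparseMatrix(), which keeps only the last value given for a repeated (i, j) pair. A minimal sketch with made-up values:

```r
library(Matrix)

# The same coordinate (1, 1) is supplied twice, with values 1 and 2
i <- c(1, 1)
j <- c(1, 1)
x <- c(1, 2)

# Default: values at duplicated coordinates are summed
summed <- sparseMatrix(i = i, j = j, x = x, dims = c(2, 2))

# use.last.ij = TRUE: only the last value for a duplicated coordinate is kept
last <- sparseMatrix(i = i, j = j, x = x, dims = c(2, 2), use.last.ij = TRUE)

summed[1, 1]  # 3
last[1, 1]    # 2
```

Whether "last wins" is the right rule depends on the order in which the matrices are merged.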

I thought I'd share this code in case anyone else needs to merge sparse matrices this way, and in case someone can test it on large matrices and suggest performance improvements.

EDIT

After checking this post I found that the extraction of the information about (non-zero) elements of the sparse matrix can be done much more easily by summary, without using which.

So this part of my code above:

ind <- unname(which(M != 0,arr.ind=T))
i <- c(i,rindnew[ind[,1]])
j <- c(j,cindnew[ind[,2]])
x <- c(x,M@x)

can be replaced by:

ind <- summary(M)
i <- c(i,rindnew[ind[,1]])
j <- c(j,cindnew[ind[,2]])
x <- c(x,ind[,3])

Now I don't know which of these is computationally more efficient, or if there is an even easier way to do this by changing the dimensions of the matrices and then just summing them, but this seems to work for me, so...
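
For readers unfamiliar with summary() on a sparse matrix, here is a minimal sketch on a made-up 2 x 3 matrix (not the test data above). It shows the (i, j, x) triplets that summary() returns and that they are enough to rebuild the matrix:

```r
library(Matrix)

# Made-up example matrix (an assumption for illustration only)
M <- sparseMatrix(i = c(1, 2, 2), j = c(1, 2, 3), x = c(4, 5, 6),
                  dims = c(2, 3),
                  dimnames = list(c("r1", "r2"), c("a", "b", "c")))

# One row per stored entry: row index i, column index j, value x
trip <- summary(M)
trip

# Rebuilding from the triplets gives back the same matrix
rebuilt <- sparseMatrix(i = trip$i, j = trip$j, x = trip$x,
                        dims = dim(M), dimnames = dimnames(M))
```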

I tested all the tips here and found that your code is the fastest one. — YWu, Sep 07 '21 at 11:37

dww · Answer 4 · 2017-03-31T02:22:39.433

We can create an empty sparse Matrix that has all the rows and columns, then insert the values into it using subset assignment:

my.bind = function(A, B){
  C = Matrix(0, nrow = NROW(A) + NROW(B), ncol = length(union(colnames(A), colnames(B))), 
             dimnames = list(c(rownames(A), rownames(B)), union(colnames(A), colnames(B))))
  C[rownames(A), colnames(A)] = A
  C[rownames(B), colnames(B)] = B
  return(C)
}

my.bind(A,B)
# 2 x 3 sparse Matrix of class "dgCMatrix"
#      AAAA BBBB CCCC
# XXXX    1    2    .
# YYYY    .    3    4

Note that the above assumes that A and B do not share row names. If there are shared row names, then you should use row numbers instead of names for the assignment.
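
A sketch of the row-number variant, for the case where row names may be shared. The name bind_by_position is hypothetical, and the empty target is built here with sparseMatrix() rather than Matrix(0, ...), which is a substitution made for this example only:

```r
library(Matrix)

# Hypothetical variant of my.bind that assigns by row position, not row name
bind_by_position <- function(A, B) {
  cols <- union(colnames(A), colnames(B))
  # keep row names only if both inputs have them (duplicates are allowed)
  rn <- if (!is.null(rownames(A)) && !is.null(rownames(B))) {
    c(rownames(A), rownames(B))
  }
  C <- sparseMatrix(i = integer(0), j = integer(0), x = numeric(0),
                    dims = c(nrow(A) + nrow(B), length(cols)),
                    dimnames = list(rn, cols))
  C[seq_len(nrow(A)), colnames(A)] <- A
  C[nrow(A) + seq_len(nrow(B)), colnames(B)] <- B
  C
}

# Both inputs use the same row name "r1"
A <- Matrix(c(1, 2), 1, sparse = TRUE, dimnames = list("r1", c("a", "b")))
B <- Matrix(c(3, 4), 1, sparse = TRUE, dimnames = list("r1", c("b", "c")))
C <- bind_by_position(A, B)
C
```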

The data:

library(Matrix)
A = Matrix(c(1,2), 1, dimnames = list('XXXX', c('AAAA','BBBB')))
B = Matrix(c(3,4), 1, dimnames = list('YYYY', c('BBBB','CCCC')))

Thanks. Elegant solution, but a bit slow on larger matrices. I tried merging two sparse matrices with the dimensions: 100.000x5 and 10x5. It takes 4,3 seconds. — Frederik Andersen, Apr 04 '17 at 14:03

score 0 · Answer 5 · answered Jan 04 '18 at 10:14

If one needs to combine/concatenate many small sparse matrices into one large sparse matrix, it's much better and more efficient to use a mapping of global and local row and column indices to construct a large sparse matrix. E.g.,

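# Inputs are not defined in this snippet. Judging by the names, localPairRowColInds
# is a two-column matrix of local (row, col) index pairs, localPairVals holds the
# matching values, and globalRowInds / globalColInds map local to global indices.
globalInds <- matrix(NA, nrow=nrow(localPairRowColInds), ncol=2)

# extract the corresponding global row and column indices for the local indices
globalInds[ , 1] <- globalRowInds[ localPairRowColInds[,1] ] 
globalInds[ , 2] <- globalColInds[ localPairRowColInds[,2] ]

# append the (global row, global col, value) triplets to a file
write.table(cbind(globalInds, localPairVals), file=dataFname, append = TRUE, sep = " ", row.names = FALSE, col.names = FALSE)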
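
A self-contained sketch of the same idea, kept in memory instead of written to a file. The blocks and all variable names below are made up for illustration; the global index space is assumed to be the union of the row and column names of all blocks.

```r
library(Matrix)

# Hypothetical inputs: two small blocks with their own (local) dimnames
blocks <- list(
  sparseMatrix(i = 1, j = 2, x = 5, dims = c(1, 2),
               dimnames = list("r1", c("a", "b"))),
  sparseMatrix(i = c(1, 2), j = c(1, 2), x = c(7, 9), dims = c(2, 2),
               dimnames = list(c("r2", "r3"), c("b", "c")))
)

# Global index space: every row and column name seen in any block
global_rows <- unique(unlist(lapply(blocks, rownames)))
global_cols <- unique(unlist(lapply(blocks, colnames)))

# Map each block's local (i, j) pairs to global ones and collect the triplets
triplets <- lapply(blocks, function(M) {
  s <- summary(M)
  data.frame(i = match(rownames(M), global_rows)[s$i],
             j = match(colnames(M), global_cols)[s$j],
             x = s$x)
})
triplets <- do.call(rbind, triplets)

# Build the large matrix once from all collected triplets
big <- sparseMatrix(i = triplets$i, j = triplets$j, x = triplets$x,
                    dims = c(length(global_rows), length(global_cols)),
                    dimnames = list(global_rows, global_cols))
big
```

Constructing the large matrix in a single sparseMatrix() call avoids repeatedly copying a growing matrix, which is presumably where the efficiency gain comes from.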
