Starting from Valentin's answer above, I made my own merge.sparse function, to achieve the following:
- keep both column and row names (and of course take them into account when merging)
- keep the original order of the row and column names, only merging common ones
The code below seems to do that:
if (length(find.package(package = "Matrix", quiet = TRUE)) == 0) install.packages("Matrix")
library(Matrix)

merge.sparse <- function(...) {
  cnnew <- character()  # union of the column names seen so far
  rnnew <- character()  # union of the row names seen so far
  x <- vector()
  i <- numeric()
  j <- numeric()

  for (M in list(...)) {
    cnold <- colnames(M)
    rnold <- rownames(M)

    # New names are appended; names already seen keep their position
    cnnew <- union(cnnew, cnold)
    rnnew <- union(rnnew, rnold)

    # Position of each row/column of M in the merged matrix
    cindnew <- match(cnold, cnnew)
    rindnew <- match(rnold, rnnew)

    # Row/column indices of the non-zero entries of M
    ind <- unname(which(M != 0, arr.ind = TRUE))
    i <- c(i, rindnew[ind[, 1]])
    j <- c(j, cindnew[ind[, 2]])
    x <- c(x, M@x)
  }

  sparseMatrix(i = i, j = j, x = x,
               dims = c(length(rnnew), length(cnnew)),
               dimnames = list(rnnew, cnnew))
}
I tested it with the following data:
df1 <- data.frame(x = c("N","R","R","S","T","T","U"),
                  y = c("N","N","M","X","X","Z","Z"))
M1 <- xtabs(~y + x, df1, sparse = TRUE)
df2 <- data.frame(x = c("S","S","T","T","U","V","V","W","W","X"),
                  y = c("N","M","M","K","Z","M","N","N","K","Z"))
M2 <- xtabs(~y + x, df2, sparse = TRUE)
df3 <- data.frame(x = c("A","C","C","B"),
                  y = c("N","M","Z","K"))
M3 <- xtabs(~y + x, df3, sparse = TRUE)
df4 <- data.frame(x = c("N","R","R","S","T","T","U"),
                  y = c("F","F","G","G","H","I","L"))
M4 <- xtabs(~y + x, df4, sparse = TRUE)
df5 <- data.frame(x = c("K1","K2","K3","K4"),
                  y = c("J1","J2","J3","J4"))
M5 <- xtabs(~y + x, df5, sparse = TRUE)
Which gave:
Ms <- merge.sparse(M1,M2,M3,M4,M5)
as.matrix(Ms)
#   N R S T U V W X A B C K1 K2 K3 K4
#M  0 1 1 1 0 1 0 0 0 0 1  0  0  0  0
#N  1 1 1 0 0 1 1 0 1 0 0  0  0  0  0
#X  0 0 1 1 0 0 0 0 0 0 0  0  0  0  0
#Z  0 0 0 1 2 0 0 1 0 0 1  0  0  0  0
#K  0 0 0 1 0 0 1 0 0 1 0  0  0  0  0
#F  1 1 0 0 0 0 0 0 0 0 0  0  0  0  0
#G  0 1 1 0 0 0 0 0 0 0 0  0  0  0  0
#H  0 0 0 1 0 0 0 0 0 0 0  0  0  0  0
#I  0 0 0 1 0 0 0 0 0 0 0  0  0  0  0
#L  0 0 0 0 1 0 0 0 0 0 0  0  0  0  0
#J1 0 0 0 0 0 0 0 0 0 0 0  1  0  0  0
#J2 0 0 0 0 0 0 0 0 0 0 0  0  1  0  0
#J3 0 0 0 0 0 0 0 0 0 0 0  0  0  1  0
#J4 0 0 0 0 0 0 0 0 0 0 0  0  0  0  1
Ms
#14 x 15 sparse Matrix of class "dgCMatrix"
#   [[ suppressing 15 column names ‘N’, ‘R’, ‘S’ ... ]]
#
#M  . 1 1 1 . 1 . . . . 1 . . . .
#N  1 1 1 . . 1 1 . 1 . . . . . .
#X  . . 1 1 . . . . . . . . . . .
#Z  . . . 1 2 . . 1 . . 1 . . . .
#K  . . . 1 . . 1 . . 1 . . . . .
#F  1 1 . . . . . . . . . . . . .
#G  . 1 1 . . . . . . . . . . . .
#H  . . . 1 . . . . . . . . . . .
#I  . . . 1 . . . . . . . . . . .
#L  . . . . 1 . . . . . . . . . .
#J1 . . . . . . . . . . . 1 . . .
#J2 . . . . . . . . . . . . 1 . .
#J3 . . . . . . . . . . . . . 1 .
#J4 . . . . . . . . . . . . . . 1
I don't know why column names are 'suppressed' when displaying the merged sparse matrix Ms; converting it to a non-sparse matrix does bring them back.
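For what it's worth, the suppression appears to come from how the Matrix package prints sparse matrices, not from the merge itself. As far as I can tell from the Matrix documentation, column names are hidden by default once a sparse matrix has ten or more columns, and the col.names argument of the print method (or the sparse.colnames option) overrides that. A minimal sketch, using a made-up 2 x 12 matrix rather than Ms:

```r
library(Matrix)

# Hypothetical example matrix with 12 named columns (not from the data above)
M <- sparseMatrix(i = c(1, 2), j = c(1, 12), x = c(1, 2),
                  dims = c(2, 12),
                  dimnames = list(c("a", "b"), paste0("col", 1:12)))

# Default printing hides the column names for a matrix this wide
print(M)

# Explicitly ask for the column names to be shown
print(M, col.names = TRUE)
```

The same call on the merged matrix would be print(Ms, col.names = TRUE).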
Also, I noticed that when the same 'coordinates' are included multiple times, the sparse matrix contains the sum of the corresponding values in x (see row "Z", column "U", which is 1 in both M1 and M2). Maybe there is a way to change that, but for my applications this is fine.
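The summing of repeated (i, j) pairs is what sparseMatrix does by default. Recent versions of Matrix also have a use.last.ij argument that keeps only the last value for a repeated pair. The sketch below uses made-up triplets and assumes your installed Matrix version supports that argument:

```r
library(Matrix)

# Two triplets point at the same cell (1, 1); the values are made up
i <- c(1, 1, 2)
j <- c(1, 1, 2)
x <- c(1, 5, 7)

# Default behaviour: values of duplicated (i, j) pairs are added together
summed <- sparseMatrix(i = i, j = j, x = x, dims = c(2, 2))
summed[1, 1]

# use.last.ij = TRUE: only the last value of a duplicated pair is kept
last <- sparseMatrix(i = i, j = j, x = x, dims = c(2, 2), use.last.ij = TRUE)
last[1, 1]
```

In merge.sparse this would mean that, for overlapping cells, the matrix passed last wins instead of the values being added up.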
I thought I'd share this code in case anyone else needs to merge sparse matrices this way, and in case someone can test it on large matrices and suggest performance improvements.
EDIT
After checking this post I found that the extraction of the information about (non-zero) elements of the sparse matrix can be done much more easily by summary, without using which.
So this part of my code above:
    ind <- unname(which(M != 0, arr.ind = TRUE))
    i <- c(i, rindnew[ind[, 1]])
    j <- c(j, cindnew[ind[, 2]])
    x <- c(x, M@x)
can be replaced by:
    ind <- summary(M)
    i <- c(i, rindnew[ind[, 1]])
    j <- c(j, cindnew[ind[, 2]])
    x <- c(x, ind[, 3])
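To make it clearer what summary returns here, a small sketch on a made-up numeric sparse matrix: the result behaves like a data frame with one row per stored entry and columns i, j and x, so the indices and the values come from the same object and stay aligned.

```r
library(Matrix)

# Hypothetical 2 x 3 matrix with three stored entries
M <- sparseMatrix(i = c(1, 2, 2), j = c(1, 1, 3), x = c(4, 5, 6),
                  dims = c(2, 3))

s <- summary(M)
names(s)  # column names of the triplet table
s$i       # row indices of the stored entries
s$j       # column indices of the stored entries
s$x       # the stored values
```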
Now I don't know which of these is computationally more efficient, or if there is an even easier way to do this by changing the dimensions of the matrices and then just summing them, but this seems to work for me.
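For anyone who wants to try the "change the dimensions, then sum" idea, here is a rough sketch. The helper names expand.sparse and merge.by.sum are made up for illustration, the input matrices are toy data, and I have not benchmarked this against merge.sparse, so treat it as a starting point only:

```r
library(Matrix)

# Hypothetical helper: pad M with empty rows/columns so that its
# dimnames become exactly (rn, cn)
expand.sparse <- function(M, rn, cn) {
  s <- summary(M)
  sparseMatrix(i = match(rownames(M), rn)[s$i],
               j = match(colnames(M), cn)[s$j],
               x = s$x,
               dims = c(length(rn), length(cn)),
               dimnames = list(rn, cn))
}

# Hypothetical alternative to merge.sparse: expand every matrix to the
# union of all row/column names, then add the expanded matrices
merge.by.sum <- function(...) {
  Ms <- list(...)
  rn <- Reduce(union, lapply(Ms, rownames))
  cn <- Reduce(union, lapply(Ms, colnames))
  Reduce(`+`, lapply(Ms, expand.sparse, rn = rn, cn = cn))
}

# Toy input: both matrices have a value in cell (r1, c2)
A <- sparseMatrix(i = c(1, 1), j = c(1, 2), x = c(1, 1), dims = c(1, 2),
                  dimnames = list("r1", c("c1", "c2")))
B <- sparseMatrix(i = c(1, 2), j = c(1, 2), x = c(2, 3), dims = c(2, 2),
                  dimnames = list(c("r1", "r2"), c("c2", "c3")))
R <- merge.by.sum(A, B)
R
```

Like merge.sparse, this keeps names in order of first appearance and adds up values that fall on the same cell.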