
I have an R function that calculates the Hamming distance of two vectors:

Hamming = function(x,y){
    get_dist = sum(x != y, na.rm=TRUE)
    return(get_dist)
}

I would like to apply it to each pair of corresponding rows of two matrices, M1 and M2, without using a for loop. What I currently have (where L is the number of rows in M1 and M2) is this very time-consuming loop:

xdiff = c()
for(i in 1:L){
    xdiff = c(xdiff, Hamming(M1[i,],M2[i,]))
}

I thought that this could be done by executing

mapply(Hamming, t(M1), t(M2))

(with the transpose because mapply works across columns), but this doesn't generate a length L vector of Hamming distances for each row, so perhaps I'm misunderstanding what mapply is doing.

Is there a straightforward application of mapply or something else in the R apply family that would work?

Max
  • A matrix in `R` is a vector. If you would have the rows of the matrices as elements in two lists, then this could work – Ameya Feb 17 '22 at 22:44
  • I thought this would be the case. Could the issue be due to the fact that M1,M2 were data frames rather than matrix data classes, and should this matter for mapply? – Max Feb 17 '22 at 22:55
  • An `m`-by-`n` data frame is stored as a length-`n` list of length-`m` vectors, and would be handled by `mapply` like any other length-`n` list. How `mapply` handles data frames happens to be irrelevant here, because the transpose of an `m`-by-`n` data frame is an `n`-by-`m` matrix, not an `n`-by-`m` data frame. Try, e.g., `is.matrix(t(data.frame(a=1:2, b=1:2)))`. – Mikael Jagan Feb 17 '22 at 23:11
  • Thanks - I had assumed that an m x n matrix would have been stored as an m-length vector whose elements were n-length vectors. – Max Feb 18 '22 at 23:25
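
The storage point made in the comments above can be checked directly. This is a minimal sketch; the objects `M` and `df` are hypothetical examples, not data from the question.

```r
# A matrix is an atomic vector with a dim attribute, stored column by column.
M <- matrix(1:6, nrow = 2)
length(M)        # number of elements, not number of rows
is.list(M)       # FALSE: not a list of row or column vectors
as.vector(M)     # the underlying vector, in column-major order

# A data frame is a list with one element per column.
df <- data.frame(a = 1:2, b = 3:4)
is.list(df)
length(df)       # number of columns

# Transposing a data frame yields a matrix, not a data frame.
tdf <- t(df)
is.matrix(tdf)
is.data.frame(tdf)
```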

1 Answer


If dim(M1) and dim(M2) are identical, then you can simply do:

rowSums(M1 != M2, na.rm = TRUE)
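
As a quick check, here is a small sketch comparing this with the loop from the question. The matrices are made-up test data (an assumption, not the asker's data), with one NA included to show that na.rm = TRUE drops that comparison. Note that rowSums returns a double vector, whereas Hamming returns integers, so the comparison below is by value rather than by type.

```r
Hamming <- function(x, y) sum(x != y, na.rm = TRUE)

# Hypothetical 3-by-4 test matrices
M1 <- matrix(1:12, nrow = 3)
M2 <- M1
M2[1, 2] <- 0L           # row 1 differs in one position
M2[3, c(1, 4)] <- 0L     # row 3 differs in two positions
M2[2, 3] <- NA           # this comparison is NA and is dropped

row_dist <- rowSums(M1 != M2, na.rm = TRUE)

# The loop from the question, for comparison
L <- nrow(M1)
xdiff <- c()
for (i in 1:L) {
    xdiff <- c(xdiff, Hamming(M1[i, ], M2[i, ]))
}

all(row_dist == xdiff)   # both approaches agree
```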

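Your attempt with mapply didn't work because an m-by-n matrix is stored as a vector of length m*n, and mapply handles it as such: it calls Hamming once for each pair of corresponding elements and returns m*n values rather than one value per row. To accomplish this with mapply, you would need to split each matrix into a list of row vectors: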

mapply(Hamming, asplit(M1, 1L), asplit(M2, 1L))
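
To see the difference, here is a small sketch on made-up 3-by-4 matrices (hypothetical data, not from the question). It assumes R 3.6.0 or later, where asplit is available.

```r
Hamming <- function(x, y) sum(x != y, na.rm = TRUE)

# Hypothetical test matrices
M1 <- matrix(1:12, nrow = 3)
M2 <- M1
M2[1, 2] <- 0L           # row 1 differs in one position
M2[3, c(1, 4)] <- 0L     # row 3 differs in two positions

# Matrices are iterated element by element: one call per element pair
elementwise <- mapply(Hamming, t(M1), t(M2))
length(elementwise)      # nrow(M1) * ncol(M1) values, each 0 or 1

# Lists of rows are iterated row by row: one call per row pair
by_row <- mapply(Hamming, asplit(M1, 1L), asplit(M2, 1L))
length(by_row)           # nrow(M1) values
```

Summing the element-wise result gives the total number of mismatches over the whole matrix, which is why it cannot be read as per-row distances.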

vapply would be better, though:

vapply(seq_len(nrow(M1)), function(i) Hamming(M1[i, ], M2[i, ]), 0L)
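
The last argument, 0L, is the FUN.VALUE template: it tells vapply that every call must return a single integer, and vapply raises an error if a result does not fit that template. If you want the same pattern for other row-wise functions, it could be wrapped up roughly as follows. The helper name row_apply2 and the test matrices are hypothetical, introduced only for illustration.

```r
Hamming <- function(x, y) sum(x != y, na.rm = TRUE)

# Hypothetical helper: apply f to corresponding rows of A and B.
# FUN.VALUE is passed through to vapply as the template for each result.
row_apply2 <- function(f, A, B, FUN.VALUE) {
    stopifnot(identical(dim(A), dim(B)))
    vapply(seq_len(nrow(A)), function(i) f(A[i, ], B[i, ]), FUN.VALUE)
}

# Hypothetical test matrices
M1 <- matrix(1:12, nrow = 3)
M2 <- M1
M2[1, 2] <- 0L
M2[3, c(1, 4)] <- 0L

dists <- row_apply2(Hamming, M1, M2, integer(1))
```

Indexing by row number means only one pair of rows is extracted at a time, and seq_len(nrow(A)) is safe when a matrix has zero rows, unlike 1:nrow(A).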

In any case, just use rowSums.

Mikael Jagan
  • What is the advantage of using vapply rather than mapply in your examples above? The rowSums alternative works fine for this specific example, but I made up the Hamming distance example in order to ask a broader question of how to use apply for vectors in two matrices, so mapply/vapply are the best general solutions. – Max Feb 21 '22 at 23:19
  • The `mapply` call has more overhead, for a few reasons: (1) `mapply` requires memory for all of the row vectors to be allocated up front and for the entire duration of the call, whereas `vapply` only uses one row vector at a time. (2) `mapply` constructs an intermediate list containing all of the results, before unlisting them into an array, whereas `vapply` copies results as they are generated into a preallocated array. (3) Each `asplit` call involves an R level `for` loop, which is quite slow. – Mikael Jagan Feb 22 '22 at 06:43
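
If you want to measure the overhead yourself, a rough timing sketch could look like this. The matrix sizes and random binary data are arbitrary assumptions, and timings will vary by machine and R version, so treat the numbers as indicative only.

```r
Hamming <- function(x, y) sum(x != y, na.rm = TRUE)

# Hypothetical random binary matrices; sizes chosen arbitrarily
set.seed(1)
n <- 2000L
p <- 50L
M1 <- matrix(sample(0:1, n * p, replace = TRUE), nrow = n)
M2 <- matrix(sample(0:1, n * p, replace = TRUE), nrow = n)

t_mapply <- system.time(
    d_mapply <- mapply(Hamming, asplit(M1, 1L), asplit(M2, 1L))
)
t_vapply <- system.time(
    d_vapply <- vapply(seq_len(n), function(i) Hamming(M1[i, ], M2[i, ]), 0L)
)
t_rowsums <- system.time(
    d_rowsums <- rowSums(M1 != M2, na.rm = TRUE)
)

# All three approaches should give the same distances
stopifnot(all(d_mapply == d_vapply), all(d_vapply == d_rowsums))

# Elapsed seconds for each approach
print(c(mapply  = t_mapply[["elapsed"]],
        vapply  = t_vapply[["elapsed"]],
        rowSums = t_rowsums[["elapsed"]]))
```

For more stable measurements, repeat each call several times or use a dedicated benchmarking package.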