4

I am trying to replace the NAs in each column of a matrix with the median of that column. However, when I try to use lapply or sapply I get an error. The code works when I use a for-loop and when I change one column at a time. What am I doing wrong?

Example:

set.seed(1928)
mat <- matrix(rnorm(100*110), ncol = 110)
mat[sample(1:length(mat), 700, replace = FALSE)] <- NA
mat1 <- mat2 <- mat

mat1 <- lapply(mat1,
  function(n) {
     mat1[is.na(mat1[,n]),n] <- median(mat1[,n], na.rm = TRUE)
  }
)   

for (n in 1:ncol(mat2)) {
  mat2[is.na(mat2[,n]),n] <- median(mat2[,n], na.rm = TRUE)
}
smci
Jonno Bourne
  • `matrix` objects are vectors with dimensions. `lapply` will loop over every single value in the matrix instead of the columns. – thelatemail Jan 18 '16 at 23:18
  • If you're feeling super lazy and don't want to write your own function, you can use `na.roughfix` from the `randomForest` library. It automatically replaces all NA values with median/mode depending on whether it is numeric/factor. – ytk Jan 19 '16 at 01:35
  • @Jonno Bourne, if you're asking about dataframes, not matrices, please edit your reproducible example to give a dataframe. Mind you, that would invalidate the accepted solution... – smci Jun 05 '17 at 06:01
  • @smci The question doesn't mention dataframes and was successfully answered, using matrices, a year and a half ago. Can you clarify your comment? – Jonno Bourne Jun 10 '17 at 12:51
  • @JonnoBourne: I know it was answered; that's my point. This vaguely worded question was **[being (wrongly) cited as a canonical answer elsewhere on SO](https://stackoverflow.com/questions/44362281/how-do-i-change-na-into-column-median#comment75726792_44362281)** for replacing NAs in dataframes. The vague title didn't make clear that it wasn't applicable to dataframes, so the title needed editing. (It turns out there is no canonical answer for "replacing NAs in dataframes by column medians".) So we need to prevent questions on that topic from wrongly being closed as duplicates of this one. Ok? – smci Jun 11 '17 at 03:57
  • Given the context, your edits are sensible. If you had deleted your comment after making them, it would have been less confusing. – Jonno Bourne Jun 12 '17 at 11:07

4 Answers

7

I would suggest vectorizing this using the matrixStats package instead of calculating a median per column using either of the loops (sapply is also a loop, in the sense that it evaluates a function in each iteration).

First, we create an index of the NA positions (row and column of each NA):

indx <- which(is.na(mat), arr.ind = TRUE)

Then, replace the NAs with the precalculated column medians, using the column part of the index to pick the right median for each NA:

mat[indx] <- matrixStats::colMedians(mat, na.rm = TRUE)[indx[, 2]]
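
To see what the two steps above do, here is a minimal sketch on a small made-up matrix `m` (hypothetical data, not the question's `mat`). It assumes matrixStats may not be installed, so it uses `apply(m, 2, median, na.rm = TRUE)` as a base-R stand-in for `matrixStats::colMedians`; the indexing logic is otherwise the same.

```r
# Small illustrative matrix, filled column by column:
# column 1 = 1, NA, 3, 5; column 2 = NA, 2, 4, 6; column 3 = 7, 8, NA, 10
m <- matrix(c(1, NA, 3, 5,
              NA, 2, 4, 6,
              7, 8, NA, 10), nrow = 4)

# Two-column matrix of (row, col) positions, one row per NA
na_idx <- which(is.na(m), arr.ind = TRUE)

# Column medians; base-R stand-in for matrixStats::colMedians(m, na.rm = TRUE)
col_med <- apply(m, 2, median, na.rm = TRUE)

# Indexing a matrix by a two-column matrix addresses individual cells,
# and na_idx[, "col"] selects the median of the column each NA sits in
filled <- m
filled[na_idx] <- col_med[na_idx[, "col"]]

print(filled)
```

Because the replacement happens in a single indexed assignment, no explicit loop over columns is needed once the medians are known.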
David Arenburg
  • I actually realized I wanted to do it on a data frame, but I could convert to a matrix, perform this operation, and then convert back. Thanks. – Jonno Bourne Jan 21 '16 at 10:05
  • In most cases, when you have a numerical data set, it is much more efficient to work with a matrix rather than a `data.frame`, even when you use a simple loop. – David Arenburg Jan 21 '16 at 13:53
2

You can use sweep:

sweep(mat, MARGIN = 2, 
      STATS = apply(mat, 2, median, na.rm=TRUE),
      FUN =  function(x,s) ifelse(is.na(x), s, x)
    )

EDIT: You can also drop in STATS=matrixStats::colMedians(mat, na.rm=TRUE) for a little more performance.
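
As a quick check of the sweep approach, the sketch below runs it on a small made-up matrix `m` (hypothetical data) and compares the result with a plain for-loop. Note that `sweep` returns a new matrix rather than modifying its input, so the result has to be assigned.

```r
# Small illustrative matrix with one NA per column
m <- matrix(c(1, NA, 3, 5,
              NA, 2, 4, 6,
              7, 8, NA, 10), nrow = 4)

col_med <- apply(m, 2, median, na.rm = TRUE)

# sweep() expands col_med to the shape of m, then FUN picks the median
# wherever the original value is NA and the original value otherwise
filled_sweep <- sweep(m, MARGIN = 2, STATS = col_med,
                      FUN = function(x, s) ifelse(is.na(x), s, x))

# Reference result from an explicit loop over columns
filled_loop <- m
for (j in seq_len(ncol(m))) {
  filled_loop[is.na(filled_loop[, j]), j] <- col_med[j]
}

print(isTRUE(all.equal(filled_sweep, filled_loop)))
```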

Neal Fultz
1

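lapply loops over a list; called on a matrix, it visits every single element rather than every column. Do you mean to loop over the columns?

matx <- sapply(seq_len(ncol(mat1)), function(n) {
  col <- mat1[, n]
  col[is.na(col)] <- median(col, na.rm = TRUE)
  col  # return the filled column so sapply binds the columns into a matrix
})

though that's essentially just doing what your loop example does (and, since sapply is itself a loop, it is unlikely to be noticeably faster).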
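
The difference between looping over cells and looping over columns can be seen on a tiny made-up matrix `m` (hypothetical data). This is only a sketch; it uses `apply` with `MARGIN = 2` as one possible way to hand whole columns to a function, and the function must return the filled column for the result to be a matrix.

```r
# 3 x 2 matrix: column 1 = 1, NA, 3; column 2 = NA, 5, 6
m <- matrix(c(1, NA, 3, NA, 5, 6), ncol = 2)

# lapply() treats the matrix as a plain vector: one call per cell
per_cell <- lapply(m, function(x) x)
print(length(per_cell))

# apply(..., 2, ...) passes each column to the function in turn
filled <- apply(m, 2, function(col) {
  col[is.na(col)] <- median(col, na.rm = TRUE)
  col  # the value of the function is the filled column
})

print(filled)
```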

Jonathan Carroll
0

You could possibly get there more easily by converting to a data.frame and using vapply, which returns a matrix as its result:

vapply(as.data.frame(mat1), function(x)
   replace(x, is.na(x), median(x,na.rm=TRUE)), FUN.VALUE=numeric(nrow(mat1)) 
)
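
If the data actually lives in a data frame, as the asker mentions in a comment, a similar idea can be applied without converting back and forth. The following is a minimal sketch on a made-up data frame `df` (hypothetical columns `a`, `b`, `label`); it assumes only the numeric columns should be filled and leaves other columns untouched.

```r
# Hypothetical data frame with two numeric columns and one character column
df <- data.frame(a = c(1, NA, 3),
                 b = c(NA, 5, 6),
                 label = c("x", "y", "z"),
                 stringsAsFactors = FALSE)

# Logical vector marking which columns are numeric
num_cols <- vapply(df, is.numeric, logical(1))

# A data frame is a list of columns, so lapply() visits one column at a time;
# assigning into df[num_cols] keeps the data frame structure
df[num_cols] <- lapply(df[num_cols], function(x) {
  replace(x, is.na(x), median(x, na.rm = TRUE))
})

print(df)
```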
thelatemail