12

How can I quickly replace an NA with the mean of the values in the previous and next rows?

  name grade
1    A    56
2    B    NA
3    C    70
4    D    96

such that B's grade would be 63.

Jilber Urbina
sia
  • What if the adjacent value is missing as well? Maybe try [this approach](http://stackoverflow.com/questions/22736316/r-missing-value-replacement-function/22736656#22736656)? – Robert Krzyzanowski Apr 07 '14 at 15:26

3 Answers

18

You could try na.approx from the zoo package. Its documentation says: "Missing values (NAs) are replaced by linear interpolation"

library(zoo)
x <- c(56, NA, 70, 96)
na.approx(x)
# [1] 56 63 70 96

This also works if you have more than one consecutive NA:

vals <- c(1, NA, NA, 7, NA, 10)
na.approx(vals) 
# [1]  1.0  3.0  5.0  7.0  8.5 10.0

na.approx is based on the base function approx, which may be used instead:

vals <- c(1, NA, NA, 7, NA, 10)
xout <- seq_along(vals)
x <- xout[!is.na(vals)]
y <- vals[!is.na(vals)]

approx(x = x, y = y, xout = xout)$y
# [1]  1.0  3.0  5.0  7.0  8.5 10.0
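
To apply the same idea directly to the data frame column from the question without loading zoo, the approx call above can be wrapped in a small function. This is only a sketch: fill_linear is a hypothetical helper name (it is not part of base R or zoo), and it assumes that NAs at the very start or end of the vector should stay NA, which is what approx does with its default rule = 1.

```r
# Hypothetical helper (not part of base R or zoo): interpolate interior NAs
# of a numeric vector with base approx(); leading/trailing NAs stay NA.
fill_linear <- function(v) {
  idx <- seq_along(v)
  ok <- !is.na(v)
  if (sum(ok) < 2) return(v)  # approx() needs at least two known points
  approx(x = idx[ok], y = v[ok], xout = idx, rule = 1)$y
}

df <- data.frame(name = c("A", "B", "C", "D"), grade = c(56, NA, 70, 96))
df$grade <- fill_linear(df$grade)
df$grade
# [1] 56 63 70 96
```

For a single NA between two known values, linear interpolation and the mean of the two neighbours give the same result; for runs of consecutive NAs the values are spaced evenly between the surrounding known values.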
Henrik
11

Assume you have a data.frame df like this:

> df
  name grade
1    A    56
2    B    NA
3    C    70
4    D    96
5    E    NA
6    F    95

Then you can use the following:

> ind <- which(is.na(df$grade))
> df$grade[ind] <- sapply(ind, function(i) with(df, mean(c(grade[i-1], grade[i+1]))))
> df
  name grade
1    A  56.0
2    B  63.0
3    C  70.0
4    D  96.0
5    E  95.5
6    F  95.0
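
Note that the index arithmetic above assumes every NA has a row before and after it. A vectorised variant that skips NAs in the first or last position might look like the sketch below; fill_neighbor_mean is a hypothetical name, and the choice to leave edge NAs untouched is an assumption rather than something the question specifies. If a neighbour is itself NA, the result for that position stays NA.

```r
# Hypothetical helper: replace each interior NA with the mean of its
# immediate neighbours. NAs in the first or last position are left alone.
fill_neighbor_mean <- function(v) {
  ind <- which(is.na(v))
  ind <- ind[ind > 1 & ind < length(v)]   # keep only interior positions
  v[ind] <- (v[ind - 1] + v[ind + 1]) / 2 # neighbours are read before assignment
  v
}

df <- data.frame(name = LETTERS[1:6], grade = c(56, NA, 70, 96, NA, 95))
df$grade <- fill_neighbor_mean(df$grade)
df$grade
# [1] 56.0 63.0 70.0 96.0 95.5 95.0
```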
Jilber Urbina
  • I used this to do the following: if x meets a condition, replace x and the next 2 values by the mean of x-1 and x+3. For the condition x < -100, this changes the code to: `ind <- which(df$grade<(-100))` and `df$grade[ind:(ind+2)] <- sapply(ind, function(i) with(df, mean(c(grade[i-1], grade[i+3]))))` – Anne Nov 06 '15 at 13:11
  • As an alternative to the `sapply` call, you could also use: `df$grade[ind] <- with(df, ((grade[ind-1] + grade[ind+1])/2))` – Jaap Mar 31 '17 at 13:36
0

An alternative that uses the median instead of the mean is the na.roughfix function from the randomForest package. According to its documentation, it works on a data frame or numeric matrix: for numeric variables, NAs are replaced with column medians; for factor variables, NAs are replaced with the most frequent level (ties broken at random). If the object contains no NAs, it is returned unaltered.

Using the same example as @Henrik:

library(randomForest)
x <- c(56, NA, 70, 96) 
na.roughfix(x)

# [1] 56 70 70 96

or with a larger matrix:

y <- matrix(1:50, nrow = 10)
y[sample(1:length(y), 4, replace = FALSE)] <- NA
y
#       [,1] [,2] [,3] [,4] [,5]
#  [1,]    1   11   21   31   41
#  [2,]    2   12   22   32   42
#  [3,]    3   NA   23   33   NA
#  [4,]    4   14   24   34   44
#  [5,]    5   15   25   35   45
#  [6,]    6   16   NA   36   46
#  [7,]    7   17   27   37   47
#  [8,]    8   18   28   38   48
#  [9,]    9   19   29   39   49
# [10,]   10   20   NA   40   50

na.roughfix(y)
#       [,1] [,2] [,3] [,4] [,5]
#  [1,]    1   11 21.0   31   41
#  [2,]    2   12 22.0   32   42
#  [3,]    3   16 23.0   33   46
#  [4,]    4   14 24.0   34   44
#  [5,]    5   15 25.0   35   45
#  [6,]    6   16 24.5   36   46
#  [7,]    7   17 27.0   37   47
#  [8,]    8   18 28.0   38   48
#  [9,]    9   19 29.0   39   49
# [10,]   10   20 24.5   40   50
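
For readers without randomForest installed, the column-median behaviour described above for numeric data can be approximated in base R. This is a rough sketch and not the package's own code: median_fill is a hypothetical name, it handles only numeric matrices, and the NA positions are fixed to match the printed example rather than sampled.

```r
# Hypothetical base-R approximation of column-median imputation for a
# numeric matrix (numeric columns only; factors are not handled).
median_fill <- function(m) {
  for (j in seq_len(ncol(m))) {
    miss <- is.na(m[, j])
    if (any(miss)) m[miss, j] <- median(m[, j], na.rm = TRUE)
  }
  m
}

y <- matrix(as.numeric(1:50), nrow = 10)
y[c(13, 26, 30, 43)] <- NA   # rows 3, 6, 10, 3 of columns 2, 3, 3, 5
y_filled <- median_fill(y)
y_filled[3, 2]    # 16
y_filled[6, 3]    # 24.5
y_filled[10, 3]   # 24.5
y_filled[3, 5]    # 46
```

Unlike the interpolation and neighbour-mean approaches, this fills every NA in a column with the same value, so for the vector in the question it gives 70 rather than 63.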
Nemesi