I have a dataset with 5 columns where each column is a measure. I'm trying to replace each NA with the mean of the previous and following values in its row, but with no luck so far. I searched and found a solution for replacing NA with the row mean, which can't be applied here. Any help will be appreciated. I hope this thread will help others as well.

  A  B  C  D  E
 NA  7 NA  6  2
  4 NA  2  6 NA
  2  1 NA NA  5

Desired output

  A  B   C  D  E
6.5  7 6.5  6  2
  4  3   2  6  4
  2  1   2  2  5

In the first row, 6.5 comes from the mean of 7 and 6. In the second row, 3 comes from the mean of 4 and 2, and 4 from the mean of 2 and 6. In the third row, the 2 comes from the mean of 1 and 5.

Thank you.

klo
  • What is your desired output? – n8sty Oct 16 '14 at 18:51
  • @n8sty I think it would be correct to replace each NA with the mean of the previous and following values or, if the previous value is missing, with the mean of the two following measures. Output should be: `6.5 7 6.5 6 2` – klo Oct 16 '14 at 18:57
  • Add your desired output to your question and explain how you got to it. Where does the first 6.5 come from? – n8sty Oct 16 '14 at 19:01
  • @n8sty Just added it with an explanation. Hope it is clear now. Thank you. – klo Oct 16 '14 at 19:11
  • It doesn't really make sense how the endpoint `NA` values are calculated, compared to the middle `NA` calculations. What if there are more columns? – Rich Scriven Oct 16 '14 at 19:18
  • @Richard-Scriven yes there will be other columns. No chance to make it? – klo Oct 16 '14 at 19:30
  • @user3872487 I think your only hope is to try to do something with logical conditioning and `zoo::na.locf`. You want the calculation to act one way if you're on the edge of the table/matrix and differently if you're not. – n8sty Oct 16 '14 at 21:48

2 Answers

All the following require this setup:

library(zoo)
m <- matrix(c(NA, 4L, 2L, 7L, NA, 1L, NA, 2L, NA, 6L, 6L, NA, 2L, NA, 5L), 3) # input

1) I think it would be better to extend the values at the ends. With that change, try the following:

tm <- t(m)
mid <- (na.locf(tm) + na.locf(tm, fromLast = TRUE))/2
t(na.fill(mid, c("extend", NA)))

giving:

     [,1] [,2] [,3] [,4] [,5]
[1,]    7    7  6.5    6    2
[2,]    4    3  2.0    6    6
[3,]    2    1  3.0    3    5

By the way, note that the average of 1 and 5 is 3, not 2.
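For anyone who wants to see what (1) computes without zoo, here is a rough base R sketch. The helpers `locf` and `fill_row` are names made up for this illustration, not zoo functions, and the sketch assumes each row has at least one non-NA value.

```r
# Sample matrix from the question (rows are observations, columns are measures).
m <- matrix(c(NA, 4, 2, 7, NA, 1, NA, 2, NA, 6, 6, NA, 2, NA, 5), 3)

# Hypothetical helper: carry the last non-NA value forward; leading NAs stay NA.
locf <- function(x) {
  idx <- cummax(seq_along(x) * !is.na(x))  # position of the latest non-NA, 0 if none yet
  c(NA, x)[idx + 1]
}

# Hypothetical helper: fill one row with the mean of the nearest non-NA
# neighbours, falling back to the single available neighbour at either end.
fill_row <- function(x) {
  f <- locf(x)            # nearest non-NA to the left (or the value itself)
  b <- rev(locf(rev(x)))  # nearest non-NA to the right (or the value itself)
  ifelse(is.na(f), b, ifelse(is.na(b), f, (f + b) / 2))
}

# apply() returns one column per row of m, so transpose back.
res <- t(apply(m, 1, fill_row))
res
```

On the sample matrix this should give the same values as the zoo version in (1).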

2) This is not equivalent but it may be what you really want and is short. Try linear interpolation, extending the end values:

t(na.approx(t(m), rule = 2))

giving:

    [,1] [,2]     [,3]     [,4] [,5]
[1,]    7    7 6.500000 6.000000    2
[2,]    4    3 2.000000 6.000000    6
[3,]    2    1 2.333333 3.666667    5
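As a cross-check, roughly the same interpolation can be sketched with base R's `approx`. The helper name `interp_row` is made up for this illustration, and the sketch assumes every row has at least two non-NA values, which `approx` needs for linear interpolation.

```r
# Sample matrix from the question.
m <- matrix(c(NA, 4, 2, 7, NA, 1, NA, 2, NA, 6, 6, NA, 2, NA, 5), 3)

# Hypothetical helper: linearly interpolate the NAs of one row over the
# column positions; rule = 2 repeats the nearest end value outside the data.
interp_row <- function(x) {
  ok <- !is.na(x)
  approx(which(ok), x[ok], xout = seq_along(x), rule = 2)$y
}

# apply() returns one column per row of m, so transpose back.
res <- t(apply(m, 1, interp_row))
res
```

Here `rule = 2` plays the same role as in the `na.approx` call: positions before the first or after the last observed value take the closest observed value.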

3) Another possibility, again not equivalent, is to replace each NA with the mean of the non-NAs on that row:

t(na.aggregate(t(m)))

giving:

     [,1] [,2]     [,3]     [,4] [,5]
[1,]    5    7 5.000000 6.000000    2
[2,]    4    4 2.000000 6.000000    4
[3,]    2    1 2.666667 2.666667    5

4) This is a mix of (1) and (3). This fills in inner NAs with the mean of the non-NAs to either side and fills in the end NAs with the mean of the non-NAs in that row:

tm <- t(m)
mid <- (na.locf(tm) + na.locf(tm, fromLast = TRUE))/2
ag <- na.aggregate(tm)
t(ifelse(is.na(mid), ag, mid))

giving:

    [,1] [,2] [,3] [,4] [,5]
[1,]    5    7  6.5    6    2
[2,]    4    3  2.0    6    4
[3,]    2    1  3.0    3    5

Update: Added more approaches and some corrections.

G. Grothendieck
  • This is what I was looking for: `t(na.approx(t(m), rule = 2))`. But in my data frame I have an ID variable which is numeric. How can I exclude it from the calculations? – klo Oct 17 '14 at 06:17
  • To exclude the first column, say, from the calculation but just keep it intact try `m[, -1] <- t(na.approx(t(m[, -1]), rule = 2))` . – G. Grothendieck Oct 17 '14 at 11:02
  • I did something like this: `df = as.data.frame(t(na.approx(t(df[6:10]), rule = 2))); df = merge(df, dt, by="row.names")`, but it could be shorter! – klo Oct 17 '14 at 14:49

I think the simplest way to answer my question is to calculate the mean of each row and use it to replace the missing values in that row.

I found the answer in "Find and replace missing values with row mean":

ind <- which(is.na(df), arr.ind=TRUE)
df[ind] <- rowMeans(df[2:6],  na.rm = TRUE)[ind[,1]]
df
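To make this reproducible, here is a small self-contained sketch of the same idea. The data frame below is an assumption for illustration: an `ID` column followed by the five measures from the question, so the measures sit in columns 2 to 6.

```r
# Hypothetical data frame: ID in column 1, the five measures in columns 2:6.
df <- data.frame(ID = 1:3,
                 A = c(NA, 4, 2), B = c(7, NA, 1), C = c(NA, 2, NA),
                 D = c(6, 6, NA), E = c(2, NA, 5))
measures <- 2:6

# Work on the measure columns only, so the ID column never enters the means.
vals <- as.matrix(df[measures])
ind <- which(is.na(vals), arr.ind = TRUE)         # (row, col) position of each NA
vals[ind] <- rowMeans(vals, na.rm = TRUE)[ind[, 1]]  # look up each NA's row mean
df[measures] <- vals
df
```

Restricting the NA search to the measure columns keeps the ID column untouched even if it happens to contain missing values.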
Community
klo