0

i have data like in

DT <- data.frame(id=rep("A",times=10),B=1:10, C=c(NA,2:5,NA,NA,NA,NA,NA))
CT <- DT[,c(2,3)]*3
CT$id <- rep("B",times=10)
DT <- rbind(DT,CT)

I would like to fill in NAs in column C with its previous value plus function of value of other column, column B (ignoring the first NA in col C), for example:

DT$C[6] =DT$C[5]+DT$B[6]*0.3
DT$C[7] =DT$C[6]+DT$B[7]*0.3
DT$C[8] =DT$C[7]+DT$B[8]*0.3

etc.

Also, i need to replicate it by the id value (column id in the table). Example pic with three manually computed values is below. Thanks for suggestions!

enter image description here

nov
  • 157
  • 1
  • 3
  • 15
  • Why is 9 and 10 `NA` ? – akrun Feb 02 '19 at 12:14
  • i posted a picture after i manually computed three values just as an example. Originally i have table that after fifth value is missing, but in my big data it is also for some id sixth, seventh or any other row.. – nov Feb 02 '19 at 12:17
  • I posted a solution. Please check if that is what you wanted – akrun Feb 02 '19 at 12:39

2 Answers2

1

The formula for filling in NA cells is equivalent to filling them in using na.locf plus taking 0.3 times the cumsum of the B values corresponding to the NA values in C.

So first create an expression for a grouping vector associating a unique number to each non-NA and each stretch of consecutive NAs.

rleid(seq_along(C) * !is.na(C)))

For each of those groups compute the vector of cumulative sums of B * is.na(C) which is 0 if the group has one non-NA element and is cumsum if the group consists of consecutive NAs.

This gives the following single statement solution:

library(data.table)
library(zoo)

transform(DT, C = ave(C, id, FUN = na.locf0) + 
      0.3 * ave(B * is.na(C), rleid(seq_along(C) * !is.na(C)), id, FUN = cumsum))

giving:

   id  B    C
1   A  1   NA
2   A  2  2.0
3   A  3  3.0
4   A  4  4.0
5   A  5  5.0
6   A  6  6.8
7   A  7  8.9
8   A  8 11.3
9   A  9 14.0
10  A 10 17.0
11  B  3   NA
12  B  6  6.0
13  B  9  9.0
14  B 12 12.0
15  B 15 15.0
16  B 18 20.4
17  B 21 26.7
18  B 24 33.9
19  B 27 42.0
20  B 30 51.0
G. Grothendieck
  • 254,981
  • 17
  • 203
  • 341
  • Thanks a lot for answering. For some reason i get this error: Error in data.frame(list(id = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, : arguments imply differing number of rows: 20, 0 – nov Feb 02 '19 at 13:21
  • Thanks again, there is still one issue that first NA in id==B is filled in, can this somehow be applied to each level of id separately? – nov Feb 02 '19 at 14:04
  • Great, this is exactly what i was looking for – nov Feb 02 '19 at 14:35
0

Here is one approach with accumulate from purrr

library(data.table)
library(purrr)
setDT(DT)[, C := {i1 <- which(is.na(C))[1] 
    replace(C, (i1-1):.N, accumulate(B[i1:.N], ~ .x + .y * 0.3, .init = C[i1-1]))}, 
      by = id]
#    id  B    C
# 1:  A  1  1.0
# 2:  A  2  2.0
# 3:  A  3  3.0
# 4:  A  4  4.0
# 5:  A  5  5.0
# 6:  A  6  6.8
# 7:  A  7  8.9
# 8:  A  8 11.3
# 9:  A  9 14.0
#10:  A 10 17.0
#11:  B  3  3.0
#12:  B  6  6.0
#13:  B  9  9.0
#14:  B 12 12.0
#15:  B 15 15.0
#16:  B 18 20.4
#17:  B 21 26.7
#18:  B 24 33.9
#19:  B 27 42.0
#20:  B 30 51.0
akrun
  • 874,273
  • 37
  • 540
  • 662
  • Thanks a lot for answer, but i realized this is not working for me because i have also NAs at the beginning of the C column as well, which i want to impute later with different method. If you just replace first line in example that i wrote with this DT <- data.frame(id=rep("A",times=10),B=1:10, C=c(NA,2:5,NA,NA,NA,NA,NA)) and try to run it it will not change anything? Can i somehow ignore this NA at the beginning of C? – nov Feb 02 '19 at 13:05