4

Consider a dataframe like this:

dat <- data.frame(name = c("John", "John", "John", "John", "Sam", "Sam", "Sam", "Sam", "Emily", "Emily", "Emily", "Emily", "Alex", "Alex", "Alex", "Alex"),
                  grade = c(9, NA, NA, 12, 10, NA, NA, NA, 9, 10, 10, 11, NA, 11, 11, NA),
                  year = c(2007, 2008, 2009, 2010, rep(c(2018, 2019, 2019, 2020), 3)))

    name grade year
1   John     9 2007
2   John    NA 2008
3   John    NA 2009
4   John    12 2010
5    Sam    10 2018
6    Sam    NA 2019
7    Sam    NA 2019
8    Sam    NA 2020
9  Emily     9 2018
10 Emily    10 2019
11 Emily    10 2019
12 Emily    11 2020
13  Alex    NA 2018
14  Alex    11 2019
15  Alex    11 2019
16  Alex    NA 2020

I'd like to fill in the missing NAs in grade based on year. So Sam for example is in 10th grade in 2018, and in 2019 he'll be in 11th grade:

dat_desired <- data.frame(name = c("John", "John", "John", "John", "Sam", "Sam", "Sam", "Sam", "Emily", "Emily", "Emily", "Emily", "Alex", "Alex", "Alex", "Alex"),
                  grade = c(9, 10, 11, 12, 10, 11, 11, 12, 9, 10, 10, 11, 10, 11, 11, 12),
                  year = c(2007, 2008, 2009, 2010, rep(c(2018, 2019, 2019, 2020), 3)))

    name grade year
1   John     9 2007
2   John    10 2008
3   John    11 2009
4   John    12 2010
5    Sam    10 2018
6    Sam    11 2019
7    Sam    11 2019
8    Sam    12 2020
9  Emily     9 2018
10 Emily    10 2019
11 Emily    10 2019
12 Emily    11 2020
13  Alex    10 2018
14  Alex    11 2019
15  Alex    11 2019
16  Alex    12 2020

How can I do this? I've tried:

dat %>% 
  group_by(name) %>% 
  fill(grade, .direction = "updown")

which doesn't work because it doesn't take year into account. Adding a group_by(year) doesn't help either, because then no NAs are replaced.

Greg
  • 3,570
  • 5
  • 18
  • 31
  • Maybe it's more complicated than you need, but you could think of this as a linear interpolation problem, like [here](https://stackoverflow.com/q/33696795/5325862) – camille Feb 17 '20 at 15:53
  • Linear interpolation (as ```mutate(gradeInterp = zoo::na.approx(grade, na.rm=FALSE)) ```) can deal well with ```John``` but can't handle the leading and trailing ```NA```s in ```Alex```, or the 2 ```NA```s in a row, plus trailing ```NA``` in ```Sam```. Any thoughts on dealing with those? – Greg Feb 17 '20 at 16:07

2 Answers2

3

Could try:

dat %>%
  group_by(name) %>%
  mutate(
    grade = coalesce(grade, first(grade[!is.na(grade)]) - (first(year[!is.na(grade)]) - year))
  )

Output:

# A tibble: 16 x 3
# Groups:   name [4]
   name  grade  year
   <fct> <dbl> <dbl>
 1 John      9  2007
 2 John     10  2008
 3 John     11  2009
 4 John     12  2010
 5 Sam      10  2018
 6 Sam      11  2019
 7 Sam      11  2019
 8 Sam      12  2020
 9 Emily     9  2007
10 Emily    10  2008
11 Emily    10  2009
12 Emily    11  2010
13 Alex     10  2018
14 Alex     11  2019
15 Alex     11  2019
16 Alex     12  2020
arg0naut91
  • 14,574
  • 2
  • 17
  • 38
1

A rough solution can be implemented with a for loop:

X <- as.character(unique(dat$name))

for ( i in 1:length(X)){
a <- min(dat[dat$name==X[i],]$grade, na.rm = T)
b <- min(dat[dat$name==X[i],]$year, na.rm = T)
dat[dat$name==X[i] & is.na(dat$grade),]$grade <- dat[dat$name==X[i] & is.na(dat$grade),]$year-b+a
}
Greg
  • 3,570
  • 5
  • 18
  • 31