Is there an R function for imputing missing year values, consecutively, by group?

Question

My dataframe looks like:

df <- data.frame(ID=c("A", "A", "A", "A", 
                      "B", "B", "B", "B",
                      "C", "C", "C", "C",
                      "D", "D", "D", "D"),
                 grade=c("KG", "01", "02", "03",
                         "KG", "01", "02", "03",
                         "KG", "01", "02", "03",
                         "KG", "01", "02", "03"),
                 year=c(2002, 2003, NA, 2005,
                        2007, NA, NA, 2010,
                        NA, 2005, 2006, NA,
                        2009, 2010, NA, NA))

I would like to be able to impute the missing year values by ID, with the following desired results:

wanted_df <- data.frame(ID=c("A", "A", "A", "A", 
                             "B", "B", "B", "B",
                             "C", "C", "C", "C",
                             "D", "D", "D", "D"),
                       grade=c("KG", "01", "02", "03",
                               "KG", "01", "02", "03",
                               "KG", "01", "02", "03",
                               "KG", "01", "02", "03"),
                       year=c(2002, 2003, 2004, 2005,
                              2007, 2008, 2009, 2010,
                              2004, 2005, 2006, 2007,
                              2009, 2010, 2011, 2012))

I have attempted to impute the values using:

lag() and lead() functions
Joining to a dataframe consisting of years

Neither have worked. Any help would be greatly appreciated. Thank you.

score 3 · Accepted Answer · answered Dec 08 '21 at 18:30

We may use na_interpolate/na_extrapolate

library(dplyr)
# remotes::install_github("skgrange/threadr")
library(threadr)
df %>% 
   group_by(ID) %>% 
   mutate(year = na_extrapolate(na_interpolate(year))) %>%
   ungroup

-output

# A tibble: 16 × 3
   ID    grade  year
   <chr> <chr> <dbl>
 1 A     KG    2002 
 2 A     01    2003 
 3 A     02    2004 
 4 A     03    2005 
 5 B     KG    2007 
 6 B     01    2008 
 7 B     02    2009 
 8 B     03    2010 
 9 C     KG    2004.
10 C     01    2005 
11 C     02    2006 
12 C     03    2007 
13 D     KG    2009 
14 D     01    2010 
15 D     02    2011 
16 D     03    2012.

Baraliuh · Answer 2 · 2021-12-08T19:06:28.337

Could do something like this:

year_imputer <- function(years){
    # Find one non-missing data-point
    ref_indx <- which(!is.na(years))[1]
    # Make it the reference point
    ref <- years[ref_indx]
    # Get the length of the years
    years_len <- length(years)
    # Generate the sequence
    (ref - (ref_indx - 1)):(ref+(years_len - ref_indx))
}
library(dplyr)
df %>% 
    group_by(ID) %>% 
    mutate(
        year = year_imputer(year)
    ) %>% 
    ungroup()

Output:

# A tibble: 16 x 3
   ID    grade  year
   <chr> <chr> <int>
 1 A     KG     2002
 2 A     01     2003
 3 A     02     2004
 4 A     03     2005
 5 B     KG     2007
 6 B     01     2008
 7 B     02     2009
 8 B     03     2010
 9 C     KG     2004
10 C     01     2005
11 C     02     2006
12 C     03     2007
13 D     KG     2009
14 D     01     2010
15 D     02     2011
16 D     03     2012

TarJae · Answer 3 · 2021-12-08T19:04:51.163

This is not a solution that I would recommend: But this question raised another question, see here Run ifelse x times with one call (quasi iteratively until condition is fulfilled)

library(dplyr)
df %>% 
  group_by(ID) %>% 
  mutate(year= ifelse(is.na(year), lag(year)+1, year),
         year= ifelse(is.na(year), lag(year)+1, year),
         year= ifelse(is.na(year), lead(year)-1, year))

   ID    grade  year
   <chr> <chr> <dbl>
 1 A     KG     2002
 2 A     01     2003
 3 A     02     2004
 4 A     03     2005
 5 B     KG     2007
 6 B     01     2008
 7 B     02     2009
 8 B     03     2010
 9 C     KG     2004
10 C     01     2005
11 C     02     2006
12 C     03     2007
13 D     KG     2009
14 D     01     2010
15 D     02     2011
16 D     03     2012

PaulS · Answer 4 · 2021-12-08T22:36:54.857

Yet another solution:

library(dplyr)

df %>% 
  group_by(ID) %>% 
  mutate(year = c(min(year,na.rm = T) - (which.min(year)-1):0, 
                  min(year,na.rm = T) + 1:(n()-which.min(year)))) %>% ungroup

#> # A tibble: 16 × 3
#>    ID    grade  year
#>    <chr> <chr> <dbl>
#>  1 A     KG     2002
#>  2 A     01     2003
#>  3 A     02     2004
#>  4 A     03     2005
#>  5 B     KG     2007
#>  6 B     01     2008
#>  7 B     02     2009
#>  8 B     03     2010
#>  9 C     KG     2004
#> 10 C     01     2005
#> 11 C     02     2006
#> 12 C     03     2007
#> 13 D     KG     2009
#> 14 D     01     2010
#> 15 D     02     2011
#> 16 D     03     2012

score 0 · Answer 5 · answered Dec 08 '21 at 19:47

You can use any (linear) interpolation function for your specific problem. E.g. from imputeTS package, but also from zoo and others. But this solution probably wouldn't work any more, when you are dealing with POSIXct data types instead of just numbers. Also worth noting, that this only works as long as every missing yeear is actually inserted as NA values (instead of just left out). For this case (missing years just left out) called implicit missing values the tsibble package has a function called fill_gaps().

library("imputeTS")
library("dplyr")

df <- data.frame(ID=c("A", "A", "A", "A", 
                      "B", "B", "B", "B",
                      "C", "C", "C", "C",
                      "D", "D", "D", "D"),
                 grade=c("KG", "01", "02", "03",
                         "KG", "01", "02", "03",
                         "KG", "01", "02", "03",
                         "KG", "01", "02", "03"),
                 year=c(2002, 2003, NA, 2005,
                        2007, NA, NA, 2010,
                        NA, 2005, 2006, NA,
                        2009, 2010, NA, NA))

df %>% group_by(ID) %>% mutate(year = na_interpolation(year)) %>% ungroup

Is there an R function for imputing missing year values, consecutively, by group?

5 Answers5

Linked