1

I have a dataframe with a column about time and this column contains some NA. I would like to fill these cells with the year before + 1 (if the missing cell is not the beginning of the serie). Here's a reproducible example:

df <- data.frame(x = c("A", "B", "C", "A", "B", "C"),
                 y = c(2000, NA, NA, 2000, 2001, 2002))

I tried to follow this post

df <- df %>%
  complete(y = seq(min(y), max(y), by = "year"))

but I can't find out how to do so. Any idea?

Edit: expected output:

df <- data.frame(x = c("A", "B", "C", "A", "B", "C"),
                 y = c(2000, 2001, 2002, 2000, 2001, 2002))

Note: I would prefer a dplyr solution.

Note 2 (October 23rd 2019): The three answers so far are good but quite complicated. I'm really surprised that it is not possible to do that simply (for example, having the possibility to add a lag in the fill function would be really useful I think).

bretauv
  • 7,756
  • 2
  • 20
  • 57
  • Hi, Could you post the desired output too? – Stephen Henderson Oct 23 '19 at 13:59
  • sure, I added it – bretauv Oct 23 '19 at 14:00
  • 1
    it's not clear what you want to happen when say the first value of y is an `NA`? Or are you saying that cannot happen? – Stephen Henderson Oct 23 '19 at 14:24
  • Will you have one such gap or many scuh gaps? – s_baldur Oct 23 '19 at 15:08
  • when they are NA some where, there are on the full serie. Here, I put 2000 to have a point from where I could fill the other NA but originally, the dataframe was of the form: ```df <- data.frame(x = c("A", "B", "C", "A", "B", "C"), y = c(NA, NA, NA, 2000, 2001, 2002))``` – bretauv Oct 23 '19 at 15:11
  • so if you have a solution to fill for all the NA, I will take it but I thought it would be easier to find a solution if I manually changed the starting point of the serie and to fill the other ```NA``` from this point – bretauv Oct 23 '19 at 15:12

3 Answers3

1

In base you can use ave in combination with cumsum to split your dataset and apply there seq, as you have tried already.

df$y <- ave(df$y, cumsum(!is.na(df$y)), FUN=function(x)
    seq(x[1], length.out = length(x)))
identical(df, dfExpected)
#[1] TRUE
df$y
#[1] 2000 2001 2002 2000 2001 2002

In case it starts with NA and you want then to let it start with 2000 you can use replace:

df2$y <-ave(df2$y, cumsum(!is.na(df2$y)), FUN=function(x) 
   seq(replace(x[1],is.na(x[1]),2000), length.out = length(x)))
identical(df2, dfExpected)
#[1] TRUE

Data:

df <- data.frame(x = c("A", "B", "C", "A", "B", "C"),
                 y = c(2000, NA, NA, 2000, 2001, 2002))
dfExpected <- data.frame(x = c("A", "B", "C", "A", "B", "C"),
                 y = c(2000, 2001, 2002, 2000, 2001, 2002))
df2 <- data.frame(x = c("A", "B", "C", "A", "B", "C"),
                 y = c(NA, NA, NA, 2000, 2001, 2002))

GKi
  • 37,245
  • 2
  • 26
  • 48
  • that works, but I would prefer a ```dplyr``` solution (I would validate your answer later if there are no more answers proposed) – bretauv Oct 23 '19 at 14:14
1

This solution is a bit annoying but completely vectorized in dplyr. I doubled your df into a new df2 to try across a couple gapped occurrences.

library(tidyr)
library(dplyr)

df <- data.frame(x = c("A", "B", "C", "A", "B", "C"),
                 y = c(2000, NA, NA, 2000, 2001, 2002))

df2 <- bind_rows(df, df) 

Basically you need to create groups across the blocks with NA. Then you can calculate a within-group cumsum and use fill to drag down the prior value. It is annoying because of all the lines.

df2 %>%
  group_by(grp = cumsum(!is.na(y) & lag(is.na(y), default = FALSE))) %>%
  mutate(add_year = cumsum(is.na(y))) %>%
  fill(y) %>%
  mutate(y = y + add_year) %>%
  ungroup() %>%
  select(-grp, -add_year)
  • there must be an error in your Edit code, the output for ```df2$y``` is: ```[1] 2000 2001 2002 2000 2000 2000 2000 2001 2002 2002 2002 2002``` – bretauv Oct 23 '19 at 17:16
  • Ah yeah I think there might be. I tagged that on fast. I'll just clear the edit code. –  Oct 23 '19 at 17:20
0

This uses dplyr functions case_when() and lag combined with a while-loop in a custom-function.

Output is as expected, try it out.

library(dplyr)
lag_years <- function(df){
  while (anyNA(df$y))
    {
    df %>%
      mutate(y = case_when(is.na(y)&!is.na(lag(y))~lag(y)+1,TRUE~y)) %>%
      {.} -> df
  }
  return(df)
}

lag_years(df) %>%
head()
Fnguyen
  • 1,159
  • 10
  • 23
  • your code works, but when I apply it to a dataframe of several thousands of lines, it takes way too long (from my point of view) – bretauv Oct 23 '19 at 17:06
  • @bretauv I feared as much, the while-loop really is brute-force. There might be a way to optimize the solution but I fear GKi's way might be better for now. – Fnguyen Oct 23 '19 at 18:55