Interpolating missing data in a dataframe with R

Question

I have a dataframe which is similar to the one below:

Country Ccode Year Happiness Power   
1  France    FR 2000      1000  1000  
2  France    FR 2001        NA    NA
3  France    FR 2002        NA    NA
4  France    FR 2003      1600  2200
5  France    FR 2004        NA    NA
6      UK    UK 2000      1000  1000  
7      UK    UK 2001        NA    NA
8      UK    UK 2002      1000  1000  
9      UK    UK 2003      1000  1000
10     UK    UK 2004      1000  1000

I have previously used the following code to get the differences:

df <- df %>%
  arrange(country, year) %>%  #sort data
  group_by(country) %>%
  mutate_if(is.numeric, funs(d = . - lag(.)))

I would like expand on this code by calculating the difference between the data points of Happiness and Power, divide it by the difference in years between the data points and calculate the values to replace the NA's with, resulting in the following output.

Country Ccode Year Happiness Power   
1  France    FR 2000      1000  1000  
2  France    FR 2001      1200  1400    
3  France    FR 2002      1400  1800
4  France    FR 2003      1600  2200
5  France    FR 2004        NA    NA
6      UK    UK 2000      1000  1000  
7      UK    UK 2001        0      0
8      UK    UK 2002      1000  1000  
9      UK    UK 2003      1000  1000
10     UK    UK 2004      1000  1000

What would be an efficient way of carrying out this task?

EDIT: Please note that also France 2004 is NA. The extend function does seem to properly deal with such a situation.

EDIT 2: Adding the group_by(country) seems to mess things up for unknown reasons:It seems that the code is trying to convert a character to a numeric, although I do not really understand why. When I convert the column to character, the error becomes an evaluation error. Any suggestions?

> TRcomplete<-TRcomplete%>%
+     group_by(country) %>%
+     mutate_at(70:73,~na.fill(.x,"extend"))
Error in mutate_impl(.data, dots) : 
  Column `F116.s` can't be converted from character to numeric
> TRcomplete$F116.s <- as.numeric(TRcomplete$F116.s)
> TRcomplete<-TRcomplete%>%
+     group_by(country) %>%
+     mutate_at(70:73,~na.fill(.x,"extend"))
Error in mutate_impl(.data, dots) : 
  Column `F116.s` can't be converted from character to numeric
> TRcomplete$F116.s <- as.numeric(as.character(TRcomplete$F116.s))
> TRcomplete<-TRcomplete%>%
+     group_by(country) %>%
+     mutate_at(70:73,~na.fill(.x,"extend"))
Error in mutate_impl(.data, dots) : 
  Column `F116.s` can't be converted from character to numeric
> TRcomplete$F116.s <- as.character(TRcomplete$F116.s))
Error: unexpected ')' in "TRcomplete$F116.s <- as.character(TRcomplete$F116.s))"
> TRcomplete$F116.s <- as.character(TRcomplete$F116.s)
> str(TRcomplete$F116.s)
 chr [1:6984] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA ...
> TRcomplete<-TRcomplete%>%
+     group_by(country) %>%
+     mutate_at(70:73,~na.fill(.x,"extend"))
Error in mutate_impl(.data, dots) : 
  Evaluation error: need at least two non-NA values to interpolate.

You sure year 2003 for France is 1400? since 2001 to 2003 is 2 year and 2003 to 2004 is 1 one year only — BENY, Aug 14 '18 at 15:33

Onyambu · Accepted Answer · 2018-08-15T16:18:04.760

4

You can use na.fill with fill="extend" from the zoo library

rapply(df, zoo::na.fill,"integer",fill="extend",how="replace")
  Country Ccode Year Happiness Power
1  France    FR 2000      1000  1000
2  France    FR 2001      1200  1400
3  France    FR 2003      1400  1800
4  France    FR 2004      1600  2200
5      UK    UK 2000      1000  1000
6      UK    UK 2001      1000  1000
7      UK    UK 2003      1000  1000
8      UK    UK 2004      1000  1000

EDIT:

library(tidyverse)
library(zoo)
df%>%
  group_by(Country)%>%
  mutate_at(4:5,~na.fill(.x,"extend"))

  Country Ccode Year Happiness Power
1  France    FR 2000      1000  1000
2  France    FR 2001      1200  1400
3  France    FR 2003      1400  1800
4  France    FR 2004      1600  2200
5      UK    UK 2000      1000  1000
6      UK    UK 2001      1000  1000
7      UK    UK 2003      1000  1000
8      UK    UK 2004      1000  1000

If all the elements in the group are NA then:

df%>% 
  group_by(Country)%>% 
  mutate_if(is.numeric,~if(all(is.na(.x))) NA else na.fill(.x,"extend"))

edited Aug 15 '18 at 16:18

answered Aug 14 '18 at 15:24

Onyambu

67,392
3
24
53

Thank you for your help! That library makes it a lot easier. Would it be possible to make the solution conditional on whether the country is still the same and/or the year is still the next one?. The code currently also extends when the first and last row of a country are NA, which is something I should prevent at all cost. – Tom Aug 14 '18 at 15:42
@TomKisters Just group by country – Onyambu Aug 14 '18 at 15:58
Thanks a lot! I was trying to run the code again with the addition, but this time it's giving me a weird error: `Error in mutate_impl(.data, dots) : Column can't be converted from character to numeric` – Tom Aug 14 '18 at 16:11
@TomKisters Just run the specified columns. ie only the numeric columns. That is column four and column five. you can use `df%>% group_by(Country)%>% mutate_if(is.numeric,~na.fill(.x,"extend"))` – Onyambu Aug 14 '18 at 16:23
Thank you so much for the follow up. I'll mess with it a bit tomorrow. The weird thing is that the error is referring to a numerical column (in addition, like I said, the code (without the group by country) ran perfectly the first time. I'm going to look at it with a fresh head in the morning. Thanks a lot! – Tom Aug 14 '18 at 16:28
I have not really been able to figure out what is going wrong. I have added some console output. – Tom Aug 15 '18 at 08:39

Interpolating missing data in a dataframe with R

1 Answers1

EDIT: