1

I have a panel dataset with 10 variables for 60 countries, across 18 years (2000-2017), and I have a lot of missing data.

Country Year    Broadband

Albania 2000    NA
Albania 2001    NA
Albania 2002    NA
Albania 2003    NA
Albania 2004    NA
Albania 2005    272
Albania 2006    NA
Albania 2007    10000
Albania 2008    64000
Albania 2009    92000
Albania 2010    105539
Albania 2011    128210
Albania 2012    160088
Albania 2013    182556
Albania 2014    207931
Albania 2015    242870
Albania 2016    263874
Albania 2017    NA
Algeria 2000    NA
Algeria 2001    NA
Algeria 2002    NA
Algeria 2003    18000
Algeria 2004    36000

I would like to interpolate using the na.approx function in R (and extrapolate using rule = 2), but only within each country. In this sample dataset for example, I want to interpolate the value for Albania 2006, and extrapolate for Albania 2000-2004 and 2017. But I want to make sure that the value for Albania 2017 isn't interpolated using Albania 2016 and Algeria 2003. For Algeria 2000-2002, I want the values to be extrapolated using the data for Algeria 2003 and 2004. I have tried the following code:

data <- group_by(data, country)
data$broadband <- na.approx(data$broadband, maxgap = Inf, rule = 2)
data <- as.data.frame(data)

and have tried different values for maxgap, but none seem to fix my problem. I assumed by using the group_by function it would work correctly but it doesn't. Does anyone know of any solutions?

EDIT: The only way I have thought of to do what I need is to split the dataset into a separate dataset for each unique country, using the following code:

mylist <- split(data, data$country)

alb <- mylist[1]
alb <- as_data_frame(alb)
alg <- mylist[2]
alg <- as_data_frame(alg)
ang <- mylist[3]
ang <- as_data_frame(ang)

and then use the na.approx function on the separate datasets one at a time.

EDIT 2:

I have tried the solution suggested by Markus below, and it doesn't seem to work. This is the result using your suggested coded for values for Angola:

Country Year    Broadband   Broadband_imp

Algeria 2014    1599692 1599692
Algeria 2015    2269348 2269348
Algeria 2016    2858906 2858906
Angola  2000    NA  2451556.286
Angola  2001    NA  2044206.571
Angola  2002    NA  1636856.857
Angola  2003    NA  1229507.143
Angola  2004    NA  822157.429
Angola  2005    NA  414807.714
Angola  2006    7458    7458
Angola  2007    11700   11700

As you can see, the imputed values for Angola 2000-2005 seem to have been calculated using values from Algeria, as the imputed values are much higher than they should be given the Angola 2006 value of 7458.

EDIT 3: This is the full code I have used -

data <- read_excel("~/Documents/data.xlsx")

> dput(head(data))
structure(list(continent = c("Europe", "Europe", "Europe", "Europe", 
"Europe", "Europe"), country = c("Albania", "Albania", "Albania", 
"Albania", "Albania", "Albania"), Year = c(2000, 2001, 2002, 
2003, 2004, 2005), `Individuals Using Internet, %, WB` = c(0.114097347, 
0.325798377, 0.390081273, 0.971900415, 2.420387798, 6.043890864
), `Secure Internet Servers, WB` = c(NA, 1, NA, 1, 2, 1), `Mobile Cellular 
Subscriptions, WB` = c(29791, 
392650, 851000, 1100000, 1259590, 1530244), `Fixed Broadband Subscriptions, 
WB` = c(NA, 
NA, NA, NA, NA, 272), `Trade, % GDP, WB` = c(55.9204287230026, 
57.4303612453301, 63.9342407411882, 65.4406219482911, 66.3578254370479, 
70.2953012017195), `Air transport, freight (million ton-km)` = c(0.003, 
0.003, 0.144, 0.088, 0.099, 0.1), `Air Transport, registered carrier 
departures worldwide, WB` = c(3885, 
3974, 3762, 3800, 4104, 4309), `FDI, net, inflows, % GDP, WB` = 
c(3.93717707227928, 
5.10495722596557, 3.04391445388559, 3.09793068135411, 4.66563777108359, 
3.21722676118428), `Number of Airports, WFB` = c(10, 11, 11, 
11, 11, 11), `Currently under EU Arms Sanctions` = c(0, 0, 0, 
0, 0, 0), `Currently under EU Economic Sanctions` = c(0, 0, 0, 
0, 0, 0), `Currently under UN Arms Sanctions` = c(0, 0, 0, 0, 
0, 0), `Currently under UN Economic Sanctions` = c(0, 0, 0, 0, 
0, 0), `Currently under US Arms Embargo` = c(0, 0, 0, 0, 0, 0
), `Currently under US Economic Sanctions` = c(0, 0, 0, 0, 0, 
0)), .Names = c("continent", "country", "Year", "Individuals Using Internet, 
%, WB", 
"Secure Internet Servers, WB", "Mobile Cellular Subscriptions, WB", 
"Fixed Broadband Subscriptions, WB", "Trade, % GDP, WB", "Air transport, 
freight (million ton-km)", 
"Air Transport, registered carrier departures worldwide, WB", 
"FDI, net, inflows, % GDP, WB", "Number of Airports, WFB", "Currently under EU 
 Arms Sanctions", 
"Currently under EU Economic Sanctions", "Currently under UN Arms Sanctions", 
"Currently under UN Economic Sanctions", "Currently under US Arms Embargo", 
"Currently under US Economic Sanctions"), row.names = c(NA, -6L
), class = c("tbl_df", "tbl", "data.frame"))

 data_imputed <- data %>% 
group_by(country) %>% 
mutate(broadband_imp = na.approx(broadband, maxgap=Inf, rule = 2))
Lee Tagziria
  • 53
  • 2
  • 10
  • did you solve the problem? Strangely the output of `dput` does not give what it should. A few `)` are missing or on the wrong place. `Names` contains only four entries (and a missing `)`) though you have seven variables in your data. If you send your entire data set I will see what I can do. Best, Markus – markus Oct 28 '17 at 08:52
  • Hi. My dataset is massive so I don't really know how to send it over! I'll edit the question so you can see the full code of what I have done so far, with the dput again. Thanks so much for you help! – Lee Tagziria Oct 29 '17 at 19:18
  • I am running out of ideas but it might be a package conflict. Replace `mutate` by `dplyr::mutate` in your code. If it works, I'd edit my answer in this regard. Best. – markus Oct 30 '17 at 10:16
  • Alas, it makes no difference. I would like to thank you very much for your help, but I fear this may be a lost cause! – Lee Tagziria Oct 30 '17 at 15:42
  • 1
    I can not reproduce your results other than using `plyr::mutate` instead of `dplyr::mutate`. If you'd edit your answer and include the output of `dput(select(data, country, broadband))` I could to the interpolation for you and include the results in my answer?! – markus Nov 01 '17 at 15:10

1 Answers1

3

You can use group_by and mutate:

library(tidyverse)
library(zoo)

df_imputed <- df %>% 
group_by(Country) %>% 
mutate(Broadband_imputed = na.approx(Broadband, maxgap = Inf, rule = 2))

Which gives

> head(df_imputed)
# A tibble: 6 x 4
# Groups:   Country [1]
  Country  Year Broadband Broadband_imputed
   <fctr> <int>     <int>             <dbl>
1 Albania  2000        NA               272
2 Albania  2001        NA               272
3 Albania  2002        NA               272
4 Albania  2003        NA               272
5 Albania  2004        NA               272
6 Albania  2005       272               272

and

> df_imputed %>% filter(Country == 'Algeria')
# A tibble: 5 x 4
# Groups:   Country [1]
  Country  Year Broadband Broadband_imputed
   <fctr> <int>     <int>             <dbl>
1 Algeria  2000        NA             18000
2 Algeria  2001        NA             18000
3 Algeria  2002        NA             18000
4 Algeria  2003     18000             18000
5 Algeria  2004     36000             36000

DATA

df <- read.table(text = "Country Year    Broadband
Albania 2000    NA
Albania 2001    NA
Albania 2002    NA
Albania 2003    NA
Albania 2004    NA
Albania 2005    272
Albania 2006    NA
Albania 2007    10000
Albania 2008    64000
Albania 2009    92000
Albania 2010    105539
Albania 2011    128210
Albania 2012    160088
Albania 2013    182556
Albania 2014    207931
Albania 2015    242870
Albania 2016    263874
Albania 2017    NA
Algeria 2000    NA
Algeria 2001    NA
Algeria 2002    NA
Algeria 2003    18000
Algeria 2004    36000", header = TRUE)
markus
  • 25,843
  • 5
  • 39
  • 58
  • Unfortunately I don't think this has worked properly. I think the problem occurs when there are countries that have data for e.g. 2004-2017 but no data for 2000-2003, and therefore no interpolation can be calculated between values within the same country. Therefore, it is forced to use values from adjacent countries for interpolation. I'm not sure this is correct as I'm not an expert on how the na.approx function works, but my understanding is that it is predominantly designed for interpolation, with an option for extrapolation. Is there a function that is specifically for extrapolation? – Lee Tagziria Oct 27 '17 at 13:21
  • The function carries the first observation backwards for each country, no? – markus Oct 27 '17 at 13:26
  • I have edited to question to respond in more depth to your question – Lee Tagziria Oct 27 '17 at 13:40
  • Strange. When I include Angola I get a different output. Can you include the output of `dput(head()` in the question? (Where `` is replaced with the name of your data frame, of course.) – markus Oct 27 '17 at 13:48