I am trying to extrapolate the missing values (NAs) in my data, but my code is not working.

My Data:

   landkreis  jahr deDomains
   <chr>     <dbl>     <dbl>
 1 Ahrweile…  2007        NA
 2 Ahrweile…  2008        NA
 3 Ahrweile…  2009        NA
 4 Ahrweile…  2010        NA
 5 Ahrweile…  2011        NA
 6 Ahrweile…  2012        NA
 7 Ahrweile…  2013     22224
 8 Ahrweile…  2014     22460
 9 Ahrweile…  2015      2379
10 Ahrweile…  2016     22769
11 Ahrweile…  2017     23268
12 Aichach-…  2007        NA
13 Aichach-…  2008        NA
14 Aichach-…  2009        NA
15 Aichach-…  2010        NA
16 Aichach-…  2011        NA
17 Aichach-…  2012        NA
18 Aichach-…  2013     21341
19 Aichach-…  2014     21393
20 Aichach-…  2015     21338

I am trying to extrapolate the NAs in the deDomains variable with the following code, but it doesn't work:

df_complete <- df_complete %>%
  group_by(landkreis) %>%
  mutate(deDomains = approxExtrap(which(!is.na(deDomains)),
                                  deDomains[!is.na(deDomains)])$y)

I am using the approxExtrap() function from the Hmisc package for linear extrapolation.


1 Answer

You need to specify xout, the points at which values should be returned. NAs in y are handled by the function, so you don't have to filter them out yourself. The approx function has some examples (for interpolation, but the usage is similar); see ?approx.

library(dplyr)
library(Hmisc)
df_complete %>% 
  group_by(landkreis) %>%
  mutate(`deDomains`=approxExtrap(x=jahr, y=deDomains, xout=jahr)$y)
# # A tibble: 20 x 3
# # Groups:   landkreis [2]
#    landkreis  jahr deDomains
#    <fct>     <int>     <dbl>
#  1 Ahrweile…  2007     22224
#  2 Ahrweile…  2008     22224
#  3 Ahrweile…  2009     22224
#  4 Ahrweile…  2010     22224
#  5 Ahrweile…  2011     22224
#  6 Ahrweile…  2012     22224
#  7 Ahrweile…  2013     22224
#  8 Ahrweile…  2014     22460
#  9 Ahrweile…  2015      2379
# 10 Ahrweile…  2016     22769
# 11 Ahrweile…  2017     23268
# 12 Aichach-…  2007     21341
# 13 Aichach-…  2008     21341
# 14 Aichach-…  2009     21341
# 15 Aichach-…  2010     21341
# 16 Aichach-…  2011     21341
# 17 Aichach-…  2012     21341
# 18 Aichach-…  2013     21341
# 19 Aichach-…  2014     21393
# 20 Aichach-…  2015     21338

Or using by:

library(Hmisc)
do.call(rbind, by(df_complete, df_complete$landkreis, function(x) {
  transform(x, 
            deDomains=approxExtrap(x=x$jahr, y=x$deDomains, xout=x$jahr)$y
            )
  }))
#              landkreis jahr deDomains
# Ahrweile….1  Ahrweile… 2007     22224
# Ahrweile….2  Ahrweile… 2008     22224
# Ahrweile….3  Ahrweile… 2009     22224
# Ahrweile….4  Ahrweile… 2010     22224
# Ahrweile….5  Ahrweile… 2011     22224
# Ahrweile….6  Ahrweile… 2012     22224
# Ahrweile….7  Ahrweile… 2013     22224
# Ahrweile….8  Ahrweile… 2014     22460
# Ahrweile….9  Ahrweile… 2015      2379
# Ahrweile….10 Ahrweile… 2016     22769
# Ahrweile….11 Ahrweile… 2017     23268
# Aichach-….12 Aichach-… 2007     21341
# Aichach-….13 Aichach-… 2008     21341
# Aichach-….14 Aichach-… 2009     21341
# Aichach-….15 Aichach-… 2010     21341
# Aichach-….16 Aichach-… 2011     21341
# Aichach-….17 Aichach-… 2012     21341
# Aichach-….18 Aichach-… 2013     21341
# Aichach-….19 Aichach-… 2014     21393
# Aichach-….20 Aichach-… 2015     21338
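
Note that in the output above the leading NAs of each group are all filled with the first observed value rather than continued along a line. If a straight-line continuation through the two nearest observations is wanted, one possible base-R sketch follows. `extrap_linear` is a hypothetical helper (not part of Hmisc), and the group labels "A" and "B" are placeholders for the truncated landkreis names. Recent Hmisc versions also document an `na.rm` argument for approxExtrap that may achieve the same thing; check ?approxExtrap for your installed version.

```r
# Hypothetical helper, not part of Hmisc: linear interpolation inside the
# observed range, straight-line extension of the end segments outside it.
extrap_linear <- function(x, y, xout = x) {
  ok <- !is.na(x) & !is.na(y)          # keep only observed pairs
  xs <- x[ok]
  ys <- y[ok]
  o  <- order(xs)
  xs <- xs[o]
  ys <- ys[o]
  n  <- length(xs)
  # Fewer than two observations: no slope can be computed
  if (n < 2) return(rep(if (n == 1) ys else NA_real_, length(xout)))
  out <- approx(xs, ys, xout = xout, rule = 2)$y
  lo <- xout < xs[1]                   # points left of the observed range
  hi <- xout > xs[n]                   # points right of the observed range
  out[lo] <- ys[1] + (ys[2] - ys[1]) / (xs[2] - xs[1]) * (xout[lo] - xs[1])
  out[hi] <- ys[n] + (ys[n] - ys[n - 1]) / (xs[n] - xs[n - 1]) * (xout[hi] - xs[n])
  out
}

# Same values as df_complete, with placeholder group labels
df <- data.frame(
  landkreis = rep(c("A", "B"), times = c(11, 9)),
  jahr      = c(2007:2017, 2007:2015),
  deDomains = c(rep(NA, 6), 22224, 22460, 2379, 22769, 23268,
                rep(NA, 6), 21341, 21393, 21338),
  stringsAsFactors = FALSE
)

# Apply the helper group by group
df$deDomains.lin <- NA_real_
for (g in unique(df$landkreis)) {
  i <- df$landkreis == g
  df$deDomains.lin[i] <- extrap_linear(df$jahr[i], df$deDomains[i])
}
df
```

The helper only uses the two observations nearest to each end, so a single noisy value there (such as an outlier in the first observed years) determines the whole extrapolated stretch.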

Edit: To extrapolate using a "trend" you may use e.g. na_kalman from the imputeTS package.

library(imputeTS)
res <- do.call(rbind, by(df_complete, df_complete$landkreis, function(x) {
  transform(x, 
            deDomains.ex=na_kalman(x$deDomains, model = "StructTS", smooth = TRUE)
            )
  }))
#              landkreis jahr deDomains deDomains.ex
# Ahrweile….1  Ahrweile… 2007        NA     21532.16
# Ahrweile….2  Ahrweile… 2008        NA     21186.24
# Ahrweile….3  Ahrweile… 2009        NA     20840.32
# Ahrweile….4  Ahrweile… 2010        NA     20494.40
# Ahrweile….5  Ahrweile… 2011        NA     20148.48
# Ahrweile….6  Ahrweile… 2012        NA     19802.56
# Ahrweile….7  Ahrweile… 2013     22224     22224.00
# Ahrweile….8  Ahrweile… 2014     22460     22460.00
# Ahrweile….9  Ahrweile… 2015      2379      2379.00
# Ahrweile….10 Ahrweile… 2016     22769     22769.00
# Ahrweile….11 Ahrweile… 2017     23268     23268.00
# Aichach-….12 Aichach-… 2007        NA     21344.52
# Aichach-….13 Aichach-… 2008        NA     21346.28
# Aichach-….14 Aichach-… 2009        NA     21348.04
# Aichach-….15 Aichach-… 2010        NA     21349.80
# Aichach-….16 Aichach-… 2011        NA     21351.55
# Aichach-….17 Aichach-… 2012        NA     21353.31
# Aichach-….18 Aichach-… 2013     21341     21341.00
# Aichach-….19 Aichach-… 2014     21393     21393.00
# Aichach-….20 Aichach-… 2015     21338     21338.00

There might be better data for demonstration, but anyway let's look at a plot:

plot(deDomains ~ jahr, type="n", data=res)
lk <- unique(res$landkreis)
# iterate over the groups (not over all rows); invisible() suppresses the printed return value
invisible(sapply(seq_along(lk), function(x) 
  with(res[res$landkreis == lk[x], ], 
       {lines(jahr, deDomains.ex, col=x + 1)
         points(jahr, deDomains, col=x + 1)})))
legend("bottomleft", legend=c(as.character(lk), "true points"), 
       col=c(2, 3, 1), lty=c(1, 1, NA), pch=c(NA, NA, 1))

[Image: plot of deDomains against jahr, with the extrapolated series drawn as lines and the observed values as points]

You could also look into the imputeTS::na_seadec function, which offers other algorithms besides Kalman and can also detect the frequency of the series.
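
If no extra package is wanted, a rough trend can also be obtained by fitting a straight line to the observed values of one group and predicting the missing years. This is only a sketch of a different technique (ordinary least squares via `lm`), not what na_kalman does; `fill_trend` is a hypothetical helper name, and the values are those of the second landkreis in the data.

```r
# Hypothetical helper: fill NAs in y with predictions from a linear fit y ~ x
fill_trend <- function(x, y) {
  ok <- !is.na(y)
  # Nothing to fill, or too few observations to fit a line
  if (all(ok) || sum(ok) < 2) return(y)
  fit <- lm(y ~ x, data = data.frame(x = x[ok], y = y[ok]))
  y[!ok] <- predict(fit, newdata = data.frame(x = x[!ok]))
  y
}

jahr      <- 2007:2015
deDomains <- c(NA, NA, NA, NA, NA, NA, 21341, 21393, 21338)

filled <- fill_trend(jahr, deDomains)
filled
```

Observed values are left untouched and only the NAs are replaced. With as few as three observations per group, the fitted slope is very sensitive to each single point, so the result should be read with caution.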


Data:

df_complete <- structure(list(landkreis = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("Ahrweile…", 
"Aichach-…"), class = "factor"), jahr = c(2007L, 2008L, 2009L, 
2010L, 2011L, 2012L, 2013L, 2014L, 2015L, 2016L, 2017L, 2007L, 
2008L, 2009L, 2010L, 2011L, 2012L, 2013L, 2014L, 2015L), deDomains = c(NA, 
NA, NA, NA, NA, NA, 22224L, 22460L, 2379L, 22769L, 23268L, NA, 
NA, NA, NA, NA, NA, 21341L, 21393L, 21338L)), class = "data.frame", row.names = c("1", 
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", 
"14", "15", "16", "17", "18", "19", "20"))
jay.sf
  • Thank you a lot. But the NAs are always replaced with the same value. I was looking for a solution where there is a "trend", meaning that the several NAs are not replaced by only one value. Do you know a command for that? – Laurenz Hamel Jan 21 '20 at 12:25
  • @LaurenzHamel Yes I do, please see edit! You could also look into the `imputeTS::na_seadec` function where--among kalman--other algorithms can be chosen, it's also able to detect frequencies. – jay.sf Jan 21 '20 at 12:43