1

I have a dataset of multiple lakes with water level elevations through time. The observations are not regularly spaced and have many large gaps. Further, some of the older observations may be of lower or unknown quality. I created a separate model that does a reasonably good job of predicting water levels across time, but still misses the actual observations by varying amounts.

I would like to create a third inputed/interpolated set of data in which the solution is:

informed by the modeled values where observations are missing crosses the highly weighted observations and is informed by the lower weighted observations

So far, I have used the fable package's TSLM->interpolate to perform this. It works reasonably well, but I cannot see a way to introduce weighting to the process. Further, it relies to heavily on the global coefficient and intercepts making it a bit too volatile when the modeled value significantly misses the observed. I am thinking that I need to use some sort of weighted loess that relies on local coefficients and can accomodate weighting.

library(dplyr)
library(tsibble)
library(fable)
library(ggplot2)

test_data <-  data.frame(obs_year = c(2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009,2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009),
                       site_name = c("Lake1","Lake1","Lake1","Lake1","Lake1","Lake1","Lake1","Lake1","Lake1","Lake1","Lake2","Lake2","Lake2","Lake2","Lake2","Lake2","Lake2","Lake2","Lake2","Lake2"),
                       observed = c(100,200,NA, NA, NA, NA, 220, NA, NA, 125, NA,NA,425, NA, 475, NA, 450, 450, 475, 500),
                       weights = c(1,1,NA, NA, NA, NA, 2, NA, NA, 2, NA,NA,2, NA, 1, NA, 2, 2, 2, 2),
                       modeled = c(110,120,165,150, 200, 225, 240, 250, 150, 130, 450,430,415,400, 425, 450, 460, 460, 470, 490))

test_tsibble <- as_tsibble(test_data, key = site_name, index = obs_year)

tslm_interpolate <- test_tsibble %>%
  group_by(site_name) %>% 
  model(lm  = TSLM(observed~modeled)) %>%
  fabletools::interpolate(test_tsibble)

tslm_interpolate <- left_join(tslm_interpolate, test_data, by = c("site_name", "obs_year")) %>% 
  dplyr::select(obs_year, site_name, observed = observed.y, imputed = observed.x, modeled, weights)

tslm_interpolate %>% 
  ggplot(aes(x=obs_year))+
  geom_line(aes(y = imputed), color = "blue")+
  geom_line(aes(y = modeled), color = "red")+
  geom_point(aes(y = observed), color = "green")+
  facet_wrap(~site_name, scales = "free_y")
Steffen Moritz
  • 7,277
  • 11
  • 36
  • 55
  • Is your reason for imputing values because you are trying to use regression for your predictions (or some other modeling method that can't handle missing data)? If 'yes', I'd recommend using a modeling method that can handle missing data (gbm or rf). – Monk Aug 30 '19 at 20:48
  • There are a number of reasons we need a full time series that passes through the observations which are more policy/optics related than statistical requirements. – Bob SomeAle Sep 03 '19 at 15:32
  • I know this question is a year old, but it seems better suited at [stats.se] since it's about statistical methods more than debugging or improving code – camille Aug 11 '20 at 16:01

0 Answers0