
I am working with climate data from five data loggers I used in a field experiment. Some climate data are missing, partly because the loggers were installed after the start of the experiment and partly because of defective loggers. To interpolate the missing data, I have climate data from two nearby weather stations. I am looking for a fast and safe way to replace the missing values in this dataset, because the approach I have come up with is very tedious and quite convoluted.

df <- data.frame(date = c("2015-06-17","2015-06-18","2015-06-19","2015-06-20","2015-06-21"),
                 meantemp1 = c(15,17,19,15,19), maxtemp1 = c(18,25,27,25,28),
                 meantemp2 = c(13,12,12,18,14), maxtemp2 = c(22,25,25,24,26),
                 meantemp3 = c(NA,NA,21,17,21), maxtemp3 = c(NA,NA,29,25,29),
                 meantemp4 = c(NA,14,14,20,16), maxtemp4 = c(NA,27,27,26,28)) #create toy dataset
df$date <- as.Date(df$date)

df
        date meantemp1 maxtemp1 meantemp2 maxtemp2 meantemp3 maxtemp3 meantemp4 maxtemp4
1 2015-06-17        15       18        13       22        NA       NA        NA       NA
2 2015-06-18        17       25        12       25        NA       NA        14       27
3 2015-06-19        19       27        12       25        21       29        14       27
4 2015-06-20        15       25        18       24        17       25        20       26
5 2015-06-21        19       28        14       26        21       29        16       28

Let's say the first four temperature columns correspond to weather station data and the last four columns to data logger data. In reality I look at many more climatic factors. Now I shorten the dataset to a relevant time scale (in reality a month) without any NA values, in order to calculate correlations.

#create a short dataset without NA values
dfshort <- df[df$date >= "2015-06-19" & df$date <= "2015-06-21", ]
dfshort$date <- as.numeric(dfshort$date) #date needs to be numeric for cor()
corrmatrix <- cor(dfshort)
library(reshape)
m <- melt(corrmatrix) #show correlation matrix as a list
m <- m[order(-abs(m$value)), ] #order by absolute correlation value

View(m)
          X1        X2 value
1       date      date     1
11 meantemp1 meantemp1     1
15 meantemp3 meantemp1     1
16  maxtemp3 meantemp1     1
21  maxtemp1  maxtemp1     1
31 meantemp2 meantemp2     1
35 meantemp4 meantemp2     1
10      date meantemp1     0
46      date meantemp3     0
55      date  maxtemp3     0
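
As an aside, the date-to-numeric conversion could be avoided altogether by leaving the date column out of the correlation, since its correlations are not of interest here anyway:

corrmatrix <- cor(dfshort[ , -1]) #assumes date is the first column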

To get a better overview of which factors are most correlated, I shorten this list to the cases where the 'logger data' depend on the 'weather station data'.

m1 <- subset(m, X1 %in% c('meantemp3', 'maxtemp3', 'meantemp4', 'maxtemp4'))
#select logger data for the first ("dependent") column
m2 <- subset(m1, X2 %in% c('meantemp1', 'maxtemp1', 'meantemp2', 'maxtemp2'))
#select weather station data for the second ("reference") column

head(m2, 10)

          X1        X2      value
15 meantemp3 meantemp1  1.0000000
16  maxtemp3 meantemp1  1.0000000
35 meantemp4 meantemp2  1.0000000
45  maxtemp4  maxtemp2  1.0000000
27  maxtemp4  maxtemp1  0.9819805
17 meantemp4 meantemp1 -0.9449112
24 meantemp3  maxtemp1  0.9449112
26 meantemp4  maxtemp1 -0.7857143
36  maxtemp4 meantemp2 -0.6546537
44 meantemp4  maxtemp2 -0.6546537
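
Since m2 is already sorted by absolute correlation, the single best reference per logger variable could also be pulled out programmatically instead of by eye, e.g. in base R:

#first occurrence of each logger variable = its highest |correlation|
bestref <- m2[!duplicated(m2$X1), ]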

For now, I mark down the highest correlations for the 'logger data' by hand and create linear models following this post:

#formulate linear models
model.meantemp3 <- lm(meantemp3 ~ meantemp1, data = df)
model.maxtemp3  <- lm(maxtemp3  ~ meantemp1, data = df)
model.meantemp4 <- lm(meantemp4 ~ meantemp2, data = df)
model.maxtemp4  <- lm(maxtemp4  ~ maxtemp2,  data = df)
#predict values as new columns
df$predict.meantemp3 <- predict(model.meantemp3, newdata = df)
df$predict.maxtemp3  <- predict(model.maxtemp3,  newdata = df)
df$predict.meantemp4 <- predict(model.meantemp4, newdata = df)
df$predict.maxtemp4  <- predict(model.maxtemp4,  newdata = df)
#replace (only) NAs with predictions
df$meantemp3 <- ifelse(is.na(df$meantemp3), df$predict.meantemp3, df$meantemp3)
df$maxtemp3  <- ifelse(is.na(df$maxtemp3),  df$predict.maxtemp3,  df$maxtemp3)
df$meantemp4 <- ifelse(is.na(df$meantemp4), df$predict.meantemp4, df$meantemp4)
df$maxtemp4  <- ifelse(is.na(df$maxtemp4),  df$predict.maxtemp4,  df$maxtemp4)
#tadaa!
df <- df[-(10:13)] #drop the helper prediction columns
head(df) #dataset without NAs

        date meantemp1 maxtemp1 meantemp2 maxtemp2 meantemp3 maxtemp3 meantemp4 maxtemp4
1 2015-06-17        15       18        13       22        17       25        15       24
2 2015-06-18        17       25        12       25        19       27        14       27
3 2015-06-19        19       27        12       25        21       29        14       27
4 2015-06-20        15       25        18       24        17       25        20       26
5 2015-06-21        19       28        14       26        21       29        16       28

There has to be a more concise and less error-prone way to do this, and I can't be the only one with this issue, as this unanswered stackexchange question suggests. I have looked for packages to do this (e.g. the package "mice"), but they tend to produce quite complicated output. Geographers seem to agree that simple linear models are too primitive for temperature data imputation; however, my climate data are highly correlated, so I want to stick with this approach for the sake of simplicity.
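
For illustration, the manual steps above could in principle be collapsed into a single base-R loop along these lines (a sketch, only tried on the toy data; it keeps my correlation-based choice of reference variable and fills only the NAs):

station_cols <- c("meantemp1", "maxtemp1", "meantemp2", "maxtemp2")
logger_cols  <- c("meantemp3", "maxtemp3", "meantemp4", "maxtemp4")

for (lc in logger_cols) {
  #correlation of this logger variable with every station variable,
  #using only rows where both are observed (a subset like dfshort would also work)
  cors <- sapply(station_cols, function(sc)
    cor(df[[lc]], df[[sc]], use = "complete.obs"))
  best <- station_cols[which.max(abs(cors))] #best reference variable
  fit  <- lm(reformulate(best, response = lc), data = df) #lm drops NA rows itself
  pred <- predict(fit, newdata = df)
  df[[lc]] <- ifelse(is.na(df[[lc]]), pred, df[[lc]]) #fill only the gaps
}

This removes the copy-and-paste, but I would still be glad to hear of a more established approach.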

Help is highly appreciated!

Benno
  • I'm not sure how complicated you want to make this, but the most accurate approach would likely be to use a distance-weighted interpolation. Is there missing data for all 5 data loggers or only some? Where are the weather stations located compared to the loggers? Where are the loggers located compared to each other? You may be better off imputing missing data on one logger from another logger instead of weather stations, depending on a variety of factors. Also, you are probably better off selecting your prediction approach based on model fit statistics like AIC or RMSE instead of correlation. – Cotton.Rockwood Sep 20 '18 at 18:55
  • If you want to stick with a simple linear interpolation, you can just fit the linear models and select based on their fit rather than using correlation first. I would look into a dplyr/purrr and `map()` approach. – Cotton.Rockwood Sep 20 '18 at 18:59
  • This may give you a sense for using `map` for your purposes: https://sebastiansauer.github.io/purrr-map-no-do/ – Cotton.Rockwood Sep 20 '18 at 19:21
  • Thanks for your comments, @Cotton.Rockwood! An AIC-based selection of the predictive dataset sounds sensible. The data is sometimes missing at all sites and sometimes only in one or two loggers; no data logger is farther away than 15 km, but the sites lie along an altitudinal gradient (sometimes >800 m of elevation) and sometimes show quite different temperatures than adjacent weather stations, a result of different measurement methods. Thanks for the tip on the map() approach, I will look into that. – Benno Sep 20 '18 at 19:53
  • @Benno: maybe use the [PRISM](https://www.researchgate.net/post/How_to_implement_PRISM_method_for_mapping_precipitation_on_daily_scale) or the method by [John Abatzoglou et al. (2013)](http://www.climatologylab.org/gridmet.html)? – Tung Sep 21 '18 at 08:07
  • Thanks, @Tung , I think the PRISM approach might be a bit overkill for two reference datasets, but I will see if I can get some of the quality control measures to work. – Benno Sep 24 '18 at 14:56
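
Following up on the AIC suggestion from the comments, selecting the reference variable by model fit instead of raw correlation could look roughly like this (a base-R sketch with a hypothetical helper name, not tested on the real data; since the NAs sit in the response, all candidate models for one logger variable drop the same rows, so their AICs are comparable):

#hypothetical helper: fit one lm per candidate predictor, keep the lowest-AIC fit
best_by_aic <- function(response, candidates, data) {
  fits <- lapply(candidates, function(x)
    lm(reformulate(x, response = response), data = data))
  fits[[which.min(sapply(fits, AIC))]]
}
model.meantemp3 <- best_by_aic("meantemp3",
                               c("meantemp1", "maxtemp1", "meantemp2", "maxtemp2"), df)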
