I am working with climate data of several (5) data loggers I used in a field experiment. I have some missing climate data, partially because the loggers were installed after start of the experiment, partially because of defect loggers. For interpolation of my missing data, I have climate data from two nearby weather stations. I'm trying to find a rapid and safe way to replace the missing data in this dataset, because the way I have come up with is very tedious and quite convoluted.
df <- data.frame(date=c("2015-06-17","2015-06-18","2015-06-19","2015-06-20","2015-06-21"),
meantemp1=c(15,17,19,15,19),maxtemp1=c(18,25,27,25,28),
meantemp2=c(13,12,12,18,14),maxtemp2=c(22,25,25,24,26),
meantemp3=c(NA,NA,21,17,21),maxtemp3=c(NA,NA,29,25,29),
meantemp4=c(NA,14,14,20,16),maxtemp4=c(NA,27,27,26,28))#create toy dataset
df$date <- as.Date(df$date)
View(df)
date meantemp1 maxtemp1 meantemp2 maxtemp2 meantemp3 maxtemp3 meantemp4 maxtemp4
1 2015-06-17 15 18 13 22 NA NA NA NA
2 2015-06-18 17 25 12 25 NA NA 14 27
3 2015-06-19 19 27 12 25 21 29 14 27
4 2015-06-20 15 25 18 24 17 25 20 26
5 2015-06-21 19 28 14 26 21 29 16 28
Let's say the first four column correspond to weather station data, and the last four columns correspond to data logger data. In reality, I look at many more climatic factors. Now, I shorten the dataset to a relevant timescale (irl a month), without any NA values, to calculate correlation.
#create a short dataset without NA values
dfshort <- df[df$date>="2015-06-19"&df$date<="2015-06-21 ",]
dfshort$date<-as.numeric(dfshort$date)#date needs to be transformed to numeric for cor()
corrmatrix <-((cor(dfshort)))
library(reshape)
m <- melt(corrmatrix)#show correlation matrix as a list
m <- m[order(- abs(m$value)), ]#order correlation matrix according to correlation values
View(m)
X1 X2 value
1 date date 1
11 meantemp1 meantemp1 1
15 meantemp3 meantemp1 1
16 maxtemp3 meantemp1 1
21 maxtemp1 maxtemp1 1
31 meantemp2 meantemp2 1
35 meantemp4 meantemp2 1
10 date meantemp1 0
46 date meantemp3 0
55 date maxtemp3 0
To get a better overview which factors are most correlated, I shorten this list to the cases where the 'logger data' are dependent on the 'weather station data'.
m1 <- subset(m, X1 %in% c('meantemp3', 'maxtemp3', 'meantemp4', 'maxtemp4'))
#select logger data for first ("dependent") column
m2 <-subset(m1,X2 %in% c('meantemp1', 'maxtemp1', 'meantemp2', 'maxtemp2'))
#select cases with weather station data for second ("reference")column
View(m2)
X1 X2 value
15 meantemp3 meantemp1 1.0000000
16 maxtemp3 meantemp1 1.0000000
35 meantemp4 meantemp2 1.0000000
45 maxtemp4 maxtemp2 1.0000000
27 maxtemp4 maxtemp1 0.9819805
17 meantemp4 meantemp1 -0.9449112
24 meantemp3 maxtemp1 0.9449112
26 meantemp4 maxtemp1 -0.7857143
36 maxtemp4 meantemp2 -0.6546537
44 meantemp4 maxtemp2 -0.6546537
Now I mark down the highest correlations for the 'logger data', and create lm's following this post:
#formulate linear models
model.meantemp3 <- lm(meantemp3 ~ meantemp1, data = df)
model.maxtemp3 <- lm(maxtemp3 ~ meantemp1, data = df)
model.meantemp4 <- lm(meantemp4 ~ meantemp2, data = df)
model.maxtemp4 <- lm(maxtemp4 ~ maxtemp2, data = df)
#predict values as column
df$predict.meantemp3 = predict(model.meantemp3, newdata = df)
df$predict.maxtemp3 = predict(model.maxtemp3, newdata = df)
df$predict.meantemp4 = predict(model.meantemp4, newdata = df)
df$predict.maxtemp4 = predict(model.maxtemp4, newdata = df)
# replace (only) NAs with predictions
df$meantemp3 = ifelse(is.na(df$meantemp3), df$predict.meantemp3, df$meantemp3)
df$maxtemp3 = ifelse(is.na(df$maxtemp3), df$predict.maxtemp3,df$maxtemp3)
df$meantemp4 = ifelse(is.na(df$meantemp4), df$predict.meantemp4, df$meantemp4)
df$maxtemp4 = ifelse(is.na(df$maxtemp4), df$predict.maxtemp4, df$maxtemp4)
#tadaa!
df<- df[c(-10:-13)] #drop column we are not interested in
head(df) #dataset without NA's
date meantemp1 maxtemp1 meantemp2 maxtemp2 meantemp3 maxtemp3 meantemp4 maxtemp4
1 2015-06-17 15 18 13 22 17 25 15 24
2 2015-06-18 17 25 12 25 19 27 14 27
3 2015-06-19 19 27 12 25 21 29 14 27
4 2015-06-20 15 25 18 24 17 25 20 26
5 2015-06-21 19 28 14 26 21 29 16 28
There has to be a more concise and less error-prone way to do this, and I can't be the only one with this issue, as this unanswered stackexchange question suggests. I have looked for packages to do this (package "mice" e.g.), but they tend to end up in quite complicated outputs. Geographers seem to agree that simple linear models are too primitive for temperature data imputation. However, my climate data are highly correlated, so I want to do it this way for the sake of simplicity.
Help is highly appreciated!