Is there any way I can use imputeTS for time series prediction with multiple regression variables? My y is minute-level data with blanks (NAs), while all my X variables (x1, x2, ..., xn) are continuous and have no NAs.

DateTime        Processed   Avg     1_Q   Median    3_Q

04/01/20 3:22       3       1.8      1      2       2.5
04/01/20 3:23       3       1.6      1      1       2
04/01/20 3:24       1       1.5      1      1       2
04/01/20 3:25       1       1.2      1      1       1
04/01/20 3:28       1       1.1      1      1       1
04/01/20 3:29       1       1.7      1      1.5     2.8
04/01/20 3:32       1       1.6      1      1       2
04/01/20 3:33       2       1.4      1      1       2
04/01/20 3:35       1       1.4      1      1       1.8
04/01/20 3:38               1.4      1      1       2
04/01/20 3:39       2       1.4      1      1       2
04/01/20 3:41               1.2      1      1       2
04/01/20 3:42               1.2      1      1       1.8
04/01/20 3:44       1       1.3      1      1       2
04/01/20 3:45       1       1.2      1      1       1
04/01/20 3:46       1       1.6      1      2       2
04/01/20 3:47       1       1.8      1      2       2
04/01/20 3:48               1.2      -      1       2
04/01/20 3:52               1.3      1      1       1.3
04/01/20 3:53       2       1.9      1      2       2
04/01/20 3:54       1       0.9      1      1       1
04/01/20 3:56       1       1.3      1      1       1
04/01/20 3:57       2       1.1      1      1       1

A complete data set can be found here.

Steffen Moritz

1 Answer

imputeTS is really good for time series imputation, where you exploit the correlation of one variable with itself over time.

In your case there is a lot of useful information in the other variables (inter-variable correlations). imputeTS performs univariate time series imputation, so it looks at each variable and its correlation in time separately and ignores the other columns.
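To illustrate what "univariate" means here, this is a minimal sketch that applies imputeTS to the Processed column alone. The values are copied from the sample in the question, and the choice of `na_interpolation` with linear interpolation is only an example, not a recommendation. Note that the function works on the position of each value in the vector, so the irregular gaps between the timestamps are not taken into account.

```r
library(imputeTS)

# Processed column from the sample in the question (NA = blank cell)
processed <- c(3, 3, 1, 1, 1, 1, 1, 2, 1, NA, 2, NA, NA,
               1, 1, 1, 1, NA, NA, 2, 1, 1, 2)

# Univariate imputation: only the neighbouring values of Processed
# itself are used; Avg, 1_Q, Median and 3_Q play no role.
imputed <- na_interpolation(processed, option = "linear")

print(imputed)
```

Other univariate functions in the package, such as `na_kalman` or `na_ma`, can be swapped in the same way.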

Since your variables Avg, 1_Q, Median and 3_Q seem to be highly correlated with Processed (where your missing data are), another package is probably a better choice. missForest, imputeR and other packages that exploit inter-variable correlations (but not correlations in time) should work better here.
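As a rough sketch of the inter-variable approach, the following runs missForest on the sample rows from the question. The column names `Q3` and the data frame name `df` are chosen here for illustration, the 1_Q column is left out because it is almost constant in the sample, and the seed is arbitrary. With so few rows the result is only a demonstration of the call, not a statement about accuracy.

```r
library(missForest)

# Sample rows from the question (NA = blank cell in Processed)
df <- data.frame(
  Processed = c(3, 3, 1, 1, 1, 1, 1, 2, 1, NA, 2, NA, NA,
                1, 1, 1, 1, NA, NA, 2, 1, 1, 2),
  Avg       = c(1.8, 1.6, 1.5, 1.2, 1.1, 1.7, 1.6, 1.4, 1.4, 1.4, 1.4, 1.2,
                1.2, 1.3, 1.2, 1.6, 1.8, 1.2, 1.3, 1.9, 0.9, 1.3, 1.1),
  Median    = c(2, 1, 1, 1, 1, 1.5, 1, 1, 1, 1, 1, 1,
                1, 1, 1, 2, 2, 1, 1, 2, 1, 1, 1),
  Q3        = c(2.5, 2, 2, 1, 1, 2.8, 2, 2, 1.8, 2, 2, 2,
                1.8, 2, 1, 2, 2, 2, 1.3, 2, 1, 1, 1)
)

set.seed(1)  # random forests are stochastic; fix the seed for repeatability

# missForest predicts the missing Processed values from the other columns.
# randomForest may warn that the response has few unique values.
result <- missForest(df)

# The completed data frame is in $ximp; the imputed values are continuous,
# so round them if Processed has to stay an integer count.
completed <- result$ximp
completed$Processed <- round(completed$Processed)
print(completed)
```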

You might get even better results if you come up with your own imputation routine for the missing data. The missing data always seem to be in Processed, and Avg, Median and 3_Q seem to be statistics about Processed. For example, always using Avg rounded to the nearest integer as the replacement for Processed may already be quite good.
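The rounding idea can be sketched in base R as follows. The data frame name `df` and the new column `Processed_imputed` are made up for this example, and the values are the sample rows from the question. The last step compares the rounded Avg with Processed on the rows where Processed is known, which is a simple way to check how well the rule fits before relying on it for the full data set.

```r
# Sample rows from the question (NA = blank cell in Processed)
df <- data.frame(
  Processed = c(3, 3, 1, 1, 1, 1, 1, 2, 1, NA, 2, NA, NA,
                1, 1, 1, 1, NA, NA, 2, 1, 1, 2),
  Avg       = c(1.8, 1.6, 1.5, 1.2, 1.1, 1.7, 1.6, 1.4, 1.4, 1.4, 1.4, 1.2,
                1.2, 1.3, 1.2, 1.6, 1.8, 1.2, 1.3, 1.9, 0.9, 1.3, 1.1)
)

missing <- is.na(df$Processed)

# Replace each missing Processed value with Avg rounded to the nearest integer;
# observed values are kept unchanged.
df$Processed_imputed <- ifelse(missing, round(df$Avg), df$Processed)

# Sanity check on the observed rows: share of rows where the rule
# would have reproduced the known Processed value.
hit_rate <- mean(round(df$Avg[!missing]) == df$Processed[!missing])

print(df)
print(hit_rate)
```

If the hit rate on the observed rows is low, a model-based method such as the ones mentioned above is probably the safer option.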

Steffen Moritz