0

I am trying to autopopulate data. I need a command to fill the gaps between dates: Intended Output shows the gaps with similar precedence fills. The filling requires to fill the missing date entries.

In R, how to autopopulate the data by filling the gaps between dates in some date sequence?


Minimal working examples

Input for the dates sequence (2016-12-25, 2017-01-05)

> aa<- data.frame(a=c(1,11,111),b=c(2,22,222),length=c(3,5,1),date=c(as.Date("28.12.2016",format="%d.%m.%Y"), as.Date("30.12.2016",format="%d.%m.%Y"), as.Date("01.01.2017",format="%d.%m.%Y")))
> 
> dateSeq<-seq.Date(as.Date("2016/12/25"), as.Date("2017/01/05"), "day")
> dateSeq
 [1] "2016-12-25" "2016-12-26" "2016-12-27" "2016-12-28" "2016-12-29"
 [6] "2016-12-30" "2016-12-31" "2017-01-01" "2017-01-02" "2017-01-03"
[11] "2017-01-04" "2017-01-05"

>
> aa
    a   b length       date
1   1   2      3 2016-12-28
2  11  22      5 2016-12-30
3 111 222      1 2017-01-01

that shows the data recorded. The NA-fills, shown to some extent by Joel Wilson, is done such that

dateSeq<-seq.Date(as.Date("2016/12/25"), as.Date("2017/01/05"), "day")
df<-data.frame(dateSeq)
df$date = as.Date(df$date, format = "%Y-%m-%d")
merge(df, aa, by = "date", all.x= TRUE)
     a   b       length  date
0.1  NA  NA      NA      2016-12-25
0.2  NA  NA      NA      2016-12-26
0.3  NA  NA      NA      2016-12-27
1    1   2       3       2016-12-28
0.4  NA  NA      NA      2016-12-29
2    11  22      5       2016-12-30
0.5  NA  NA      NA      2016-12-31
3    111 222     1       2017-01-01
0.6  NA  NA      NA      2017-01-02
0.7  NA  NA      NA      2017-01-03
0.8  NA  NA      NA      2017-01-04
0.9  NA  NA      NA      2017-01-05
0.10 NA  NA      NA      2017-01-06

where our goal is to fill the NA entries with something called similar precedence approach that depends on its neighbours.

Intended Output with the input date-complemented and similar precedence fills

     a   b       length  date
0.1  0.9 2       3       2016-12-25
0.2  1   2.1     2       2016-12-26
0.3  0.8 2.2     3       2016-12-27
1    1   2       3       2016-12-28
0.4  10  20      4       2016-12-29
2    11  22      5       2016-12-30
0.5  80  150     3       2016-12-31
3    111 222     1       2017-01-01
0.6  100 130     5       2017-01-02
0.7  50  200     3       2017-01-03
0.8  20  100     2       2017-01-04
0.9  14  40      5       2017-01-05
0.10 80  140     4       2017-01-06
Community
  • 1
  • 1
  • 1
    all that you need to do is merge to get firstintended output – joel.wilson Dec 28 '16 at 12:41
  • huh?... how did you get the values for `a`, `b` and `length`? – Sotos Dec 28 '16 at 13:14
  • @Sotos by `similar precedence` approach where values similar to its neighbours, I originally used LOCF (last observation carried forward) but now trying to find better alternatives to filling. – Regan Alpha Dec 28 '16 at 13:18
  • is there a function or any constrains to define your requirements of `similar precedence`? – Sotos Dec 28 '16 at 13:21
  • @Sotos no, I try to find method that is somewhat close to the values around it: moving average models, auto-regression models and ARIMA models could be some approaches where the data is assumed to act as time-series. I have no idea whether they are implemented to R for this kind of case. – Regan Alpha Dec 28 '16 at 13:26
  • but you only have 1 value for each var to begin with. How are you going to forecast then? – Sotos Dec 28 '16 at 13:29
  • @Sotos for example by moving averages: 2017-01-02 depends on all of its precedent values where the closest, 2017-01-01, affects it most. Values such as 3 (2016-12-28), 5 (2016-12-30) and 1 (2017-01-01) affect it. The values bteween 2016-12-25 and 2016-12-27 cannot be forecasted by the precedent values but we can estimate the moving averages over the columns. – Regan Alpha Dec 28 '16 at 13:38

1 Answers1

1

My point was:

dateSeq<-seq.Date(as.Date("2016/12/25"), as.Date("2017/01/05"), "day")
df<-data.frame(dateSeq)
df
#          date
# 1: 2016-12-25
# 2: 2016-12-26
# 3: 2016-12-27
# 4: 2016-12-28
# 5: 2016-12-29
# 6: 2016-12-30
# 7: 2016-12-31
# 8: 2017-01-01
# 9: 2017-01-02
#10: 2017-01-03
#11: 2017-01-04
#12: 2017-01-05
#13: 2017-01-06

df$date = as.Date(df$date, format = "%Y-%m-%d")
merge(df, aa, by = "date", all.x= TRUE)
#          date   a   b length
# 1: 2016-12-25  NA  NA     NA
# 2: 2016-12-26  NA  NA     NA
# 3: 2016-12-27  NA  NA     NA
# 4: 2016-12-28   1   2      3
# 5: 2016-12-29  NA  NA     NA
# 6: 2016-12-30  11  22      5
# 7: 2016-12-31  NA  NA     NA
# 8: 2017-01-01 111 222      1
# 9: 2017-01-02  NA  NA     NA
#10: 2017-01-03  NA  NA     NA
#11: 2017-01-04  NA  NA     NA
#12: 2017-01-05  NA  NA     NA
#13: 2017-01-06  NA  NA     NA
joel.wilson
  • 8,243
  • 5
  • 28
  • 48