1

I'm trying to load data from Quandl with collapse = "monthly". Some of the values are only available in a yearly or halve-yearly fashion. Some other values are only available within certain periods of time. This leaves me with a lot of inhomogeneous data. How can I fill the yearly and halve-yearly data in a "Last observation carried forward" fashion and the remaining NAs with 0?

Here is my idea of the data I got and the one I want to have at the end:

library(tibble)

set.seed(4711)

# How do I get from:
#
df.start <- data_frame(
  Date = seq.Date(as.Date("1990-01-01"), as.Date("1999-12-01"), "1 month"),
  B = rep(NA, 120),
  C = c(rep(NA, 50), rnorm(120 - 50)),
  D = rep(c(rnorm(1), rep(NA, 11)), 10),
  E = c(rep(NA, 24), rep(c(rnorm(1), rep(NA, 11)), 8)),
  F = c(rep(NA, 45), rnorm(50), rep(NA, 25)),
  G = c(rep(NA, 24), rep(c(rnorm(1), rep(NA, 11)), 6), rep(NA, 24)),
  H = c(rep(NA, 10), rnorm(20), rep(NA, 16), rnorm(37), rep(NA, 37)),
  I = rep(c(rnorm(1), rep(NA, 5)), 20)
)
#
# To:
#
df.end <- data_frame(
  Date = seq.Date(as.Date("1990-01-01"), as.Date("1999-12-01"), "1 month"),
  B = rep(0, 120),
  C = c(rep(0, 50), rnorm(120 - 50)),
  D = rep(rnorm(10), each = 12),
  E = c(rep(0, 24), rep(rnorm(8), each = 12)),
  F = c(rep(0, 45), rnorm(50), rep(0, 25)),
  G = c(rep(0, 24), rep(rnorm(6), each = 12), rep(0, 24)),
  H = c(rep(0, 10), rnorm(20), rep(0, 16), rnorm(37), rep(0, 37)),
  I = rep(rnorm(20), each = 6)
)
#
# Automatically?
#
Someone2
  • 421
  • 2
  • 15
  • See `?zoo::na.locf` – A. Suliman Dec 06 '18 at 11:10
  • Yeah sure, but it only works for some of the columns, while I'm looking for a solution that works for every column. Something like `df.end <- mutate_all(df.start ,function(x){#TODO})`. – Someone2 Dec 06 '18 at 11:14
  • So since whether to fill down with the last non-NA entry depends on `Date`, you should clearly explain this dependence so that we don't need to analyze your example. – Julius Vainora Dec 06 '18 at 11:55
  • It depends on the type of data, if it is monthly (use `na.replace`) or yearly (use `na.locf`), not on the `Date`. – Someone2 Dec 06 '18 at 12:03

1 Answers1

0

You can use fill to fill the NAs with the last non-empty value (except for the Date column), and then replace the remaining NAs by 0. We do these operations grouped by year.

library(tidyverse)
library(lubridate)

df.end <- df.start %>%
  mutate(year = year(Date)) %>%
  group_by(year) %>%
  fill(., colnames(df.start[-1])) %>%
  replace(., is.na(.), 0) %>%
  ungroup() %>%
  select(-year)
marc_s
  • 732,580
  • 175
  • 1,330
  • 1,459
NRLP
  • 568
  • 3
  • 16
  • Thank you for your answer, but what about the last values of column H and I? They are also filled. – Someone2 Dec 06 '18 at 12:56
  • @Someone2 They get filled out as expected: with the last non-empty value from the respective columns. Unless I misunderstood your question: did you expect them to be filled differently? – NRLP Dec 06 '18 at 12:59
  • Ideally they should contain a zero, like in `df.end`. Since this is the hard part about it. Maybe using a gateway to fill the inner part differently would work. – Someone2 Dec 06 '18 at 13:02
  • @Someone2 Hmm.. I am not sure now I understand the question. I checked your `df.end`, and I think I do not understand the filling pattern. For instance, on column `D` the first value in `df.start` is `-1.712529`, but in`df.end` it is filled out by `0.02222674`. Where did `0.02222674` come from? – NRLP Dec 06 '18 at 13:08
  • Yeah this is an error of mine due to the randomness, it should be the same value. I just difference between a value, a `NA` and a zero. Where the values of `df.start` and `df.end` should be the same. – Someone2 Dec 06 '18 at 13:11
  • 1
    @Someone2 I think I figured it out. The values should be grouped by year. Please see my updated answer. – NRLP Dec 06 '18 at 13:30