0

I have a large dataset made of multiple irregular timeseries with a specific date column for each series. I want to convert this dataset into a dataframe with a unique date column or into a zoo object.

I tried read_xls(), read.zoo(). I tried to reshape with pivot_longer(). I searched on the web but I have not found any solution yet.

date1 Var1 date2 Var2 date3 Var3
2023-01-13 100.1 2023-01-11 99.7 2022-11-24 102.3
2023-01-16 104.5 2023-01-12 NA 2022-11-25 99.9
2023-01-17 101.6 2023-01-13 99.9 2022-11-28 99.3
2023-01-18 101.8 2023-01-16 99.1 2022-11-29 NA
2023-01-19 NA 2023-01-17 99.5 2022-11-30 NA
Bertrand G
  • 37
  • 6

2 Answers2

1

I'll provide this answer until someone comes up with something more elegant.

library(tidyverse)
dat <- structure(list(date1 = structure(c(19370, 19373, 19374, 19375, 
                                          19376), class = "Date"), Var1 = c(100.1, 104.5, 101.6, 101.8, 
                                                                            NA), date2 = structure(c(19368, 19369, 19370, 19373, 19374
                                                                            ), class = "Date"), Var2 = c(99.7, 99.8, 99.9, NA, NA), date3 = structure(c(19320, 
                                                                                                                                                            19321, 19324, 19325, 19326), class = "Date"), Var3 = c(102.3, 
                                                                                                                                                                                                                   99.9, 99.3, 100.5, 100.1)), row.names = c(NA, -5L), class = c("tbl_df", 
                                                                                                                                                                                                                                                                                 "tbl", "data.frame"))
dat2 <- dat %>%
  pivot_longer(cols = contains("date"),
               names_to = "date") %>%
  select(date, value, contains("Var")) %>%
  arrange(date) %>%
  mutate(id = group_indices(group_by(., date))) %>%
  select(contains("Var"), date = value, id)

var_nms <- names(select(dat2, contains("Var"))) 
for (i in seq_along(var_nms)){
  dat2[[var_nms[i]]] <- if_else(dat2$id == i, dat2[[var_nms[i]]],
                                NA_real_)
}

out <- dat2 %>%
  mutate(Var = do.call(coalesce, pick(contains("Var")))) %>%
  select(date, Var)

out
#> # A tibble: 15 x 2
#>    date         Var
#>    <date>     <dbl>
#>  1 2023-01-13 100. 
#>  2 2023-01-16 104. 
#>  3 2023-01-17 102. 
#>  4 2023-01-18 102. 
#>  5 2023-01-19  NA  
#>  6 2023-01-11  99.7
#>  7 2023-01-12  99.8
#>  8 2023-01-13  99.9
#>  9 2023-01-16  NA  
#> 10 2023-01-17  NA  
#> 11 2022-11-24 102. 
#> 12 2022-11-25  99.9
#> 13 2022-11-28  99.3
#> 14 2022-11-29 100. 
#> 15 2022-11-30 100.

Created on 2023-04-12 with reprex v2.0.2

NicChr
  • 858
  • 1
  • 9
1

Using the data shown reproducibly in the Note at the end, assume that what is wanted is a zoo object with a separate column for each non-date column.

First create a grouping vector g which looks like c("Var1", "Var1", "Var2", "Var2", "Var3", "Var3") and then convert DF to a list and split it by g giving s. Finally convert each component of s to a zoo object and merge them using cbind. (If a data frame is wanted use fortify.zoo on the result.)

library(zoo)

g <- rep(names(DF)[sapply(DF, is.numeric)], each = 2)
s <- split(as.list(DF), g)
do.call("cbind", lapply(s, function(x) read.zoo(as.data.frame(x))))

giving:

            Var1 Var2  Var3
2022-11-24    NA   NA 102.3
2022-11-25    NA   NA  99.9
2022-11-28    NA   NA  99.3
2022-11-29    NA   NA    NA
2022-11-30    NA   NA    NA
2023-01-11    NA 99.7    NA
2023-01-12    NA   NA    NA
2023-01-13 100.1 99.9    NA
2023-01-16 104.5 99.1    NA
2023-01-17 101.6 99.5    NA
2023-01-18 101.8   NA    NA
2023-01-19    NA   NA    NA

This could be represented as a pipeline like this:

g <- rep(names(DF)[sapply(DF, is.numeric)], each = 2)
DF |>
  as.list() |>
  split(g) |>
  lapply(function(x) read.zoo(as.data.frame(x))) |>
  do.call(what = "cbind")

or

DF |>
  as.list() |>
  (\(x) split(x, rep(names(x)[sapply(x, is.numeric)], each = 2)))() |>
  lapply(\(x) read.zoo(as.data.frame(x))) |>
  do.call(what = "cbind")

Note

Lines <- "date1 Var1 date2 Var2 date3 Var3
2023-01-13 100.1 2023-01-11 99.7 2022-11-24 102.3
2023-01-16 104.5 2023-01-12 NA 2022-11-25 99.9
2023-01-17 101.6 2023-01-13 99.9 2022-11-28 99.3
2023-01-18 101.8 2023-01-16 99.1 2022-11-29 NA
2023-01-19 NA 2023-01-17 99.5 2022-11-30 NA"
DF <- read.table(text = Lines, header = TRUE)
G. Grothendieck
  • 254,981
  • 17
  • 203
  • 341
  • actually works with this example, but when running on my dataset I get the following error message: "Error in read.zoo(as.data.frame(x)) : index has 206 bad entries at data rows: 348 349 ..." – Bertrand G Apr 12 '23 at 13:03
  • There is no duplicate. The error comes from NAs at the end of my dataset. read.zoo is unable to read from the rows filled NAs. I read the threads about the difference between empty cells and NA. I tried to change their value using different strategies like na.fill() and other options like na ="". But it does not work. – Bertrand G Apr 13 '23 at 09:28
  • 1
    It is only NA dates that are the problem. NA's in the values are not an issue. Replace `as.data.frame(x)` with `na.omit(as.data.frame(x))` – G. Grothendieck Apr 13 '23 at 11:46