I need to pivot some wide time-series data, that is varying width regard to width using tidyr's pivot_longer()
.
The data is quarterly, but I receive the data in both years-blocks (with with four quarters) and in six months blocks (with only two quarters), i.e. the data is varying with regard to width.
I would like to find a simple and flexible solution that can be used in a loop, as I need to import many years and six months blocks (and, as I need to convince my research group to use R, I am asking here for a simple, smart, and clean solution using (preferably) tidyverse).
The data looks kinda like this in the years-blocks,
dta_wide1 <- structure(list(V1 = c("", "", "", "", "", "", "", "peach", "dragonfruit", "honeydew", "huckleberry", "", ""), V2 = c("ABC", "some info", "Store A", "", "As of 31/03/2019", "label1", "", "7", "5", "6", "1", "(a) some useless clutter", "(b) more not relevent information"), V3 = c("", "", "", "", "", "", "label2", "0.5", "0.4", "0.8", "0.3", "", ""), V4 = c("", "", "", "", "", "label4", "label4a", "21", "21", "87", "21", "", ""), V5 = c("", "", "", "", "", "", "label4b", "0.3", "0.1", "0.4", "0.2", "", ""), V6 = c("", "", "", "", "As of 30/06/2019", "label1", "", "5", "2", "3", "7", "", ""), V7 = c("", "", "", "", "", "", "label2", "0.46", "0.72", "0.7", "0.8", "", ""), V8 = c("", "", "", "", "", "label4", "label4a", "19", "22", "85", "25", "", ""), V9 = c("", "", "", "", "", "", "label4b", "0.4", "0.1", "0.3", "0.2", "", ""), V10 = c("", "", "", "", "As of 30/09/2019", "label1", "", "4", "1", "4", "8", "", ""), V11 = c("", "", "", "", "", "", "label2", "0.1", "0.3", "0.6", "0.22", "", ""), V12 = c("", "", "", "", "", "label4", "label4a", "21", "23", "71", "27", "", ""), V13 = c("", "", "", "", "", "", "label4b", "0.3", "0.1", "0.4", "0.2", "", ""), V14 = c("", "", "", "", "As of 31/12/2019", "label1", "", "8", "6", "9", "9", "", ""), V15 = c("", "", "", "", "", "", "label2", "0.7", "0.87", "0.55", "0.33", "", ""), V16 = c("", "", "", "", "", "label4", "label4a", "24", "25", "99", "35", "", ""), V17 = c("", "", "", "", "", "", "label4b", "0.3", "0.1", "0.4", "0.2", "", "")), class = "data.frame", row.names = c(NA, -13L))
and like this in the six months blocks,
dta_wide2 <- structure(list(V1 = c("", "", "", "", "", "", "", "peach", "dragonfruit", "honeydew", "huckleberry", "", ""), V2 = c("ABC", "some info", "Store A", "", "As of 31/03/2020", "label1", "", "2", "3", "4", "8", "(a) some useless clutter", "(b) more not relevent information"), V3 = c("", "", "", "", "", "", "label2", "0.1", "0.2", "0.3", "0.8", "", ""), V4 = c("", "", "", "", "", "label4", "label4a", "10", "11", "12", "9", "", ""), V5 = c("", "", "", "", "", "", "label4b", "0.3", "0.1", "0.4", "0.2", "", ""), V6 = c("", "", "", "", "As of 30/06/2020", "label1", "", "4", "6", "8", "16", "", ""), V7 = c("", "", "", "", "", "", "label2", "0.22", "0.33", "0.44", "0.55", "", ""), V8 = c("", "", "", "", "", "label4", "label4a", "11", "12", "13", "10", "", ""), V9 = c("", "", "", "", "", "", "label4b", "0.4", "0.1", "0.3", "0.2", "", "")), class = "data.frame", row.names = c(NA, -13L))
i.e. (for the six months block)
# install.packages(c("tidyverse"), dependencies = TRUE)
library(tidyverse)
dta_wide2 %>% as_tibble
# A tibble: 13 x 9
V1 V2 V3 V4 V5 V6 V7 V8 V9
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 "" "ABC" "" "" "" "" "" "" ""
2 "" "some info" "" "" "" "" "" "" ""
3 "" "Store A" "" "" "" "" "" "" ""
4 "" "" "" "" "" "" "" "" ""
5 "" "As of 31/03/~ "" "" "" "As of ~ "" "" ""
6 "" "label1" "" "label~ "" "label1" "" "labe~ ""
7 "" "" "labe~ "label~ "labe~ "" "lab~ "labe~ "labe~
8 "peach" "2" "0.1" "10" "0.3" "4" "0.2~ "11" "0.4"
9 "dragon~ "3" "0.2" "11" "0.1" "6" "0.3~ "12" "0.1"
10 "honeyd~ "4" "0.3" "12" "0.4" "8" "0.4~ "13" "0.3"
11 "huckle~ "8" "0.8" "9" "0.2" "16" "0.5~ "10" "0.2"
12 "" "(a) some use~ "" "" "" "" "" "" ""
13 "" "(b) more not~ "" "" "" "" "" "" ""
in dta_wide2
the date keys are floting around like this
> dta_wide2[5,] %>% str_sub(start= -10) %>% lubridate::dmy()
[1] NA "2020-03-31" NA NA NA
[6] "2020-06-30" NA NA NA
so I try to tidy it up like this
dta_wide2 %>%
add_column(date1 = dta_wide2[5,2] %>% str_sub(start= -10) %>% lubridate::dmy(), .before = 2) %>%
add_column(date2 = dta_wide2[5,6] %>% str_sub(start= -10) %>% lubridate::dmy(), .before = 6) %>%
add_column(store = dta_wide2[3,2], .before = 2) %>% as_tibble
# A tibble: 13 x 12
V1 store date1 V2 V3 V4 date2 V5 V6 V7
<chr> <chr> <date> <chr> <chr> <chr> <date> <chr> <chr> <chr>
1 "" Stor~ 2020-03-31 "ABC" "" "" 2020-06-30 "" "" ""
2 "" Stor~ 2020-03-31 "som~ "" "" 2020-06-30 "" "" ""
3 "" Stor~ 2020-03-31 "Sto~ "" "" 2020-06-30 "" "" ""
4 "" Stor~ 2020-03-31 "" "" "" 2020-06-30 "" "" ""
5 "" Stor~ 2020-03-31 "As ~ "" "" 2020-06-30 "" "As ~ ""
6 "" Stor~ 2020-03-31 "lab~ "" "lab~ 2020-06-30 "" "lab~ ""
7 "" Stor~ 2020-03-31 "" "lab~ "lab~ 2020-06-30 "lab~ "" "lab~
8 "pea~ Stor~ 2020-03-31 "2" "0.1" "10" 2020-06-30 "0.3" "4" "0.2~
9 "dra~ Stor~ 2020-03-31 "3" "0.2" "11" 2020-06-30 "0.1" "6" "0.3~
10 "hon~ Stor~ 2020-03-31 "4" "0.3" "12" 2020-06-30 "0.4" "8" "0.4~
11 "huc~ Stor~ 2020-03-31 "8" "0.8" "9" 2020-06-30 "0.2" "16" "0.5~
12 "" Stor~ 2020-03-31 "(a)~ "" "" 2020-06-30 "" "" ""
13 "" Stor~ 2020-03-31 "(b)~ "" "" 2020-06-30 "" "" ""
# ... with 2 more variables: V8 <chr>, V9 <chr>
Now, I need to pivot it longer using, if I get it corret, pivot_longer
, however my challenge is how -- when I also get data that looks like dta_wide1
, i.e. with four quarters -- do I do it in a flexible way that I can use for both dta_wide1
and dta_wide2
.
I've been working on this for some time and any help to make it work, simplyfy or clean it up will be very much appriciated.
Here's where I'm currently at, but it’s not right, not flexible, and not simply
dta_wide2_foo <- dta_wide2
names(dta_wide2_foo) <- c('goods', paste0(dta_wide2[6,2:5], dta_wide2[7,2:5], sep = '_1'), paste0(dta_wide2[6,2:5], dta_wide2[7,2:5], sep = '_2'))
dta_wide2_foo %>%
add_column(date1 = dta_wide2[5,2] %>% str_sub(start= -10) %>% lubridate::dmy(), .before = 2) %>%
add_column(date2 = dta_wide2[5,6] %>% str_sub(start= -10) %>% lubridate::dmy(), .before = 6) %>%
add_column(store = dta_wide2[3,2], .before = 2) %>% as_tibble %>% .[8:11,] %>%
pivot_longer(-c(goods, store, date1, date2), values_to = "Value", names_to = "variable") %>% print(n = 100)
Or, some generic snippet, that is not neither simple, smart, or clean, but it could be used to get the positions of the dates in both sample data in a loop
dta <- dta_wide2
dta[5,] %>% str_sub(start= -10) %>% lubridate::dmy() %>% { which(!is.na(.)) }
[1] 2 6
Or, cleaner,
dta <- dta_wide1
dta[5,] %>% grep("As ",.)
[1] 2 6 10 14
update 2020-06-08 07:45:18Z
My goal is to combine the long data sets to able to plot the data, (Wimpel suggest below that I combine my varying wide data sets, i.e. dta_wide1
, dta_wide2
, ...
dta_widen
, using a lapply() call) I imagine data that looks something like this,
> dta_long
# A tibble: 96 x 5
product label value date store
<chr> <chr> <dbl> <date> <chr>
1 peach label1 7 2019-03-31 Store A
2 peach label2 0.5 2019-03-31 Store A
3 peach label4a 21 2019-03-31 Store A
4 peach label4b 0.3 2019-03-31 Store A
5 peach label1 5 2019-06-30 Store A
6 peach label2 0.46 2019-06-30 Store A
7 peach label4a 19 2019-06-30 Store A
8 peach label4b 0.4 2019-06-30 Store A
9 peach label1 4 2019-09-30 Store A
10 peach label2 0.1 2019-09-30 Store A
# ... with 86 more rows
and then ggplot2/plotting the date with something like this,
dta_long %>% filter(label == 'label1') %>% ggplot(aes(date, value, colour = product)) +
geom_line() + scale_x_date(date_breaks = "3 months",
date_labels = "%b-%y", limits = c((min(dta_long$date)-34), max = max(dta_long$date)))