6

I need to some wide time-series data, that is varying width regard to width using 's pivot_longer().

The data is quarterly, but I receive the data in both years-blocks (with with four quarters) and in six months blocks (with only two quarters), i.e. the data is varying with regard to width.

I would like to find a simple and flexible solution that can be used in a loop, as I need to import many years and six months blocks (and, as I need to convince my research group to use R, I am asking here for a simple, smart, and clean solution using (preferably) ).

The data looks kinda like this in the years-blocks,

dta_wide1 <- structure(list(V1 = c("", "", "", "", "", "", "", "peach", "dragonfruit", "honeydew", "huckleberry", "", ""), V2 = c("ABC", "some info", "Store A", "", "As of 31/03/2019", "label1", "", "7", "5", "6", "1", "(a) some useless clutter", "(b) more not relevent information"), V3 = c("", "", "", "", "", "", "label2", "0.5", "0.4", "0.8", "0.3", "", ""), V4 = c("", "", "", "", "", "label4", "label4a", "21", "21", "87", "21", "", ""), V5 = c("", "", "", "", "", "", "label4b", "0.3", "0.1", "0.4", "0.2", "", ""), V6 = c("", "", "", "", "As of 30/06/2019", "label1", "", "5", "2", "3", "7", "", ""), V7 = c("", "", "", "", "", "", "label2", "0.46", "0.72", "0.7", "0.8", "", ""), V8 = c("", "", "", "", "", "label4", "label4a", "19", "22", "85", "25", "", ""), V9 = c("", "", "", "", "", "", "label4b", "0.4", "0.1", "0.3", "0.2", "", ""), V10 = c("", "", "", "", "As of 30/09/2019", "label1", "", "4", "1", "4", "8", "", ""), V11 = c("", "", "", "", "", "", "label2", "0.1", "0.3", "0.6", "0.22", "", ""), V12 = c("", "", "", "", "", "label4", "label4a", "21", "23", "71", "27", "", ""), V13 = c("", "", "", "", "", "", "label4b", "0.3", "0.1", "0.4", "0.2", "", ""), V14 = c("", "", "", "", "As of 31/12/2019", "label1", "", "8", "6", "9", "9", "", ""), V15 = c("", "", "", "", "", "", "label2", "0.7", "0.87", "0.55", "0.33", "", ""), V16 = c("", "", "", "", "", "label4", "label4a", "24", "25", "99", "35", "", ""), V17 = c("", "", "", "", "", "", "label4b", "0.3", "0.1", "0.4", "0.2", "", "")), class = "data.frame", row.names = c(NA, -13L))

and like this in the six months blocks,

dta_wide2 <- structure(list(V1 = c("", "", "", "", "", "", "", "peach", "dragonfruit", "honeydew", "huckleberry", "", ""), V2 = c("ABC", "some info", "Store A", "", "As of 31/03/2020", "label1", "", "2", "3", "4", "8", "(a) some useless clutter", "(b) more not relevent information"), V3 = c("", "", "", "", "", "", "label2", "0.1", "0.2", "0.3", "0.8", "", ""), V4 = c("", "", "", "", "", "label4", "label4a", "10", "11", "12", "9", "", ""), V5 = c("", "", "", "", "", "", "label4b", "0.3", "0.1", "0.4", "0.2", "", ""), V6 = c("", "", "", "", "As of 30/06/2020", "label1", "", "4", "6", "8", "16", "", ""), V7 = c("", "", "", "", "", "", "label2", "0.22", "0.33", "0.44", "0.55", "", ""), V8 = c("", "", "", "", "", "label4", "label4a", "11", "12", "13", "10", "", ""), V9 = c("", "", "", "", "", "", "label4b", "0.4", "0.1", "0.3", "0.2", "", "")), class = "data.frame", row.names = c(NA, -13L))

i.e. (for the six months block)

# install.packages(c("tidyverse"), dependencies = TRUE)
library(tidyverse)
dta_wide2 %>% as_tibble
# A tibble: 13 x 9
V1       V2             V3     V4      V5     V6       V7    V8     V9    
<chr>    <chr>          <chr>  <chr>   <chr>  <chr>    <chr> <chr>  <chr> 
   1 ""       "ABC"          ""     ""      ""     ""       ""    ""     ""    
2 ""       "some info"    ""     ""      ""     ""       ""    ""     ""    
3 ""       "Store A"      ""     ""      ""     ""       ""    ""     ""    
4 ""       ""             ""     ""      ""     ""       ""    ""     ""    
5 ""       "As of 31/03/~ ""     ""      ""     "As of ~ ""    ""     ""    
6 ""       "label1"       ""     "label~ ""     "label1" ""    "labe~ ""    
7 ""       ""             "labe~ "label~ "labe~ ""       "lab~ "labe~ "labe~
8 "peach"  "2"            "0.1"  "10"    "0.3"  "4"      "0.2~ "11"   "0.4" 
9 "dragon~ "3"            "0.2"  "11"    "0.1"  "6"      "0.3~ "12"   "0.1" 
10 "honeyd~ "4"            "0.3"  "12"    "0.4"  "8"      "0.4~ "13"   "0.3" 
11 "huckle~ "8"            "0.8"  "9"     "0.2"  "16"     "0.5~ "10"   "0.2" 
12 ""       "(a) some use~ ""     ""      ""     ""       ""    ""     ""    
13 ""       "(b) more not~ ""     ""      ""     ""       ""    ""     ""    

in dta_wide2 the date keys are floting around like this

> dta_wide2[5,] %>% str_sub(start= -10) %>% lubridate::dmy()
[1] NA           "2020-03-31" NA           NA           NA          
[6] "2020-06-30" NA           NA           NA      

so I try to tidy it up like this

dta_wide2 %>% 
   add_column(date1 = dta_wide2[5,2] %>% str_sub(start= -10) %>% lubridate::dmy(), .before = 2)  %>% 
   add_column(date2 = dta_wide2[5,6] %>% str_sub(start= -10) %>% lubridate::dmy(), .before = 6) %>% 
   add_column(store = dta_wide2[3,2], .before = 2) %>% as_tibble

# A tibble: 13 x 12
   V1    store date1      V2    V3    V4    date2      V5    V6    V7   
   <chr> <chr> <date>     <chr> <chr> <chr> <date>     <chr> <chr> <chr>
 1 ""    Stor~ 2020-03-31 "ABC" ""    ""    2020-06-30 ""    ""    ""   
 2 ""    Stor~ 2020-03-31 "som~ ""    ""    2020-06-30 ""    ""    ""   
 3 ""    Stor~ 2020-03-31 "Sto~ ""    ""    2020-06-30 ""    ""    ""   
 4 ""    Stor~ 2020-03-31 ""    ""    ""    2020-06-30 ""    ""    ""   
 5 ""    Stor~ 2020-03-31 "As ~ ""    ""    2020-06-30 ""    "As ~ ""   
 6 ""    Stor~ 2020-03-31 "lab~ ""    "lab~ 2020-06-30 ""    "lab~ ""   
 7 ""    Stor~ 2020-03-31 ""    "lab~ "lab~ 2020-06-30 "lab~ ""    "lab~
 8 "pea~ Stor~ 2020-03-31 "2"   "0.1" "10"  2020-06-30 "0.3" "4"   "0.2~
 9 "dra~ Stor~ 2020-03-31 "3"   "0.2" "11"  2020-06-30 "0.1" "6"   "0.3~
10 "hon~ Stor~ 2020-03-31 "4"   "0.3" "12"  2020-06-30 "0.4" "8"   "0.4~
11 "huc~ Stor~ 2020-03-31 "8"   "0.8" "9"   2020-06-30 "0.2" "16"  "0.5~
12 ""    Stor~ 2020-03-31 "(a)~ ""    ""    2020-06-30 ""    ""    ""   
13 ""    Stor~ 2020-03-31 "(b)~ ""    ""    2020-06-30 ""    ""    ""   
# ... with 2 more variables: V8 <chr>, V9 <chr>

Now, I need to pivot it longer using, if I get it corret, pivot_longer, however my challenge is how -- when I also get data that looks like dta_wide1, i.e. with four quarters -- do I do it in a flexible way that I can use for both dta_wide1 and dta_wide2.

I've been working on this for some time and any help to make it work, simplyfy or clean it up will be very much appriciated.

Here's where I'm currently at, but it’s not right, not flexible, and not simply

dta_wide2_foo <- dta_wide2
names(dta_wide2_foo) <- c('goods', paste0(dta_wide2[6,2:5], dta_wide2[7,2:5], sep = '_1'), paste0(dta_wide2[6,2:5], dta_wide2[7,2:5], sep = '_2'))
dta_wide2_foo %>% 
   add_column(date1 = dta_wide2[5,2] %>% str_sub(start= -10) %>% lubridate::dmy(), .before = 2)  %>% 
   add_column(date2 = dta_wide2[5,6] %>% str_sub(start= -10) %>% lubridate::dmy(), .before = 6) %>% 
   add_column(store = dta_wide2[3,2], .before = 2) %>% as_tibble %>% .[8:11,]  %>%
   pivot_longer(-c(goods, store, date1, date2), values_to = "Value", names_to = "variable") %>% print(n = 100)

Or, some generic snippet, that is not neither simple, smart, or clean, but it could be used to get the positions of the dates in both sample data in a loop

dta <- dta_wide2
dta[5,] %>% str_sub(start= -10) %>% lubridate::dmy() %>% { which(!is.na(.)) }
[1] 2 6

Or, cleaner,

dta <- dta_wide1
dta[5,] %>% grep("As ",.)
[1]  2  6 10 14

update 2020-06-08 07:45:18Z

My goal is to combine the long data sets to able to plot the data, (Wimpel suggest below that I combine my varying wide data sets, i.e. dta_wide1, dta_wide2, ... dta_widen, using a lapply() call) I imagine data that looks something like this,

> dta_long
# A tibble: 96 x 5
   product label   value date       store  
   <chr>   <chr>   <dbl> <date>     <chr>  
 1 peach   label1   7    2019-03-31 Store A
 2 peach   label2   0.5  2019-03-31 Store A
 3 peach   label4a 21    2019-03-31 Store A
 4 peach   label4b  0.3  2019-03-31 Store A
 5 peach   label1   5    2019-06-30 Store A
 6 peach   label2   0.46 2019-06-30 Store A
 7 peach   label4a 19    2019-06-30 Store A
 8 peach   label4b  0.4  2019-06-30 Store A
 9 peach   label1   4    2019-09-30 Store A
10 peach   label2   0.1  2019-09-30 Store A
# ... with 86 more rows

and then /plotting the date with something like this,

dta_long %>% filter(label == 'label1') %>% ggplot(aes(date, value, colour = product)) + 
geom_line() + scale_x_date(date_breaks = "3 months", 
date_labels = "%b-%y", limits = c((min(dta_long$date)-34), max = max(dta_long$date)))

enter image description here

Eric Fail
  • 8,191
  • 8
  • 72
  • 128
  • This looks like the type of problem that you will save time by doing manually rather than finding a programmatic way of converting it unfortunately. How are you reading this data into R? It may be easier to import it differently rather than converting it. – Daniel O Jun 02 '20 at 14:11
  • I'm importing it from Excel's Binary (.xlsb) Workbooks using `read_xlsb` from the readxlsb package. I've imagined I could get the indices/positions with the dates and then using that to add the dates and from there use the dates and keys to pivot the data. Something like that. – Eric Fail Jun 02 '20 at 20:44
  • Maybe you can have a look at `tidyxl` and `unipivotr` to find an easier way of extracting your data. – ricoderks Jun 03 '20 at 09:36
  • I wasn’t aware of [tag:unipivotr]. Thanks. I will take a look at it. Unfortunately [tag:tidyxl] does not support the binary file format `.xlsb`, according to it [cran page][(https://cran.r-project.org/web/packages/tidyxl/vignettes/tidyxl.html). – Eric Fail Jun 03 '20 at 10:09

2 Answers2

1

I think that the following will work for your issue. I make the following assumptions:

  1. Groups in each sheet are grouped according to date and that each group contains four variables
  2. That the values you want to represent are always in rows 8-11 of the spreadsheet
  3. A sheet represents a single store and the store name will in the third row and second column of the spreadsheet
dta_wide1 <- structure(list(V1 = c("", "", "", "", "", "", "", "peach", "dragonfruit", "honeydew", "huckleberry", "", ""), V2 = c("ABC", "some info", "Store A", "", "As of 31/03/2019", "label1", "", "7", "5", "6", "1", "(a) some useless clutter", "(b) more not relevent information"), V3 = c("", "", "", "", "", "", "label2", "0.5", "0.4", "0.8", "0.3", "", ""), V4 = c("", "", "", "", "", "label4", "label4a", "21", "21", "87", "21", "", ""), V5 = c("", "", "", "", "", "", "label4b", "0.3", "0.1", "0.4", "0.2", "", ""), V6 = c("", "", "", "", "As of 30/06/2019", "label1", "", "5", "2", "3", "7", "", ""), V7 = c("", "", "", "", "", "", "label2", "0.46", "0.72", "0.7", "0.8", "", ""), V8 = c("", "", "", "", "", "label4", "label4a", "19", "22", "85", "25", "", ""), V9 = c("", "", "", "", "", "", "label4b", "0.4", "0.1", "0.3", "0.2", "", ""), V10 = c("", "", "", "", "As of 30/09/2019", "label1", "", "4", "1", "4", "8", "", ""), V11 = c("", "", "", "", "", "", "label2", "0.1", "0.3", "0.6", "0.22", "", ""), V12 = c("", "", "", "", "", "label4", "label4a", "21", "23", "71", "27", "", ""), V13 = c("", "", "", "", "", "", "label4b", "0.3", "0.1", "0.4", "0.2", "", ""), V14 = c("", "", "", "", "As of 31/12/2019", "label1", "", "8", "6", "9", "9", "", ""), V15 = c("", "", "", "", "", "", "label2", "0.7", "0.87", "0.55", "0.33", "", ""), V16 = c("", "", "", "", "", "label4", "label4a", "24", "25", "99", "35", "", ""), V17 = c("", "", "", "", "", "", "label4b", "0.3", "0.1", "0.4", "0.2", "", "")), class = "data.frame", row.names = c(NA, -13L))


## Calculate the number of groups of data in the spreadsheet
d1grps <- (ncol(dta_wide1) - 1) / 4 # Divide by 4 due to assumption #1 above

dnew1 <- as_tibble(dta_wide1) %>%
    ## Take rows that contain data (see assumption #2 above)
    slice(8:11) %>% 
    mutate(
        ## Get dates from the original data frame and remove empty cells
        ## (need `unname()` or else this will overwrite variables)
        !!!as.data.frame(slice(dta_wide1, 5) %>% select(seq(2, ncol(.), 4)) %>% unname()),
        ## Get store from second column (see assumption #3 above)
        store = dta_wide1[3, 2] 
    ) %>%
    ## Create variable names for each variable by group
    setNames(
        c("product", paste0(
            c("label1_", "label2_", "label4a_", "label4b_"), 
            rep(1:d1grps, each = 4)
        ),
        paste0("date_", 1:d1grps), "store" 
    )) %>%
    pivot_longer(
        cols = !any_of(c("product", "store")),
        names_pattern = "(.+)_(.+)",
        names_to = c(".value", "group")
    ) %>%
    mutate(date = lubridate::dmy(sub("As of ", "", date)))

And the final tibble in long format:

# A tibble: 16 x 8
   product     store   group label1 label2 label4a label4b date      
   <chr>       <chr>   <chr> <chr>  <chr>  <chr>   <chr>   <date>    
 1 peach       Store A 1     7      0.5    21      0.3     2019-03-31
 2 peach       Store A 2     5      0.46   19      0.4     2019-06-30
 3 peach       Store A 3     4      0.1    21      0.3     2019-09-30
 4 peach       Store A 4     8      0.7    24      0.3     2019-12-31
 5 dragonfruit Store A 1     5      0.4    21      0.1     2019-03-31
 6 dragonfruit Store A 2     2      0.72   22      0.1     2019-06-30
 7 dragonfruit Store A 3     1      0.3    23      0.1     2019-09-30
 8 dragonfruit Store A 4     6      0.87   25      0.1     2019-12-31
 9 honeydew    Store A 1     6      0.8    87      0.4     2019-03-31
10 honeydew    Store A 2     3      0.7    85      0.3     2019-06-30
11 honeydew    Store A 3     4      0.6    71      0.4     2019-09-30
12 honeydew    Store A 4     9      0.55   99      0.4     2019-12-31
13 huckleberry Store A 1     1      0.3    21      0.2     2019-03-31
14 huckleberry Store A 2     7      0.8    25      0.2     2019-06-30
15 huckleberry Store A 3     8      0.22   27      0.2     2019-09-30
16 huckleberry Store A 4     9      0.33   35      0.2     2019-12-31
 

And the plot that you wanted:

ggplot(dnew1, aes(x = date, y = label2, color = product, group = product)) +
    geom_line()

The code should be amenable to an expanding number of dates in each spreadsheet, and you could write a function to programmatically find the number of groups rather than relying on changing the variable d1grps to set the number of groups.

Even though this should work, it might be less of a headache for you to write the contents into a text file and manipulate the text rather than use dplyr in this way.

mikebader
  • 1,075
  • 3
  • 12
0

I saves your two sample data-sets and stored them in separate .xlsb-files. data looks like this:

enter image description here

enter image description here

maybe this will help... the solution works for both samnple sets provided, so give it a go. The code is assuming that all data has the same format, so all info is always in the same rows, and the storename is always in the same column.

library( readxlsb )
library( cellranger )
library( tidyverse )
library( data.table )

#get filesnames to read
read.these.files <- list.files( path = "./temp/", 
                                pattern = ".*\\.xlsb",
                                full.names = TRUE,
                                recursive = FALSE )
#now read the data to a list, using lapply()
#  assuming the data needed is in the first sheet of the .xlsb-file
L <- lapply( read.these.files, readxlsb::read_xlsb, sheet = 1, range = cellranger::cell_limits() )
#now we can loop over the read in data in list 'L', and perform operations
L.dt <- lapply( L, function(x) {
  #get store_name
  store_name = x[2,2]
  #get the data
  df1 <- x[7:10,]
  #set the colmanes (=labels) right
  colnames <- x[5:6,]
  colnames[ colnames == "" ] <- NA
  names(df1) <- colnames %>% tidyr::fill( names(colnames) ) %>% slice(2)
  names(df1)[1] <- "product"
  #melt df1 to long format
  df1 <- df1 %>% tidyr::pivot_longer( cols = tidyselect::starts_with("label"), names_to = "label" )
  #set the dates right
  dates <- x[4, ]
  dates <- dates %>% tidyr::pivot_longer( cols = tidyselect::everything())
  dates[ dates == "" ] <- NA
  dates <- tidyr::fill( dates, value ) %>% dplyr::slice(2:n() )
  #add the dates and storename and tidy the .copy column
  df1 <- df1 %>% 
    dplyr::mutate( date  = rep( dates$value, nrow(df1) / length( dates$value) ),
            store = store_name ) %>%
    dplyr::select( -.copy )
})
#create a names list, based on the sourecefile-names
names(L.dt) <- basename( read.these.files )
#now, bind the list of alterend data together into one _long_ data set
L.dt_tbl <- bind_rows(L.dt, .id = 'id')
L.dt_tbl %>% dplyr::mutate(date = str_sub(date, start= -10)  %>%
             lubridate::dmy() ) -> L.dt_tbl
'

Convert value to type double,

dta_long <- type_convert(L.dt_tbl, cols(
  `Type of NPE` = col_character(),
  `What NPE` = col_character(),
  value = col_double(),
  institut = col_character()
))

Final data,

dta_long
# A tibble: 96 x 6
   id             product label   value date       store  
   <chr>          <chr>   <chr>   <dbl> <date>     <chr>  
 1 dta_wide1.xlsb peach   label1   7    2019-03-31 Store A
 2 dta_wide1.xlsb peach   label2   0.5  2019-03-31 Store A
 3 dta_wide1.xlsb peach   label4a 21    2019-03-31 Store A
 4 dta_wide1.xlsb peach   label4b  0.3  2019-03-31 Store A
 5 dta_wide1.xlsb peach   label1   5    2019-06-30 Store A
 6 dta_wide1.xlsb peach   label2   0.46 2019-06-30 Store A
 7 dta_wide1.xlsb peach   label4a 19    2019-06-30 Store A
 8 dta_wide1.xlsb peach   label4b  0.4  2019-06-30 Store A
 9 dta_wide1.xlsb peach   label1   4    2019-09-30 Store A
10 dta_wide1.xlsb peach   label2   0.1  2019-09-30 Store A
# ... with 86 more rows
Eric Fail
  • 8,191
  • 8
  • 72
  • 128
Wimpel
  • 26,031
  • 1
  • 20
  • 37
  • Thank you for responding to my question. I ran your code. I get an error when I run your code saying that I _cannot subset columns that don't exist_, as the column `.copy` doesn't exist. However, I realize I should have made it clearer that I am looking for a _flexible solution that can be used in a loop_. Hence the two data sets in the reproducible example. – Eric Fail Jun 05 '20 at 20:52
  • I imagine I would need to get the positions of the _dates_ in row 5 in the data provided; `dta_wide1[5, ]` and `dta_wide2[5, ]`. Then build from there; count the number of quarters in that given block based, inset the dates in a column, and so forth. Also, as I mention, I am looking for a solution that can help me persuade my research group to use R, hence my bounty and the point about a solution that is _simple/smart/clean_. – Eric Fail Jun 05 '20 at 20:52
  • import in a loop? Why not add the filenames into a vector/list ( ussing `list.files()`, and 'loop' it with `lapply()` – Wimpel Jun 06 '20 at 08:49
  • It doesn't have to be a loop per se. Main thing is that the _the call is flexible_ and able to handle _data with varying width_ that needs to pivoted _from wide to long_. I guess a `lapply()` call would work too. – Eric Fail Jun 06 '20 at 09:09
  • The code above handles date with different width in your sample date (i.e. 2 and 4 quarters) It can hande up to n quarters, as long as the 'blocks' of quarters are the same format... – Wimpel Jun 06 '20 at 09:27
  • I see. Thanks. I still get an error that I cannot subset columns that don't exist. Don't you get that? Could possible demonstrate what you mean by _adding the filenames into a vector/list (using `list.files()`, and 'loop' it with `lapply()`_? – Eric Fail Jun 07 '20 at 11:20
  • updated code with some sample-*.xlsb-files... and rasding with `lapply()` – Wimpel Jun 08 '20 at 09:17
  • I do not get any error when rinning the code, but I do get a warning -messae `Warning messages: 1: Duplicate column names detected, adding .copy variable `.. you can suppress these warnings if you want with `suppressWarnings()` – Wimpel Jun 08 '20 at 09:22
  • Very interesting. Thanks a lot! Very illustrative with the `lapply()` call. I am currently working to reduce the number of packages, i.e. rewrite the [tag:data.table] to [tag:tidyverse]. (my research group gets very reluctant whenever I introduce new packages). I think I fixed the `.copy` error. I appreciate you keep developing your answer! – Eric Fail Jun 08 '20 at 14:47