0

I want to reshape data from a long to wide format. I have an id variable and for each id 10 different levels of x. Additionally, there is further information y and the date variables start and stop, which indicate when x started and stopped. Here is code and data preview:

library(tidyverse)

set.seed(123)
id <- rep(1:10, each = 10)
x <- rep(LETTERS[1:10], 10)
y <- sample(letters[11:26], 100, replace = T)
start <- sample(seq(as.Date('2020-01-01'), as.Date('2021-01-01'), by="day"), 100, replace = T)
stop <- sample(seq(as.Date('2021-01-01'), as.Date('2022-01-01'), by="day"), 100, replace = T)
NA_index <- sample(seq(1:100), 60)

# replace dates with NA
start[NA_index] <- NA
stop[NA_index] <- NA

df <- data.frame(id, x, y, start, stop)
head(df, 13)
   id x y      start       stop

1   1 A y 2020-11-03 2021-10-19
2   1 B y       <NA>       <NA>
3   1 C m 2020-03-29 2021-02-02
4   1 D x 2020-01-25 2021-02-09
5   1 E m 2020-10-17 2021-01-10
6   1 F t 2020-10-12 2021-07-19
7   1 G l 2020-06-07 2021-05-05
8   1 H p       <NA>       <NA>
9   1 I u       <NA>       <NA>
10  1 J o       <NA>       <NA>
11  2 A n       <NA>       <NA>
12  2 B x       <NA>       <NA>
13  2 C p       <NA>       <NA>

I want to reduce the data to 10 rows with 10 unique ids, where each id stores all its information in one row. The output should be shaped like the following preview:

   id x     y     start      stop        x_id x2    y2    start2     stop2      x3    y3 

1     1 A     y     2020-11-03 2021-10-19     1 B     y     NA         NA         C     m    
2     2 A     n     NA         NA             1 B     x     NA         NA         C     p    
3     3 A     s     NA         NA             1 B     v     NA         NA         C     s    
4     4 A     s     NA         NA             1 B     x     NA         NA         C     m    
5     5 A     y     NA         NA             1 B     z     NA         NA         C     t    
6     6 A     v     2020-10-14 2021-09-02     1 B     o     2020-10-03 2021-06-03 C     q    

I already have a solution, which worked but this way seems not very efficient. Here, I created a sequence along id, mutated new columns and stored all the information in the first row of each id group. Afterwards I filtered for the first row of each id group and get the desired output. Here is the code:

df_wide <- df %>% 
  group_by(id)%>%
  mutate(x_id = seq_along(id)) %>%
  mutate(
    x2 = lead(x), 
    y2 = lead(y), 
    start2 = lead(start), stop2 = lead(stop),
    x3 = lead(x, n = 2), 
    y3 = lead(y, n = 2), 
    start3 = lead(start, n = 2), stop3 = lead(stop, n = 2),
    x4 = lead(x, n = 3), 
    y4 = lead(y, n = 3), 
    start4 = lead(start, n = 3), stop4 = lead(stop, n = 3),
    x5 = lead(x, n = 4), 
    y5 = lead(y, n = 4),
    start5 = lead(start, n = 4), stop5 = lead(stop, n = 4),
    x6 = lead(x, n = 5), 
    y6 = lead(y, n = 5), 
    start6 = lead(start, n = 5), stop6 = lead(stop, n = 5),
    x7 = lead(x, n = 6),
    y7 = lead(y, n = 6),
    start7 = lead(start, n = 6), stop7 = lead(stop, n = 6),
    x8 = lead(x, n = 7),
    y8 = lead(y, n = 7),
    start8 = lead(start, n = 7), stop8 = lead(stop, n = 7),#
    x9 = lead(x, n = 8),
    y9 = lead(y, n = 8),
    start9 = lead(start, n = 8), stop9 = lead(stop, n = 8),
    x10 = lead(x, n = 9),
    y10 = lead(y, n = 9),
    start10 = lead(start, n = 9), stop10 = lead(stop, n = 9)
  ) %>%
  filter(x_id == 1)

As above-mentioned I don't think that this is a good way to handle this problem but worked at least for me. I also tried a lot with tidyr::pivot_wider but didn't get the desired output data.
Does anybody have an idea how to reshape the data but less long-winded like my approach?

I would be grateful.

Best Florian

fbeese
  • 118
  • 8

2 Answers2

1

Create a row number column for each id and reshape the data to wide format.

library(dplyr)
library(tidyr)

df %>%
  group_by(id) %>%
  mutate(col = row_number()) %>%
  ungroup %>%
  pivot_wider(names_from = col, values_from = x:stop)

# A tibble: 10 x 41
#      id x_1   x_2   x_3   x_4   x_5   x_6   x_7   x_8   x_9   x_10 
#   <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
# 1     1 A     B     C     D     E     F     G     H     I     J    
# 2     2 A     B     C     D     E     F     G     H     I     J    
# 3     3 A     B     C     D     E     F     G     H     I     J    
# 4     4 A     B     C     D     E     F     G     H     I     J    
# 5     5 A     B     C     D     E     F     G     H     I     J    
# 6     6 A     B     C     D     E     F     G     H     I     J    
# 7     7 A     B     C     D     E     F     G     H     I     J    
# 8     8 A     B     C     D     E     F     G     H     I     J    
# 9     9 A     B     C     D     E     F     G     H     I     J    
#10    10 A     B     C     D     E     F     G     H     I     J    
# … with 30 more variables: y_1 <chr>, y_2 <chr>, y_3 <chr>,
#   y_4 <chr>, y_5 <chr>, y_6 <chr>, y_7 <chr>, y_8 <chr>, y_9 <chr>,
#   y_10 <chr>, start_1 <date>, start_2 <date>, start_3 <date>,
#   start_4 <date>, start_5 <date>, start_6 <date>, start_7 <date>,
#   start_8 <date>, start_9 <date>, start_10 <date>, stop_1 <date>,
#   stop_2 <date>, stop_3 <date>, stop_4 <date>, stop_5 <date>,
#   stop_6 <date>, stop_7 <date>, stop_8 <date>, stop_9 <date>,
#   stop_10 <date>
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
  • Thanks Ronak, this seems straightforward. To define the row numbers for each `id` I used `mutate = seq_along(id)` in my code. It gives the same output as with `mutate = row_number()`. Do you have an idea, how I can define the order of the output columns? Ideally, it should have the order `id` - `x_1` - `y_1` - `start_1` - `stop1` - `x_2` and so on. Of course, there is the opportunity to use `select(id, x_1, y_1 ...)` after a further pipe but is there an easier way without typing each variable out? – fbeese Jun 11 '21 at 07:56
  • @fbeese, actually arrangement of columns this way is still an [open issue on github](https://github.com/tidyverse/tidyr/issues/839) – AnilGoyal Jun 11 '21 at 08:27
  • @fbeese Yes, there is not an easy way to do this. Having said that here is an ugly workaround `%>% select(id, order(readr::parse_number(names(.))[-1]) + 1)`. – Ronak Shah Jun 11 '21 at 10:20
1

Naming the columns took most of the time:

library(tidyverse)

set.seed(123)
id <- rep(1:10, each = 10)
x <- rep(LETTERS[1:10], 10)
y <- sample(letters[11:26], 100, replace = T)
start <- sample(seq(as.Date('2020-01-01'), as.Date('2021-01-01'), by="day"), 100, replace = T)
stop <- sample(seq(as.Date('2021-01-01'), as.Date('2022-01-01'), by="day"), 100, replace = T)
NA_index <- sample(seq(1:100), 60)

# replace dates with NA
start[NA_index] <- NA
stop[NA_index] <- NA

df <- data.frame(id, x, y, start, stop)

#setting the names of the columns        
col_names <- 
    colnames(df) %>%
    map(~ map2(., 1:length(unique(df$id)), ~paste0(.x, '_', .y) )) %>%
    transpose() %>%
    map(~ reduce(.x, cbind)) 


df_wide <- group_split(df, x) %>%
    map2(col_names, ~ set_names(.x, .y)) %>%
    reduce(cbind) 

head(as_tibble(df_wide))
#> # A tibble: 6 x 50
#>    id_1 x_1   y_1   start_1    stop_1      id_2 x_2   y_2   start_2   
#>   <int> <chr> <chr> <date>     <date>     <int> <chr> <chr> <date>    
#> 1     1 A     y     2020-11-03 2021-10-19     1 B     y     NA        
#> 2     2 A     n     NA         NA             2 B     x     NA        
#> 3     3 A     s     NA         NA             3 B     v     NA        
#> 4     4 A     s     NA         NA             4 B     x     NA        
#> 5     5 A     y     NA         NA             5 B     z     NA        
#> 6     6 A     v     2020-10-14 2021-09-02     6 B     o     2020-10-03
#> # … with 41 more variables: stop_2 <date>, id_3 <int>, x_3 <chr>, y_3 <chr>,
#> #   start_3 <date>, stop_3 <date>, id_4 <int>, x_4 <chr>, y_4 <chr>,
#> #   start_4 <date>, stop_4 <date>, id_5 <int>, x_5 <chr>, y_5 <chr>,
#> #   start_5 <date>, stop_5 <date>, id_6 <int>, x_6 <chr>, y_6 <chr>,
#> #   start_6 <date>, stop_6 <date>, id_7 <int>, x_7 <chr>, y_7 <chr>,
#> #   start_7 <date>, stop_7 <date>, id_8 <int>, x_8 <chr>, y_8 <chr>,
#> #   start_8 <date>, stop_8 <date>, id_9 <int>, x_9 <chr>, y_9 <chr>,
#> #   start_9 <date>, stop_9 <date>, id_10 <int>, x_10 <chr>, y_10 <chr>,
#> #   start_10 <date>, stop_10 <date>

head(as_tibble(df_wide[, 11:16]))
#> # A tibble: 6 x 6
#>    id_3 x_3   y_3   start_3    stop_3      id_4
#>   <int> <chr> <chr> <date>     <date>     <int>
#> 1     1 C     m     2020-03-29 2021-02-02     1
#> 2     2 C     p     NA         NA             2
#> 3     3 C     s     NA         NA             3
#> 4     4 C     m     NA         NA             4
#> 5     5 C     t     NA         NA             5
#> 6     6 C     q     2020-02-24 2021-02-10     6

Created on 2021-06-11 by the reprex package (v2.0.0)

jpdugo17
  • 6,816
  • 2
  • 11
  • 23
  • Thanks jpdugo17, this creates the output as desired. To be honest, I'm not pretty familiar with the `map` function (whereas it would be pretty helpful to be familiar with it). I just ran your code and it worked but I have problems to understand, what exactly happens. In the first code block we create a list with the columns names. When I read the output of `col_names` there are terms like `out` and `elt` above each element in the list. What does it mean? In the next block we split the `df` into each group according to `x`, assign the `col_names` and then merge the split groups together. – fbeese Jun 11 '21 at 08:13