I want to reshape data from a long to wide format. I have an id
variable and for each id
10 different levels of x
. Additionally, there is further information y
and the date variables start
and stop
, which indicate when x
started and stopped. Here is code and data preview:
library(tidyverse)
set.seed(123)
id <- rep(1:10, each = 10)
x <- rep(LETTERS[1:10], 10)
y <- sample(letters[11:26], 100, replace = T)
start <- sample(seq(as.Date('2020-01-01'), as.Date('2021-01-01'), by="day"), 100, replace = T)
stop <- sample(seq(as.Date('2021-01-01'), as.Date('2022-01-01'), by="day"), 100, replace = T)
NA_index <- sample(seq(1:100), 60)
# replace dates with NA
start[NA_index] <- NA
stop[NA_index] <- NA
df <- data.frame(id, x, y, start, stop)
head(df, 13)
id x y start stop
1 1 A y 2020-11-03 2021-10-19
2 1 B y <NA> <NA>
3 1 C m 2020-03-29 2021-02-02
4 1 D x 2020-01-25 2021-02-09
5 1 E m 2020-10-17 2021-01-10
6 1 F t 2020-10-12 2021-07-19
7 1 G l 2020-06-07 2021-05-05
8 1 H p <NA> <NA>
9 1 I u <NA> <NA>
10 1 J o <NA> <NA>
11 2 A n <NA> <NA>
12 2 B x <NA> <NA>
13 2 C p <NA> <NA>
I want to reduce the data to 10 rows with 10 unique ids, where each id stores all its information in one row. The output should be shaped like the following preview:
id x y start stop x_id x2 y2 start2 stop2 x3 y3
1 1 A y 2020-11-03 2021-10-19 1 B y NA NA C m
2 2 A n NA NA 1 B x NA NA C p
3 3 A s NA NA 1 B v NA NA C s
4 4 A s NA NA 1 B x NA NA C m
5 5 A y NA NA 1 B z NA NA C t
6 6 A v 2020-10-14 2021-09-02 1 B o 2020-10-03 2021-06-03 C q
I already have a solution, which worked but this way seems not very efficient. Here, I created a sequence along id
, mutated new columns and stored all the information in the first row of each id
group. Afterwards I filtered for the first row of each id
group and get the desired output. Here is the code:
df_wide <- df %>%
group_by(id)%>%
mutate(x_id = seq_along(id)) %>%
mutate(
x2 = lead(x),
y2 = lead(y),
start2 = lead(start), stop2 = lead(stop),
x3 = lead(x, n = 2),
y3 = lead(y, n = 2),
start3 = lead(start, n = 2), stop3 = lead(stop, n = 2),
x4 = lead(x, n = 3),
y4 = lead(y, n = 3),
start4 = lead(start, n = 3), stop4 = lead(stop, n = 3),
x5 = lead(x, n = 4),
y5 = lead(y, n = 4),
start5 = lead(start, n = 4), stop5 = lead(stop, n = 4),
x6 = lead(x, n = 5),
y6 = lead(y, n = 5),
start6 = lead(start, n = 5), stop6 = lead(stop, n = 5),
x7 = lead(x, n = 6),
y7 = lead(y, n = 6),
start7 = lead(start, n = 6), stop7 = lead(stop, n = 6),
x8 = lead(x, n = 7),
y8 = lead(y, n = 7),
start8 = lead(start, n = 7), stop8 = lead(stop, n = 7),#
x9 = lead(x, n = 8),
y9 = lead(y, n = 8),
start9 = lead(start, n = 8), stop9 = lead(stop, n = 8),
x10 = lead(x, n = 9),
y10 = lead(y, n = 9),
start10 = lead(start, n = 9), stop10 = lead(stop, n = 9)
) %>%
filter(x_id == 1)
As above-mentioned I don't think that this is a good way to handle this problem but worked at least for me. I also tried a lot with tidyr::pivot_wider
but didn't get the desired output data.
Does anybody have an idea how to reshape the data but less long-winded like my approach?
I would be grateful.
Best Florian