1

I have a long df that looks like this:

df <- data.frame(id=as.integer(c(123,123,123,124,124,124,125,125,126,126,126)),
                 date=as.Date(c("2014-03-12", "2015-05-02", "2015-09-16", "2015-10-24", "2016-12-11", "2017-10-17", "2017-08-06", "2018-01-29", "2015-09-16", "2015-11-12", "2015-12-03")),
                 event=as.character(c("A", "C", "E", "A", "B", "D", "A", "E", "A", "B", "E")),
                 order=as.integer(c(1,2,3,1,2,3,1,2,1,2,3)),
                 diff=as.integer(c(0,416,553,0,414,724,0,176,0,57,78)))

df
    id       date event order diff
1  123 2014-03-12     A     1    0
2  123 2015-05-02     C     2  416
3  123 2015-09-16     E     3  553
4  124 2015-10-24     A     1    0
5  124 2016-12-11     B     2  414
6  124 2017-10-17     D     3  724
7  125 2017-08-06     A     1    0
8  125 2018-01-29     E     2  176
9  126 2015-09-16     A     1    0
10 126 2015-11-12     B     2   57
11 126 2015-12-03     E     3   78

Each id will always have an initial event A and a final event, either D or E (mutually exclusive). Events B and C might or might not occur. diff is the difference in days between the date of each event and the date of the initial event A per id.

I want to obtain a wide df where each event will be a column (e.g. A_status) where 0= absent; 1= present. Similarly, each correspondent diff will be a column (e.g. A_time). But when event B or C is absent (e.g B= 0 or C= 0), I want their time to be filled with either D_time or E_time, whichever is present.

I need to create two columns based on the values of D and E:

  1. a column D.E_status, where 0=D; 1=E, and
  2. a column D.E_time that will receive whichever time is recorded (of D or E).

This is the desired output:

id  A_status A_time B_status B_time C_status C_time D.E_status  D.E_time
123 1        0      0        553    1        416    1           553
124 1        0      1        414    0        724    0           724
125 1        0      0        176    0        176    1           176
126 1        0      1        57     0        78     1           78

I really appreciate your help on this given my very basic R skills.

Nao
  • 333
  • 2
  • 11
  • a few of questions- 1. why isn't row 2's D.E_status 1? 2. what's the point of A_status and A_time and D.E status if each id always has an initial event and a final event, and the initial event always starts at 0? 3. why are the missing values 0, and not NA? – Mark Sep 01 '23 at 12:19

2 Answers2

3

I'll demonstrate a dplyr/tidyr solution. Up front, I think your last value for B_time should be 57 not 78.

library(dplyr)
library(tidyr) # pivot_wider
df %>%
  mutate(
    status = if_else(event == "D", 0, 1),
    event = if_else(event %in% c("D", "E"), "D.E", event)
  ) %>%
  pivot_wider(
    id_cols = "id",
    names_from = "event", values_from = c("diff", "status")
  ) %>%
  rename_with(.fn = ~ sub("diff", "time", sub("(.*)_(.*)", "\\2_\\1", .))) %>%
  mutate(
    across(c(C_time, B_time), ~ coalesce(., D.E_time)),
    across(c(C_status, B_status), ~ +(!is.na(.)))
  ) %>%
  select(order(colnames(.))) %>%
  relocate(id)
# # A tibble: 4 × 9
#      id A_status A_time B_status B_time C_status C_time D.E_status D.E_time
#   <int>    <dbl>  <int>    <int>  <int>    <int>  <int>      <dbl>    <int>
# 1   123        1      0        0    553        1    416          1      553
# 2   124        1      0        1    414        0    724          0      724
# 3   125        1      0        0    176        0    176          1      176
# 4   126        1      0        1     57        0     78          1       78
r2evans
  • 141,215
  • 6
  • 77
  • 149
  • You are right, just replaced the 78 by 57 in the last value for B_time. Thanks so much! Your solution is very neat and worked perfectly. Is there any way the columns *status and *time could be side by side according to the type of event? – Nao Sep 01 '23 at 13:10
  • 1
    Yes, standard dplyr: add `%>% select(order(colnames(.))) %>% relocate(id)` – r2evans Sep 01 '23 at 13:14
  • 1
    @r2evans - use `names_vary = "slowest"` in `pivot_wider()` to alternate columns. – Ritchie Sacramento Sep 01 '23 at 13:18
  • 1
    @RitchieSacramento, thank you! I haven't incorporated `names_vary` into my habits yet. In this example, though, it still leaves `B_*` at the end of the columns since `B` is the last of the `event`s to be observed (after grouping D/E together). – r2evans Sep 01 '23 at 13:32
  • 1
    @r2evans - right, in that case you also need to add `names_sort = TRUE` – Ritchie Sacramento Sep 01 '23 at 13:36
0

Here is another option, though it seems I am a bit late with my answer:

df %>% 
  mutate(D.E_status = ifelse(event == "D", 0, ifelse(event == "E", 1, NA)),
         event = ifelse(event %in% c("D", "E"), "D.E_time", event),
         ) %>% 
  select(id, event, diff, D.E_status) %>% 
  pivot_wider(id_cols = c("id", "D.E_status") , 
              names_from = event, 
              values_from = diff, names_sort = T) %>% 
  mutate(A_status = ifelse(is.na(A), NA, 1),
         B_status = ifelse(is.na(B), NA, 1),
         C_status = ifelse(is.na(C), NA, 1)) %>% 
  group_by(id) %>% 
  summarise(across(everything(), 
                   function(x) do.call(coalesce, as.list(c(x)))
    )) %>% 
  transmute(id, 
            A_status, 
            A_time = A,
            B_status, 
            B_time = ifelse(is.na(B), D.E_time, B),
            C_status, 
            C_time = ifelse(is.na(C), D.E_time, C),
            D.E_status,
            D.E_time) %>% 
  replace_na(list(A_status = 0, B_status = 0, C_status = 0)) 

# A tibble: 4 × 9
     id A_status A_time B_status B_time C_status C_time D.E_status D.E_time
  <int>    <dbl>  <int>    <dbl>  <int>    <dbl>  <int>      <dbl>    <int>
1   123        1      0        0    553        1    416          1      553
2   124        1      0        1    414        0    724          0      724
3   125        1      0        0    176        0    176          1      176
4   126        1      0        1     57        0     78          1       78
asd-tm
  • 3,381
  • 2
  • 24
  • 41