0

I need to reshape my data, to get it in a proper format for Survival Analysis.

My current Dataset looks like this:

Product_Number            Date         Status 
     A                  2018-01-01        0
     A                  2018-01-02        1
     A                  2018-01-03        0
     B                  2018-01-01        0
     B                  2018-01-02        0
     B                  2018-01-03        0
     B                  2018-01-04        1
     C                  2018-01-01        0
     C                  2018-01-02        0

I need to reshape my data, based on the columns Product_Number, Date and Status (I want to count the number of days, per product, until the status shift to a 1. If the status is 0, the proces should start over again).

So the data should look like this:

Product_Number    Number_of_Days    Status 
       A                2             1    #Two days til status = 1
       A                1             0    #One day, status = 0 (no end date yet) 
       B                4             1    #Four days til status = 1
       C                2             0    #Two days, status is still 0 (no end date yet)

What have I tried so far?

I ordered my data by ProductNumber and Date. I love the DPLYR way, so I used:

df <- df %>% group_by(Product_Number, Date)   # note: my data is now in the form as in the example above. 

Then I tried to use the diff() function, to see the differences in dates (count the number of days). But I was unable to "stop" the count, when status switched (from 0 to 1, and vice versa).

I hope that I clearly explained the problem. Please let me know if you need some additional information.

R overflow
  • 1,292
  • 2
  • 17
  • 37
  • `df %>% group_by(Product_Number,Status) %>% summarise(Number_of_Days = n())` – denis Feb 07 '19 at 11:54
  • Can you explain how you got you answer for product `A` in the example? You don't want to just count the sizes of the groups, there's some sort of windowing going on, but I don't understand quite what you're aiming for. – hoyland Feb 07 '19 at 11:56

2 Answers2

1

This might be what you're looking for, if I got your question right.

library(dplyr)

df %>%
  mutate(Number_of_Days=1) %>%
  select(-Date) %>%
  group_by(Product_Number, Status) %>%
  summarise_all(sum,na.rm=T)

  Product_Number Status Number_of_Days
1 A                   0              2
2 A                   1              1
3 B                   0              3
4 B                   1              1
5 C                   0              2
alex_555
  • 1,092
  • 1
  • 14
  • 27
1

You could do:

library(dplyr)

df %>%
  group_by(Product_Number) %>%
  mutate(Date = as.Date(Date),
         group = cumsum(coalesce(as.numeric(lag(Status) == 1 & Status == 0), 1))) %>%
  group_by(Product_Number, group) %>%
  mutate(Number_of_Days = (last(Date) - first(Date)) + 1) %>%
  slice(n()) %>% ungroup() %>%
  select(-group, -Date)

Output:

# A tibble: 4 x 3
  Product_Number Status Number_of_Days
  <chr>           <int> <time>        
1 A                   1 2             
2 A                   0 1             
3 B                   1 4             
4 C                   0 2     
arg0naut91
  • 14,574
  • 2
  • 17
  • 38