2

I have a df akin to df1 where I want to break out the rows so that the Hrs_Time_Worked column is in intervals of 4, shown in df2.

I have been using the following code, but it throws an error:

df2 = df1 %>%
 group_by(Row)%>%
 mutate(S=START_DATE_TIME,
        Hrs_Time_Worked=list((n<-c(rep(4,Hrs_Time_Worked%/%4),Hrs_Time_Worked%%4))[n!=0]))%>%
 unnest()%>%
 mutate(E=START_DATE_TIME+hours(cumsum(Hrs_Time_Worked)),
        S=E-hours(unlist(Hrs_Time_Worked)),
        START_DATE_TIME=(S),
        END_DATE_TIME=(E),
        S=NULL,E=NULL)

Error in mutate_impl(.data, dots) : Evaluation error: invalid class Period object: periods must have integer values.

The following are required:

All categorical data must remain the same on child rows (e.g., TIME_RPTG_CD stays the same on every child row)

If there is a remainder that is less than four, the remainder amount should be listed on the last line (e.g., df2; row 3)

If a child row starts or ends on the next date the date column should be updated accordingly (e.g., df2; row 2-3)

df1 (current)

   Row EMPLID TIME_RPTG_CD START_DATE_TIME     END_DATE_TIME       Hrs_Time_Worked
       <chr>  <chr>        <dttm>              <dttm>                        <dbl>
     1 X00007 REG          2014-07-03 16:00:00 2014-07-03 02:00:00            10.0

df2 (desired)

Row EMPLID TIME_RPTG_CD START_DATE_TIME     END_DATE_TIME       Hrs_Time_Worked
    <chr>  <chr>        <dttm>              <dttm>                        <dbl>
1   X00007 REG          2014-07-03 16:00:00 2014-07-03 20:00:00            4.0
2   X00007 REG          2014-07-03 20:00:00 2014-07-04 24:00:00            4.0
3   X00007 REG          2014-07-04 24:00:00 2014-07-04 02:00:00            2.0
samuelt
  • 215
  • 1
  • 2
  • 10
  • Please share the output of `dput(df1)` – Tung Jun 12 '18 at 00:59
  • Here is the solution that's worked: https://stackoverflow.com/questions/50690217/splitting-single-data-frame-row-into-multiple-rows-while-performing-calculation – samuelt Jul 17 '18 at 22:31

2 Answers2

1

One of the approach could be

library(dplyr)
library(tidyr)
library(lubridate)

df %>%
  rowwise() %>%
  mutate(START_DATE_TIME = paste(seq.POSIXt(START_DATE_TIME, END_DATE_TIME, by = "4 hour"), collapse = ",")) %>%
  separate_rows(START_DATE_TIME, sep = ",") %>%
  group_by(Row) %>%
  mutate(END_DATE_TIME   = ymd_hms(lead(START_DATE_TIME, order_by = Row, default = as.character(END_DATE_TIME))),
         START_DATE_TIME = ymd_hms(START_DATE_TIME),
         Hrs_Time_Worked = as.numeric(difftime(END_DATE_TIME, START_DATE_TIME, units = "hour"))) %>%
  filter(Hrs_Time_Worked > 0)

which gives

    Row EMPLID TIME_RPTG_CD START_DATE_TIME     END_DATE_TIME       Hrs_Time_Worked
1     1 X00007 REG          2014-07-03 16:00:00 2014-07-03 20:00:00            4.00
2     1 X00007 REG          2014-07-03 20:00:00 2014-07-04 00:00:00            4.00
3     1 X00007 REG          2014-07-04 00:00:00 2014-07-04 02:00:00            2.00


Sample data:

df <- structure(list(Row = 1L, EMPLID = "X00007", TIME_RPTG_CD = "REG", 
    START_DATE_TIME = structure(1404403200, tzone = "UTC", class = c("POSIXct", 
    "POSIXt")), END_DATE_TIME = structure(1404439200, tzone = "UTC", class = c("POSIXct", 
    "POSIXt")), Hrs_Time_Worked = 10), .Names = c("Row", "EMPLID", 
"TIME_RPTG_CD", "START_DATE_TIME", "END_DATE_TIME", "Hrs_Time_Worked"
), row.names = c(NA, -1L), class = "data.frame")

#  Row EMPLID TIME_RPTG_CD     START_DATE_TIME       END_DATE_TIME Hrs_Time_Worked
#1   1 X00007          REG 2014-07-03 16:00:00 2014-07-04 02:00:00              10
Prem
  • 11,775
  • 1
  • 19
  • 33
  • Thanks for the response. For both this solution and @Moody_Mudskipper I get an error message: "Error in mutate_impl(.data, dots) : Evaluation error: wrong sign in 'by' argument." Any advice on how to address this? – samuelt Jun 12 '18 at 17:55
  • I think there is some `END_DATE_TIME` which is less than `START_DATE_TIME` in your data which could potentially be the culprit behind this error message. You can test it using `sum(difftime(df$END_DATE_TIME, df$START_DATE_TIME) < 0)` which shouldn't be greater than zero. If it is the case then you need to remove those rows and then run the code on remaining data else you may need to share the reproducible example to analyze further. – Prem Jun 13 '18 at 06:20
  • You may want to see this [link](https://meta.stackexchange.com/questions/5234/how-does-accepting-an-answer-work) as well. – Prem Jun 19 '18 at 06:18
0

Similar to @Prem's but using list columns and unnest :

df %>% 
  rowwise %>%
  mutate(START_DATE_TIME = list(seq.POSIXt(START_DATE_TIME, END_DATE_TIME, by = "4 hour")),
         END_DATE_TIME = list(c(tail(START_DATE_TIME,-1),END_DATE_TIME))) %>%
  unnest %>%
  mutate(Hrs_Time_Worked = difftime(END_DATE_TIME,START_DATE_TIME, "hours"))

# # A tibble: 3 x 6
#     Row EMPLID TIME_RPTG_CD Hrs_Time_Worked START_DATE_TIME     END_DATE_TIME      
#   <int> <chr>  <chr>        <time>          <dttm>              <dttm>             
# 1     1 X00007 REG          4               2014-07-03 16:00:00 2014-07-03 20:00:00
# 2     1 X00007 REG          4               2014-07-03 20:00:00 2014-07-04 00:00:00
# 3     1 X00007 REG          2               2014-07-04 00:00:00 2014-07-04 02:00:00

It is more efficient to use map than rowwise, though less readable in my opinion, to use map you can do this :

df %>% 
  mutate(START_DATE_TIME = map(START_DATE_TIME,~seq.POSIXt(., END_DATE_TIME, by = "4 hour")),
         END_DATE_TIME = map2(END_DATE_TIME,START_DATE_TIME,~c(tail(.y,-1),.x))) %>%
  unnest %>%
  mutate(Hrs_Time_Worked = difftime(END_DATE_TIME,START_DATE_TIME, "hours"))

#   Row EMPLID TIME_RPTG_CD Hrs_Time_Worked     START_DATE_TIME       END_DATE_TIME
# 1   1 X00007          REG         4 hours 2014-07-03 16:00:00 2014-07-03 20:00:00
# 2   1 X00007          REG         4 hours 2014-07-03 20:00:00 2014-07-04 00:00:00
# 3   1 X00007          REG         2 hours 2014-07-04 00:00:00 2014-07-04 02:00:00

In this case the output is not a tibble but a standard data.frame, which explains why the Hrs_Time_Worked column prints differently. Use as_tibble to get the same output. Or use as.numeric on any solution to have it as double.

moodymudskipper
  • 46,417
  • 11
  • 121
  • 167