1

R coding in Databricks.

I want the time intervals in hours between date1 entries, arranged by pid, med and date1.

I want the latest date1 entry in a sequence of events per date to be manually adjustable to 24 hours.

A cohort is when pid, med and date1 is the same.
Any change will terminate the last cohort with an hour_output == 24.

df

  pid       med                date1                    
1  1       drugA             2023-02-02 09:00:00         
2  1       drugA             2023-02-02 12:00:00         
3  1       drugA             2023-02-02 14:00:00        
4  1       drugB             2023-02-03 10:00:00         
5  1       drugB             2023-02-03 18:00:00 

What script was attempted.

df1 <- df %>%
  arrange(pid, med, date1) %>%
  mutate(hours_output = as.numeric(difftime(lead(date1), date1, units = "hours")))

# Replace the last duration value with 24 hours
df1$hours_output[last(nrow(df1))] <- 24

df1 <- df1 %>% select(med, date1, hours_output)
head(df1)

Actual output

  pid       med                date1                    hours_output
1  1       drugA             2023-02-02 09:00:00         3.00
2  1       drugA             2023-02-02 12:00:00         2.00
3  1       drugA             2023-02-02 14:00:00        20.00
4  1       drugB             2023-02-03 10:00:00         8.00
5  1       drugB             2023-02-03 18:00:00        18.00 (18 hours to the next row - not shown)

Desired output

  pid       med                date1                    hours_output
1  1       drugA             2023-02-02 09:00:00         3.00
2  1       drugA             2023-02-02 12:00:00         2.00
3  1       drugA             2023-02-02 14:00:00        24.00
4  1       drugB             2023-02-03 10:00:00         8.00
5  1       drugB             2023-02-03 18:00:00        24.00
db2020
  • 69
  • 8
  • It isn't clear to my why row 3 is 24 hours; row 4 is same pid, same med, and 20 hours in the future, why do you want it to be 24? Is it because it isn't the same _date_? – r2evans Apr 26 '23 at 18:49
  • The dummy data is loosely based on health data which needs the last date to be 24 hours as the true duration is unknown. The pid is for a single patient, but can be many different pids, the same for the med. So that pid 1 can have one or multiple meds given in a day and the final datetime is made 24 hours. I corrected drugB to drugA in row 3. – db2020 Apr 26 '23 at 19:23
  • I think your edit is a better resolution of my question: row 3 should have been `drugA`, not `drugB` as it used to be. – r2evans Apr 26 '23 at 19:27
  • To be clear ... you say `date1 is the same`, but all `date1` are different ... do you mean that the last row of a day itself should always be 24? What if the next row is the same patient, same med, and fewer than 24 hours away, should it be the min of the real delta and 24? – r2evans Apr 26 '23 at 19:34
  • Rows with the same pid, med and date, but different times are to be considered a cohort. Each cohort interval can only be calculated if there is a follow on entry with the same pid, med and date. For the final cohort datetime, an interval cannot be calculated, therefore, I will use 24 hours as a default. – db2020 Apr 26 '23 at 20:03
  • If the pid or med or date changes, this will lead to a restart of the interval count until the final cohort datetime which will == 24 hours. Also, if only one row exists == 24 hours. – db2020 Apr 26 '23 at 20:06

3 Answers3

1

Since you want to do it per patient and per med, you should be using group_by so that differences don't mistakenly calculate differences between different groups.

library(dplyr) # 1.1.0 for .by=
df %>%
  mutate(date1 = as.POSIXct(date1)) %>% # may not be needed with your real data
  mutate(
    hours_output = as.numeric(c(diff(date1), 24), units="hours"),
    .by = c(pid, med)
  )
#   pid   med               date1 hours_output
# 1   1 drugA 2023-02-02 09:00:00      3 hours
# 2   1 drugA 2023-02-02 12:00:00      2 hours
# 3   1 drugA 2023-02-02 14:00:00     24 hours
# 4   1 drugB 2023-02-03 10:00:00      8 hours
# 5   1 drugB 2023-02-03 18:00:00     24 hours

I'm using .by= which is new to dplyr_1.1.0; if you have a version before that, then use group_by explicitly:

df %>%
  mutate(date1 = as.POSIXct(date1)) %>%
  group_by(pid, med) %>%
  mutate(hours_output = as.numeric(c(diff(date1), 24), units="hours"))
r2evans
  • 141,215
  • 6
  • 77
  • 149
  • Thank you for your assistance and I have tried both approaches without success. I have tried many coding variations and on each occasion I have returned outputs in seconds when hours was coded for and quite high values which are high even after converting to hours. Not sure what is causing this. – db2020 Apr 26 '23 at 19:59
  • *"seconds when hours was coded"* --> see my edit for the use of `as.numeric(.., units="hours")`, it should correct errant "seconds" in your output. As for the premise of "when pid, med, date changes", that's a fragile way of grouping data, I term it the "spreadsheet grouping" since Excel doesn't do it more robustly. The use of `group_by` (or `.by=`) is much more robust to order and such that I strongly suggest switching to that mindset. (There are some circumstances where it is not preferred, such as between-group analysis, but that does not yet seem suggested here.) – r2evans Apr 26 '23 at 21:31
0

I can make this work as a sample in Databricks.

Working example using dummy data

df <- df %>% arrange(datetime_col)

# Sample dataframe with datetime values
df <- data.frame(datetime_col = c("2023-02-02 09:00:00", "2023-02-02 12:00:00", "2023-02-02 14:00:00"))

# Convert datetime column to POSIXct object
df$datetime_col <- as.POSIXct(df$datetime_col, format = "%Y-%m-%d %H:%M:%S")
#df$datetime_col

# Calculate duration between consecutive datetime values, including last interval
durations <- c(diff(df$datetime_col, units = "hours"), 0)

# Convert durations to hours
durations <- as.numeric(durations, units = "hours")

# Replace last duration value with 24 hours
durations[length(durations)] <- 24

durations <- round(durations, 2)

df$duration <- durations

df_f <- df %>% select(datetime_col, duration)
df_f

db2020
  • 69
  • 8
0

The following works on the dummy data (as at top of question) when used in Databricks.

Gives desired output.

(Also, works on actual data in Databricks)

library(dplyr)
library(lubridate)

# Convert datetime column to POSIXct object
df$date1 <- ymd_hms(df$date1)

df <- df %>% arrange(date1)

# Calculate duration between consecutive datetime values, including last interval
durations <- c(diff(df$date1), 0)

# Convert durations to hours and round to 2 decimal places
durations <- round(as.numeric(durations, units = "hours"), 2)

# Replace any negative values with 0
durations[durations < 0] <- 0

# Find last timestamp for each date and replace duration with 24 hours
last_times <- dc_4 %>%
  group_by(Date = as.Date(date1)) %>%
  slice_tail(n = 1) %>%
  ungroup()

durations[df$date1 %in% last_times$date1] <- 24

df$duration <- durations

df1 <- df %>% select(date1, duration)

head(df1, 10)

db2020
  • 69
  • 8