Date differences between specific events in R

Question

I've spent quite some time trying to calcule date differences between specific events. For example, I have the data set below, where t is the event datetime, e is the event ('A' or 'R') and id is an identifier.

t                   e   id
2013-06-15 20:59:17 A   086
2013-06-26 18:02:09 R   086
2013-06-27 03:17:07 A   086
2013-06-01 11:34:09 R   115
2013-06-16 19:59:08 R   522
2013-06-16 20:05:04 R   522
2013-06-07 09:31:15 A   524
2013-06-09 16:24:04 R   524
2013-06-14 03:38:09 A   524
2013-06-16 15:49:09 R   524
2013-06-21 03:54:19 A   524
2013-06-12 12:34:37 A   638
2013-06-12 13:15:27 A   638
2013-06-15 16:12:23 R   638
2013-06-18 22:05:03 A   638
2013-06-28 13:30:20 R   638
2013-06-28 22:20:12 A   638
2013-06-01 18:34:46 A   836
2013-06-01 18:44:53 A   836
2013-06-03 14:35:09 R   836
2013-06-03 20:25:51 A   836
2013-06-03 20:27:25 A   836
2013-06-03 20:32:17 A   836
2013-06-08 16:22:07 R   836
2013-06-12 13:12:21 A   836
2013-06-12 13:15:16 A   836
2013-06-12 13:18:53 A   836
2013-06-12 18:59:24 A   836
2013-06-23 21:14:12 R   836
2013-06-24 20:16:11 R   836
2013-06-25 03:34:29 A   836
2013-06-26 20:33:11 R   836

Would like to make a new data set where, for each id, I would have the datetime difference (in days) between the 'A' events and the 'R' events.

For each id: if there are not events A before E then compute nothing. If there is three events A before two events E then compute the diff days with the first event A and the first event E. If there is only events A or E then compute nothing. Then, for the id=086, 115, 638 I would like to get this duples (id, days) (086, 10.9) (638, 3.2) (638, 9.6). The id=115 doesn't appears or could it be NaN beacause it has only one event R and no events A before it.

id  days 
086 10.9  (Ex. 2013-06-26 18:02:09 - 2013-06-15 20:59:17)
524 2.3
524 2.5
638 3.2
638 9.6
836 1.8
836 4.8   (Ex. 2013-06-08 16:22:07 - 2013-06-03 20:25:51)
836 11.3
836 1.7

I'm using this code as a first approach:

aggregate(as.POSIXct(df$t), list(df$id), diff)

Any ideas as to how to do this? Thanks in advance.

A dput of dataframe is:

> dput(df)
structure(list(t = c("2013-06-15 20:59:17", "2013-06-26 18:02:09", 
"2013-06-27 03:17:07", "2013-06-01 11:34:09", "2013-06-16 19:59:08", 
"2013-06-16 20:05:04", "2013-06-07 09:31:15", "2013-06-09 16:24:04", 
"2013-06-14 03:38:09", "2013-06-16 15:49:09", "2013-06-21 03:54:19", 
"2013-06-12 12:34:37", "2013-06-12 13:15:27", "2013-06-15 16:12:23", 
"2013-06-18 22:05:03", "2013-06-28 13:30:20", "2013-06-28 22:20:12", 
"2013-06-01 18:34:46", "2013-06-01 18:44:53", "2013-06-03 14:35:09", 
"2013-06-03 20:25:51", "2013-06-03 20:27:25", "2013-06-03 20:32:17", 
"2013-06-08 16:22:07", "2013-06-12 13:12:21", "2013-06-12 13:15:16", 
"2013-06-12 13:18:53", "2013-06-12 18:59:24", "2013-06-23 21:14:12", 
"2013-06-24 20:16:11", "2013-06-25 03:34:29", "2013-06-26 20:33:11"
), e = c("A", "R", "A", "R", "R", "R", "A", "R", "A", "R", "A", 
"A", "A", "R", "A", "R", "A", "A", "A", "R", "A", "A", "A", "R", 
"A", "A", "A", "A", "R", "R", "A", "R"), id = c("086", "086", 
"086", "115", "522", "522", "524", "524", "524", "524", "524", 
"638", "638", "638", "638", "638", "638", "836", "836", "836", 
"836", "836", "836", "836", "836", "836", "836", "836", "836", 
"836", "836", "836")), .Names = c("t", "e", "id"), row.names = c(855945L, 
1481100L, 1508045L, 16944L, 920490L, 921005L, 349201L, 494172L, 
746450L, 904442L, 1163757L, 653045L, 654357L, 834901L, 1047932L, 
1583218L, 1613753L, 36421L, 37178L, 139968L, 162274L, 162417L, 
162804L, 430725L, 654254L, 654350L, 654453L, 670726L, 1333676L, 
1384583L, 1401293L, 1491782L), class = "data.frame")

Your output shows sometimes several time spans for a given id. How does that fit to the "first A to first R event"? — Karsten W., Jul 18 '13 at 21:18
@KarstenW. : For example, for id==836, the first difference will be: df[df$id==836 & df$t<='2013-06-03 14:35:09',"t"][3] minus df[df$id==836 & df$t<='2013-06-03 14:35:09',"t"][1] — MSS, Jul 18 '13 at 21:39
And what do you want to happen when the number of "A" events does not match the number of "E" events for a given id? For instance, id=086, 115, 638? — Karsten W., Jul 19 '13 at 07:17
For each id: if there are not events A before E then compute nothing. If there is three events A before two events E then compute the diff days with the first event A and the first event E. If there is only events A or E then compute nothing. Then, for the id=086, 115, 638 I would like to get this duples (id, days) (086, 10.9) (638, 3.2) (638, 9.6). The id=115 doesn't appears or could it be NaN beacause it has only one event R and no events A before it. — MSS, Jul 19 '13 at 15:59

ialm · Accepted Answer · 2013-07-19T17:00:30.787

Here is a solution in 1 line using the ddply function from the plyr package and the lubridate package to parse the dates.

Code:

library(plyr)
library(lubridate)

new_df <- ddply(.data=df, .variables=c('id'), summarize,
                days=round(ymd_hms(t[match('R',e)])-ymd_hms(t[match('A',e)]),1))
new_df

Output:

   id      days
1 086 10.9 days
2 115   NA days
3 522   NA days
4 524  2.3 days
5 638  3.2 days
6 836  1.8 days

Note that there are 2 warnings because the ids 115 and 522 do not have a value for the e variable.

If you want the date difference to be a decimal value, you can use the as.double function, like so:

Basically, I am using the match function to find the first occurrence of A and R, parsing the date variable with the ymd_hms function from the lubridate package, and then finding the difference of the two dates. I round it to 1 decimal place, and then convert it into a double for display.

EDIT

After reading the OPs comments, here is a rather ugly way to get the desired result. Forgive me, it is early in the morning, and it may not be elegant or efficient, but it seems to output the desired result.

Code:

grouper <- function(var, group) {
  num <- 1
  res <- c(1:length(var))
  for(i in 1:length(var)) {
    res[i] <- num
    if(var[i]==group) {
      num <- num+1
    }
  }
  return(res)
}

df2 <- df
df2$group <- ddply(.data=df, .variables='id', summarize, group=grouper(e,'R'))$group

df3 <- ddply(.data=df2, .variables=c('id','group'), summarize,
             days=round(ymd_hms(t[match('R',e)])-ymd_hms(t[match('A',e)]),1))

df3[complete.cases(df3),-2]

Output:

    id      days
1  086 10.9 days
6  524  2.3 days
7  524  2.5 days
9  638  3.2 days
10 638  9.6 days
12 836  1.8 days
13 836  4.8 days
14 836 11.3 days
16 836  1.7 days

The idea is to add another column that groups the rows by the occurrence of an 'R' event, so that I can subset the data set by both ID and 'R' event. It is kind of hacky, and I am sure there are more elegant ways to do it.

Now, I'm off to get some coffee.

Thanks. It partially works because the id 524 has to appear 2 times (2.3 and 2.5 days) and it appears 1 time (2.3 days). I mean 2013-06-09 16:24:04 - 2013-06-07 09:31:15 = 2.3 and 2013-06-16 15:49:09 - 2013-06-14 03:38:09 = 2.5 — MSS, Jul 18 '13 at 22:58
Hmm, OK. When I read the question, I thought that you wanted just the difference between the first R event and the first A event for each ID. — ialm, Jul 19 '13 at 16:54

George Steblovsky · Answer 2 · 2013-07-20T07:55:58.530

No need for anything, but basic R. Order your data.frame, choose your "first" appearances and finally use aggregate similar to what you use:

df <- df[do.call(order, df), ]
df <- df[!duplicated(df[, c("id", "e")]), ]
tdiff <- function(x) {
  if(length(x) == 2) {
     rv <- as.numeric(difftime(strptime(x[2], format="%Y-%m-%d %H:%M:%S"),
                               strptime(x[1], format="%Y-%m-%d %H:%M:%S"),
                               units = "days"))
  } else {
     rv <- NA
  }
  rv
}

rv <- aggregate(df$t, by = list(id = df$id), tdiff)

Just for the sake of closure as you don't need it anymore, here is the version that works the way you want.

df <- df[do.call(order, df), ]
df_a <- subset(df, e == "A")
df_a <- df_a[!duplicated(df_a[, c("id", "e")]), ]
df_r <- subset(df, e == "R")
df_r[, 'A'] <- df_a[match(df_r$id, df_a$id), 't']
df_r[, 'R_A'] <- as.numeric(difftime(strptime(df_r[, 't'], format="%Y-%m-%d %H:%M:%S"),
                           strptime(df_r[, 'A'], format="%Y-%m-%d %H:%M:%S"),
                           units = "days"))
rv <- df_r[, c('id', 'R_A')]
rv[!is.na(rv$R_A) & rv$R_A < 0, 'R_A'] <- NA
rv <- rv[!duplicated(rv), ]

@MSV, sorry for an error. I did have access to R to test what I wrote. The fix is coming as I type. — George Steblovsky, Jul 19 '13 at 16:18
No prob, now the code works. But the result isn't ok because the id 524 has to appear 2 times (2.3 and 2.5 days) and with your code it appears 1 time (2.3 days). I mean (2013-06-09 16:24:04 - 2013-06-07 09:31:15) = 2.3 days and (2013-06-16 15:49:09 - 2013-06-14 03:38:09) = 2.5 days — MSS, Jul 20 '13 at 01:21

score 2 · Answer 3 · answered Jul 18 '13 at 21:31

2

Here is one approach

df <- transform(df, t=as.POSIXct(t))
sp <- split(df, df$id)
calc_diff <- function(x) {
    start <- min(subset(x, e=="A")$t)
    end <- min(subset(x, e=="R")$t)
    return(end-start)
}
sapply(sp, FUN=calc_diff)

answered Jul 18 '13 at 21:31

Karsten W.

17,826
11
69
103

Thanks. Here occurs the same that previous code. It partially works because the id 524 has to appear 2 times (2.3 and 2.5 days) and it appears 1 time (2.3 days). I mean 2013-06-09 16:24:04 - 2013-06-07 09:31:15 = 2.3 and 2013-06-16 15:49:09 - 2013-06-14 03:38:09 = 2.5 – MSS Jul 18 '13 at 23:01
Maybe you can alter the `calc_diff` function to fit your needs. I have still some difficulties understanding your needs, and I think it is too specialized to be useful for other readers. – Karsten W. Jul 19 '13 at 23:23

Date differences between specific events in R

3 Answers3

EDIT