0

I've spent quite some time trying to calcule date differences between specific events. For example, I have the data set below, where t is the event datetime, e is the event ('A' or 'R') and id is an identifier.

t                   e   id
2013-06-15 20:59:17 A   086
2013-06-26 18:02:09 R   086
2013-06-27 03:17:07 A   086
2013-06-01 11:34:09 R   115
2013-06-16 19:59:08 R   522
2013-06-16 20:05:04 R   522
2013-06-07 09:31:15 A   524
2013-06-09 16:24:04 R   524
2013-06-14 03:38:09 A   524
2013-06-16 15:49:09 R   524
2013-06-21 03:54:19 A   524
2013-06-12 12:34:37 A   638
2013-06-12 13:15:27 A   638
2013-06-15 16:12:23 R   638
2013-06-18 22:05:03 A   638
2013-06-28 13:30:20 R   638
2013-06-28 22:20:12 A   638
2013-06-01 18:34:46 A   836
2013-06-01 18:44:53 A   836
2013-06-03 14:35:09 R   836
2013-06-03 20:25:51 A   836
2013-06-03 20:27:25 A   836
2013-06-03 20:32:17 A   836
2013-06-08 16:22:07 R   836
2013-06-12 13:12:21 A   836
2013-06-12 13:15:16 A   836
2013-06-12 13:18:53 A   836
2013-06-12 18:59:24 A   836
2013-06-23 21:14:12 R   836
2013-06-24 20:16:11 R   836
2013-06-25 03:34:29 A   836
2013-06-26 20:33:11 R   836

Would like to make a new data set where, for each id, I would have the datetime difference (in days) between the 'A' events and the 'R' events.

For each id: if there are not events A before E then compute nothing. If there is three events A before two events E then compute the diff days with the first event A and the first event E. If there is only events A or E then compute nothing. Then, for the id=086, 115, 638 I would like to get this duples (id, days) (086, 10.9) (638, 3.2) (638, 9.6). The id=115 doesn't appears or could it be NaN beacause it has only one event R and no events A before it.

id  days 
086 10.9  (Ex. 2013-06-26 18:02:09 - 2013-06-15 20:59:17)
524 2.3
524 2.5
638 3.2
638 9.6
836 1.8
836 4.8   (Ex. 2013-06-08 16:22:07 - 2013-06-03 20:25:51)
836 11.3
836 1.7

I'm using this code as a first approach:

aggregate(as.POSIXct(df$t), list(df$id), diff)

Any ideas as to how to do this? Thanks in advance.

A dput of dataframe is:

> dput(df)
structure(list(t = c("2013-06-15 20:59:17", "2013-06-26 18:02:09", 
"2013-06-27 03:17:07", "2013-06-01 11:34:09", "2013-06-16 19:59:08", 
"2013-06-16 20:05:04", "2013-06-07 09:31:15", "2013-06-09 16:24:04", 
"2013-06-14 03:38:09", "2013-06-16 15:49:09", "2013-06-21 03:54:19", 
"2013-06-12 12:34:37", "2013-06-12 13:15:27", "2013-06-15 16:12:23", 
"2013-06-18 22:05:03", "2013-06-28 13:30:20", "2013-06-28 22:20:12", 
"2013-06-01 18:34:46", "2013-06-01 18:44:53", "2013-06-03 14:35:09", 
"2013-06-03 20:25:51", "2013-06-03 20:27:25", "2013-06-03 20:32:17", 
"2013-06-08 16:22:07", "2013-06-12 13:12:21", "2013-06-12 13:15:16", 
"2013-06-12 13:18:53", "2013-06-12 18:59:24", "2013-06-23 21:14:12", 
"2013-06-24 20:16:11", "2013-06-25 03:34:29", "2013-06-26 20:33:11"
), e = c("A", "R", "A", "R", "R", "R", "A", "R", "A", "R", "A", 
"A", "A", "R", "A", "R", "A", "A", "A", "R", "A", "A", "A", "R", 
"A", "A", "A", "A", "R", "R", "A", "R"), id = c("086", "086", 
"086", "115", "522", "522", "524", "524", "524", "524", "524", 
"638", "638", "638", "638", "638", "638", "836", "836", "836", 
"836", "836", "836", "836", "836", "836", "836", "836", "836", 
"836", "836", "836")), .Names = c("t", "e", "id"), row.names = c(855945L, 
1481100L, 1508045L, 16944L, 920490L, 921005L, 349201L, 494172L, 
746450L, 904442L, 1163757L, 653045L, 654357L, 834901L, 1047932L, 
1583218L, 1613753L, 36421L, 37178L, 139968L, 162274L, 162417L, 
162804L, 430725L, 654254L, 654350L, 654453L, 670726L, 1333676L, 
1384583L, 1401293L, 1491782L), class = "data.frame")
Karsten W.
  • 17,826
  • 11
  • 69
  • 103
MSS
  • 53
  • 1
  • 8
  • can you `dput` your data so its easier to get into R – Jake Burkhead Jul 18 '13 at 20:43
  • Your output shows sometimes several time spans for a given id. How does that fit to the "first A to first R event"? – Karsten W. Jul 18 '13 at 21:18
  • @KarstenW. : For example, for id==836, the first difference will be: df[df$id==836 & df$t<='2013-06-03 14:35:09',"t"][3] minus df[df$id==836 & df$t<='2013-06-03 14:35:09',"t"][1] – MSS Jul 18 '13 at 21:39
  • And what do you want to happen when the number of "A" events does not match the number of "E" events for a given id? For instance, id=086, 115, 638? – Karsten W. Jul 19 '13 at 07:17
  • For each id: if there are not events A before E then compute nothing. If there is three events A before two events E then compute the diff days with the first event A and the first event E. If there is only events A or E then compute nothing. Then, for the id=086, 115, 638 I would like to get this duples (id, days) (086, 10.9) (638, 3.2) (638, 9.6). The id=115 doesn't appears or could it be NaN beacause it has only one event R and no events A before it. – MSS Jul 19 '13 at 15:59

3 Answers3

3

Here is a solution in 1 line using the ddply function from the plyr package and the lubridate package to parse the dates.

Code:

library(plyr)
library(lubridate)

new_df <- ddply(.data=df, .variables=c('id'), summarize,
                days=round(ymd_hms(t[match('R',e)])-ymd_hms(t[match('A',e)]),1))
new_df

Output:

   id      days
1 086 10.9 days
2 115   NA days
3 522   NA days
4 524  2.3 days
5 638  3.2 days
6 836  1.8 days

Note that there are 2 warnings because the ids 115 and 522 do not have a value for the e variable.

If you want the date difference to be a decimal value, you can use the as.double function, like so:

Basically, I am using the match function to find the first occurrence of A and R, parsing the date variable with the ymd_hms function from the lubridate package, and then finding the difference of the two dates. I round it to 1 decimal place, and then convert it into a double for display.

EDIT

After reading the OPs comments, here is a rather ugly way to get the desired result. Forgive me, it is early in the morning, and it may not be elegant or efficient, but it seems to output the desired result.

Code:

grouper <- function(var, group) {
  num <- 1
  res <- c(1:length(var))
  for(i in 1:length(var)) {
    res[i] <- num
    if(var[i]==group) {
      num <- num+1
    }
  }
  return(res)
}

df2 <- df
df2$group <- ddply(.data=df, .variables='id', summarize, group=grouper(e,'R'))$group

df3 <- ddply(.data=df2, .variables=c('id','group'), summarize,
             days=round(ymd_hms(t[match('R',e)])-ymd_hms(t[match('A',e)]),1))

df3[complete.cases(df3),-2]

Output:

    id      days
1  086 10.9 days
6  524  2.3 days
7  524  2.5 days
9  638  3.2 days
10 638  9.6 days
12 836  1.8 days
13 836  4.8 days
14 836 11.3 days
16 836  1.7 days

The idea is to add another column that groups the rows by the occurrence of an 'R' event, so that I can subset the data set by both ID and 'R' event. It is kind of hacky, and I am sure there are more elegant ways to do it.

Now, I'm off to get some coffee.

ialm
  • 8,510
  • 4
  • 36
  • 48
  • Thanks. It partially works because the id 524 has to appear 2 times (2.3 and 2.5 days) and it appears 1 time (2.3 days). I mean 2013-06-09 16:24:04 - 2013-06-07 09:31:15 = 2.3 and 2013-06-16 15:49:09 - 2013-06-14 03:38:09 = 2.5 – MSS Jul 18 '13 at 22:58
  • Hmm, OK. When I read the question, I thought that you wanted just the difference between the first R event and the first A event for each ID. – ialm Jul 19 '13 at 16:54
3

No need for anything, but basic R. Order your data.frame, choose your "first" appearances and finally use aggregate similar to what you use:

df <- df[do.call(order, df), ]
df <- df[!duplicated(df[, c("id", "e")]), ]
tdiff <- function(x) {
  if(length(x) == 2) {
     rv <- as.numeric(difftime(strptime(x[2], format="%Y-%m-%d %H:%M:%S"),
                               strptime(x[1], format="%Y-%m-%d %H:%M:%S"),
                               units = "days"))
  } else {
     rv <- NA
  }
  rv
}

rv <- aggregate(df$t, by = list(id = df$id), tdiff)

Just for the sake of closure as you don't need it anymore, here is the version that works the way you want.

df <- df[do.call(order, df), ]
df_a <- subset(df, e == "A")
df_a <- df_a[!duplicated(df_a[, c("id", "e")]), ]
df_r <- subset(df, e == "R")
df_r[, 'A'] <- df_a[match(df_r$id, df_a$id), 't']
df_r[, 'R_A'] <- as.numeric(difftime(strptime(df_r[, 't'], format="%Y-%m-%d %H:%M:%S"),
                           strptime(df_r[, 'A'], format="%Y-%m-%d %H:%M:%S"),
                           units = "days"))
rv <- df_r[, c('id', 'R_A')]
rv[!is.na(rv$R_A) & rv$R_A < 0, 'R_A'] <- NA
rv <- rv[!duplicated(rv), ]
  • Error in df$t : $ operator is invalid for atomic vectors – MSS Jul 18 '13 at 23:11
  • @MSV, sorry for an error. I did have access to R to test what I wrote. The fix is coming as I type. – George Steblovsky Jul 19 '13 at 16:18
  • No prob, now the code works. But the result isn't ok because the id 524 has to appear 2 times (2.3 and 2.5 days) and with your code it appears 1 time (2.3 days). I mean (2013-06-09 16:24:04 - 2013-06-07 09:31:15) = 2.3 days and (2013-06-16 15:49:09 - 2013-06-14 03:38:09) = 2.5 days – MSS Jul 20 '13 at 01:21
2

Here is one approach

df <- transform(df, t=as.POSIXct(t))
sp <- split(df, df$id)
calc_diff <- function(x) {
    start <- min(subset(x, e=="A")$t)
    end <- min(subset(x, e=="R")$t)
    return(end-start)
}
sapply(sp, FUN=calc_diff)
Karsten W.
  • 17,826
  • 11
  • 69
  • 103
  • Thanks. Here occurs the same that previous code. It partially works because the id 524 has to appear 2 times (2.3 and 2.5 days) and it appears 1 time (2.3 days). I mean 2013-06-09 16:24:04 - 2013-06-07 09:31:15 = 2.3 and 2013-06-16 15:49:09 - 2013-06-14 03:38:09 = 2.5 – MSS Jul 18 '13 at 23:01
  • Maybe you can alter the `calc_diff` function to fit your needs. I have still some difficulties understanding your needs, and I think it is too specialized to be useful for other readers. – Karsten W. Jul 19 '13 at 23:23