using data.table to calculate and format time difference between rows by group

Question

Im new to the data.table "scene" so I apologize if my question is simplistic. I am constantly in the position to where I have to apply some analysis or subset some data grouped by a Unique ID. Typically I have about 1,000 rows per Unique ID with about 30 Unique IDs. So, Ive been advised to switch to data.table instead of trying to figure out the lapply or sapply or plyr package.

Here's a sample of my type of data

    structure(list(ID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 
3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L), dt = structure(c(1138366975, 
1138370472, 1138374064, 1138377669, 1138381264, 1138384873, 1138388503, 
1138399312, 1138402842, 1138406507, 1138413700, 1138417261, 1138420848, 
1138424444, 1138428071, 1138431695, 1138435287, 1138438938, 1138442428, 
1138446098), class = c("POSIXct", "POSIXt"), tzone = "GMT")), .Names = c("ID", 
"dt"), row.names = c(NA, -20L), class = "data.frame")

I convert this into a data.table

X = data.table(test)

set my "key" to be the Individual

setkey(X,ID)

Then the goal is to calculate in HOURS(at the moment either I am hoping will be easy) the time difference. So to take Time2-Time1 to get the hours and minutes between each successive location BY Individual (in this case ID).

X[, diff:=c(NA,diff(dt)),by = ID]

The diff command here calculates it in minutes, but I would like to convert/round this to hours in the most efficient way while still keeping the value as a POSIX or time object. I know I could likely create yet another column and divide diff by 60. But I was hoping that there was some way to just type "hours" or "minutes" or something somewhere. As I am not understanding how data.table handles time. Ive tried doing this in a data.frame using a for loop using difftime command, but its so cumbersome and linking the data back to the original dataframe is confusing to me as I am not proficient with for loops.

Once I get the data into hours, I want to select only the data that is 0.5 hours apart, then 4 hours apart then 12 hours apart. Which I haven't figured out how to do yet in data.table

Simon O'Hanlon · Answer 1 · 2013-12-13T12:25:09.787

2

Here's one way to do it, prob not the most efficient though...

X[ , diff := c( NA_character_ , difftime( tail( dt , -1 ) , head( dt , -1 ) , units = "hours" ) ) , by = ID ]
#    ID                V1
# 1:  1                NA
# 2:  1 0.971388888888889
# 3:  1 0.997777777777778
# 4:  1  1.00138888888889
# 5:  2                NA

edited Dec 13 '13 at 12:25

answered Dec 12 '13 at 13:47

Simon O'Hanlon

58,647
14
142
184

Can you explain why you would have to reference 'tail(dt,-1)' and the head version? I do not understand why this would work. I am not clear as to why you would have to refer to the "last" or tail but not the very last one (-1)?? Also, what is the NA_character_ saying? Is this a command or referring to something else? – Kerry Dec 13 '13 at 11:06
Also, when I run this command, it seems to produce a "new" table? not sure, but basically it does not simply add an additional column to the end of the data.table? I have tried to add something like `X$timediff<-X[ , c( NA_character_ , difftime( tail( dt , -1 ) , head( dt , -1 ) , units = "hours" ) ) , by = ID ]` and get an error message and the new column of information is more like a list of 3 new elements associated with the 1 variable. Very confusing. – Kerry Dec 13 '13 at 11:45
@Kerry you are confusing `data.frame` and `data.table` syntax. Don't do `X$timediff <- ...`. `X[ , c( NA_character_ , ...` is correct. Run it, then have a look at `X` afterwards. You will see a new column. The `NA_character_ is simply the character version of `NA` which is needed to coerce the result to a `character` vector because otherwise `data.table` complains about a mismatch in datatypes (numeric and character). – Simon O'Hanlon Dec 13 '13 at 12:20
As for `head` and `tail`... try it! e.g. `x <- 1:5; head( x , -1 ); tail( x , -1 )`. The first one returns the vector without the last element and the use of `tail` with `-1` returns a vector without the first element so we can subtract a subsequent time from the time immediately preceeding it. (try `tail(x,-1) - head(x,-1)` it's equivalent to `2:5 - 1:4`, so the result is `1 1 1 1`. – Simon O'Hanlon Dec 13 '13 at 12:23
I also updated so it returns the column as a new column in the orginal `data.table`. I hope that is clear now. – Simon O'Hanlon Dec 13 '13 at 12:25

score 1 · Accepted Answer · answered Dec 12 '13 at 16:43

Isn't X[, diff := c(NA,round(diff(dt)/60)),by=ID] simpler? There doesn't seem to be a time penalty.

f1 <- function(X){return(X[, diff := {tmp = diff(dt); units(tmp) <- "hours"; c(NA, as.numeric(tmp))}, by=ID][])}
f2 <- function(X){return(X[, diff := c(NA,round(diff(dt)/60)),by=ID])}

library(microbenchmark)
microbenchmark(f1(X),f2(X))
# Unit: milliseconds
#   expr      min       lq   median       uq      max neval
#  f1(X) 4.676918 4.772861 5.233032 5.324829 7.387008   100
#  f2(X) 4.615325 4.854294 5.161371 5.383165 7.147151   100

I thought of this too, but wanted to keep the attribute of it being a "time" object if possible. So just in case next time I wanted to know how many days or months occurred between the locations. This method works for sure, just not easily convertible to other units of measure — Kerry, Dec 13 '13 at 11:02

using data.table to calculate and format time difference between rows by group

2 Answers2