
I work with electronically tagged fish. A snippet of my telemetry data (dataframe "d") is below. Each timestamp represents a detection for a unique fish.

TagID          Detection              Location      RiverKm
163            02/23/2012 03:17:44    Alcatraz_E     4.414
163            02/23/2012 03:56:25    Alcatraz_E     4.414
163            04/14/2012 15:10:20    Alcatraz_E     4.414
163            04/14/2012 15:12:11    Alcatraz_N     4.414
163            03/11/2012 08:59:48    Alcatraz_N     4.414
163            03/11/2012 09:02:15    Alcatraz_N     4.414
163            03/11/2012 09:04:05    Alcatraz_N     4.414
163            03/11/2012 09:04:06    Alcatraz_N     4.414
163            03/11/2012 09:06:09    Alcatraz_N     4.414
163            03/11/2012 09:06:11    Alcatraz_E     4.414

There are many different TagIDs (individual fish). I'd like to categorize the detections into encounter periods for each fish, by identifying a start time ("arrival") and an end time ("departure"), with a critical value of 1 hour: a gap of more than one hour between successive detections starts a new encounter. For example, for the above fish (TagID 163), the output would be:

TagID       arrival                  departure            Location        RiverKm
163        02/23/2012 03:17:44    02/23/2012 03:56:25     Alcatraz_E       4.414 
163        04/14/2012 15:10:20    04/14/2012 15:12:11     Alcatraz_N       4.414
163        03/11/2012 08:59:48    03/11/2012 09:06:11     Alcatraz_E       4.414

I'd like to create a loop (or any other code structure) that does the following (my rough attempt in base R is below the steps):

for j in 1:length(unique(d$TagID))
  1. Identify the time of the first detection ("t1").
  2. IF the next detection for that tag in the time series ("t2") is less than one hour after t1, skip it and continue to the next detection; ELSE, place t1 in an "arrival" vector and t2 in a "departure" vector.
  3. Stop when every arrival and departure timestamp has been categorized for each TagID.
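
Here is my very rough attempt in base R. It seems to give the right arrivals and departures on the snippet above (ignoring Location and RiverKm for now), but it feels clumsy and I'm sure it will be slow with many tags:

d$Detection <- as.POSIXct(d$Detection, format = '%m/%d/%Y %H:%M:%S')

out <- data.frame()
for (tag in unique(d$TagID)) {
  det <- sort(d$Detection[d$TagID == tag])
  arrival <- det[1]
  for (i in seq_along(det)) {
    is_last <- i == length(det)
    # close the encounter at the last detection, or just before a gap of > 1 hour
    if (is_last || difftime(det[i + 1], det[i], units = 'hours') > 1) {
      out <- rbind(out, data.frame(TagID = tag, arrival = arrival, departure = det[i]))
      if (!is_last) arrival <- det[i + 1]
    }
  }
}
out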

I have no idea how to do this in the most efficient way, and would appreciate your help immensely.

Thank you!

Von

2 Answers


You should first order your data by TagID and date, which is why you should convert your Detection variable to a valid R datetime type: POSIXct. Once the data is ordered, diff and cumsum let you create a grouping variable for jump detection: here a jump occurs after a gap of more than an hour (60 minutes). I am using data.table for its syntactic sugar in grouping operations, but it is not necessary, especially if you don't have a huge amount of data.
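
To see the jump detection in isolation, here is a minimal sketch on a toy vector of timestamps (the values are invented just for illustration):

x <- as.POSIXct(c('2012-01-01 00:00:00', '2012-01-01 00:30:00',
                  '2012-01-01 02:00:00', '2012-01-01 02:10:00'))
## gaps between consecutive detections, in seconds
diff(as.numeric(x))
# [1] 1800 5400  600
## TRUE flags a gap of more than one hour; cumsum turns the flags into group ids
cumsum(c(F, diff(as.numeric(x)) > 3600))
# [1] 0 0 1 1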

Here is my complete code:

library(data.table)
## data coercion
d$Detection <-
  as.POSIXct(strptime(d$Detection,'%m/%d/%Y %H:%M:%S'))
## sort by TagID, then Detection, so gaps are measured within each fish
d <- d[order(d$TagID, d$Detection),]
## id is an incrementing variable that detects a jump of more than an hour;
## working on as.numeric(Detection) keeps the gap in seconds (3600 s = 1 h),
## whatever units diff would otherwise pick for the difftime
d$id <- cumsum(c(F, diff(as.numeric(d$Detection)) > 3600))
## you don't mention how to choose Location/RiverKm, so by default I take the first ones
setDT(d)[,list(start   =Detection[1],
               end     =Detection[length(Detection)],
               Location=Location[1],
               RiverKm =RiverKm[1]),
         by="TagID,id"]

#    TagID id               start                 end   Location RiverKm
# 1:   163  0 2012-02-23 03:17:44 2012-02-23 03:56:25 Alcatraz_E   4.414
# 2:   163  1 2012-03-11 08:59:48 2012-03-11 09:06:11 Alcatraz_N   4.414
# 3:   163  2 2012-04-14 15:10:20 2012-04-14 15:12:11 Alcatraz_E   4.414
agstudy

Here is a similar approach with dplyr (version 0.3). I edited my code to use the new functions from 0.3.

# If you need to download the latest development version
if (packageVersion("devtools") < "1.6") {
  install.packages("devtools")
}
devtools::install_github("hadley/lazyeval")
devtools::install_github("hadley/dplyr")

library(dplyr)

foo <- data.frame(
    TagID = rep(c(163:164), each = 10),
    Detection = rep(c("02/23/2012 03:17:44", "02/23/2012 03:56:25", "04/14/2012 15:10:20",
                  "04/14/2012 15:12:11", "03/11/2012 08:59:48", "03/11/2012 09:02:15",
                  "03/11/2012 09:04:05", "03/11/2012 09:04:06", "03/11/2012 09:06:09",
                  "03/11/2012 09:06:11"), times = 2),
    Location = rep(c("Alcatraz_E", "Alcatraz_E", "Alcatraz_E", "Alcatraz_N", "Alcatraz_N",
                 "Alcatraz_N", "Alcatraz_N", "Alcatraz_N", "Alcatraz_N", "Alcatraz_E"),times = 2),
    RiverKm = 4.414,
    stringsAsFactors = FALSE)

foo$Detection <- as.POSIXct(strptime(foo$Detection,'%m/%d/%Y %H:%M:%S'))

foo %>%
    arrange(TagID, Detection) %>%
    # id: a new encounter starts at the first detection of each tag, or after
    # a gap of more than one hour (3600 seconds) between detections
    group_by(TagID, id = cumsum(!duplicated(TagID) | c(F, diff(as.numeric(Detection)) > 3600))) %>%
    slice(c(1, length(Detection))) %>%    # keep the first and last detection of each encounter
    mutate(Departure = Detection[2]) %>%  # the second remaining row is the departure
    slice(1) %>%                          # keep only the arrival row
    ungroup


#  TagID           Detection   Location RiverKm id           Departure
#1   163 2012-02-23 03:17:44 Alcatraz_E   4.414  1 2012-02-23 03:56:25
#2   163 2012-03-11 08:59:48 Alcatraz_N   4.414  2 2012-03-11 09:06:11
#3   163 2012-04-14 15:10:20 Alcatraz_E   4.414  3 2012-04-14 15:12:11
#4   164 2012-02-23 03:17:44 Alcatraz_E   4.414  4 2012-02-23 03:56:25
#5   164 2012-03-11 08:59:48 Alcatraz_N   4.414  5 2012-03-11 09:06:11
#6   164 2012-04-14 15:10:20 Alcatraz_E   4.414  6 2012-04-14 15:12:11
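
If you prefer, the two slice calls can be collapsed into a single summarise; this is just a sketch of the same grouping idea (it assumes the first and last helpers from dplyr 0.3):

foo %>%
    arrange(TagID, Detection) %>%
    group_by(TagID, id = cumsum(!duplicated(TagID) | c(F, diff(as.numeric(Detection)) > 3600))) %>%
    summarise(arrival   = first(Detection),
              departure = last(Detection),
              Location  = first(Location),
              RiverKm   = first(RiverKm))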
jazzurro
  • I appreciate the use of dplyr - it's great to know this is possible without a for-loop. However, it looks like only a list of the first detections ("arrival") is returned, without the departures. Is there a way to keep the final time stamp (time of departure)? – Von Oct 09 '14 at 05:20
  • @Von I am sorry I did not read your question carefully. Let me revise this. – jazzurro Oct 09 '14 at 05:25
  • @Von I revised this script. In the end, the `data.table` approach requires much less typing. – jazzurro Oct 09 '14 at 06:49
  • got it, thank you!! More typing, but still perhaps the best approach for expanding functionality eventually. – Von Oct 09 '14 at 21:47
  • @Von It is up to you, but agstudy's approach is faster and shorter. – jazzurro Oct 10 '14 at 00:18
  • @agstudy eventually I'd like to expand this function to divide the encounters into time of day (dawn, dusk, day, night), river stage (high tide, low tide), etc. I had heard dplyr was a good way to do that. That was all I meant. Since I'm asking the question, obviously I don't really know anything - I didn't mean to offend anyone. – Von Oct 11 '14 at 19:11
  • @agstudy All good. I was not offended at all. I appreciated your approach. I had things to learn. I am currently learning data.table, and you taught me something I did not know. For this task, data.table offers a really simple approach. Thank you. – jazzurro Oct 11 '14 at 23:51
  • @Von After learning new functions in `dplyr` 0.3, I updated my code. I used `slice`. I also squeezed a part of the code by adding `id = cumsum...` part in the first `group_by`. – jazzurro Oct 16 '14 at 05:38
  • @jazzurro cool! I'm working on writing the actual function now - I'll post the final script when it's done. – Von Oct 16 '14 at 17:21
  • @jazzurro do you think you could type out in words what the 2nd line of code does exactly? Mostly I'm confused by the "F"... – Von Oct 24 '14 at 17:22
  • @Von I am off to bed now. But that F thing is identical to what agstudy used: F is FALSE. I think running small pieces of the code will let you see what is happening. I have done that myself too. – jazzurro Oct 24 '14 at 17:50
  • @Von If you have more questions, let me know. – jazzurro Oct 26 '14 at 10:59