0

I'm working with an ecological dataset that has multiple individuals moving across a landscape where they can be detected at multiple sites. The data has a beginning and ending timestamp when an individual was detected at a given site; heron we'll call this time window for an individual at a site an "event". These events are the rows in this data. I sorted this data by time, and noticed I can have multiple events while an individual remains at a given site (which can be due to an individual moving away from the receiver and coming back to it while not being detected at an adjacent receiver).

Here's example data for a single individual, x:

input <- data.frame(individual = c("x","x","x","x","x","x","x"), 
                site = c("a","a","a","b","b","a", "a"), 
               start_time = as.POSIXct(c("2020-01-14 11:11:11", "2020-01-14 11:13:10", "2020-01-14 11:16:20", 
                                 "2020-02-14 11:11:11", "2020-02-14 11:13:10",
                                 "2020-03-14 11:12:11", "2020-03-15 11:12:11")), 
               end_time = as.POSIXct(c("2020-01-14 11:11:41", "2020-01-14 11:13:27", "2020-01-14 11:16:50", 
                                 "2020-02-14 11:13:11", "2020-02-14 11:15:10",
                                 "2020-03-14 11:20:11", "2020-03-15 11:20:11")))

I want to aggregate these smaller events (e.g. the first 3 events at site a) into one larger event where I summarize the start/end times for the whole event:

output <- data.frame(individual = c("x","x","x"), site = c("a", "b", "a"), 
                 start_time = as.POSIXct(c("2020-01-14 11:11:11", "2020-02-14 11:11:11", "2020-03-14 11:12:11")), 
                 end_time = as.POSIXct(c("2020-01-14 11:16:50", "2020-02-14 11:15:10",  "2020-03-15 11:20:11")))

Note that time intervals for events vary.

Using group_by(individual, site) would mean losing this temporal info, since individuals can travel among sites multiple times. I thought about using some sort of helper dataframe that summarizes events for individuals at sites but I am not sure how to retain the temporal info. I suppose there is a way to do this by indexing row numbers/looping in base but I am hoping there is a nifty dplyr trick that can help with this problem.

DylanMG
  • 45
  • 6
  • Nifty `data.table` trick: `rleid(site)`. See also [Is there a dplyr equivalent to data.table::rleid?](https://stackoverflow.com/questions/33507868/is-there-a-dplyr-equivalent-to-data-tablerleid) – Henrik Feb 08 '21 at 22:24
  • @Henrik that works great thank you and all the rest. My brain was having a hard time with this yesterday! – DylanMG Feb 09 '21 at 16:11

3 Answers3

1

One approach would be to take the cumulative sum of times that site has changed, and use that count to summarize each individual's contiguous times at one site.

library(dplyr)
  
input %>%
  arrange(individual, start_time) %>%
  mutate(indiv_new_site = cumsum(site != lag(site, default = ""))) %>%
  group_by(individual, site, indiv_new_site) %>%
  summarize(start_time = min(start_time),
            end_time = max(end_time))


# A tibble: 3 x 5
# Groups:   individual, site [2]
  individual site  indiv_new_site start_time          end_time           
  <chr>      <chr>          <int> <dttm>              <dttm>             
1 x          a                  1 2020-01-14 11:11:11 2020-01-14 11:16:50
2 x          a                  3 2020-03-14 11:12:11 2020-03-15 11:20:11
3 x          b                  2 2020-02-14 11:11:11 2020-02-14 11:15:10
  
Jon Spring
  • 55,165
  • 4
  • 35
  • 53
0

We could use rle from base R

library(dplyr)
input %>%
    arrange(individual, start_time) %>%
    group_by(individual, site, grp = with(rle(site), 
         rep(seq_along(values), lengths))) %>% 
    summarize(start_time = min(start_time),
         end_time = max(end_time), .groups = 'drop') %>%
    select(-grp)

-output

# A tibble: 3 x 4
#  individual site  start_time          end_time           
#  <chr>      <chr> <dttm>              <dttm>             
#1 x          a     2020-01-14 11:11:11 2020-01-14 11:16:50
#2 x          a     2020-03-14 11:12:11 2020-03-15 11:20:11
#3 x          b     2020-02-14 11:11:11 2020-02-14 11:15:10
akrun
  • 874,273
  • 37
  • 540
  • 662
0

In data.table we can use rleid.

library(data.table)

setDT(input)
input[, .(site = first(site),
          start_time = min(start_time), 
          end_time = max(end_time)), .(individual, rleid(site))]

#   individual rleid site          start_time            end_time
#1:          x     1    a 2020-01-14 11:11:11 2020-01-14 11:16:50
#2:          x     2    b 2020-02-14 11:11:11 2020-02-14 11:15:10
#3:          x     3    a 2020-03-14 11:12:11 2020-03-15 11:20:11
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213