
I have a dataset organized this way:

ID   Species       DateTime
P1   A             2015-03-16 18:42:00
P2   A             2015-03-16 19:34:00
P3   A             2015-03-16 19:58:00
P4   A             2015-03-16 21:02:00
P5   B             2015-03-16 21:18:00
P6   A             2015-03-16 21:19:00
P7   A             2015-03-16 21:33:00
P8   B             2015-03-16 21:35:00
P9   B             2015-03-16 23:43:00

I want to select independent pictures for each species (that is, pictures separated from each other by at least 1 hour) in this dataset, using R.

In this example, for species A, I would only want to keep P1, P3 and P4. P2 wouldn't be considered because it falls within the 1-hour period that started with P1. P3 is kept since its DateTime (19:58) falls after 19:42, and the next 1-hour period would then last until 20:58. For species B, I would keep only P5 and P9.
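
To make the rule concrete, here is the window arithmetic for species A (a small base R illustration; this assumes the timestamps parse with as.POSIXct):

t <- as.POSIXct(c("2015-03-16 18:42:00", "2015-03-16 19:34:00",
                  "2015-03-16 19:58:00", "2015-03-16 21:02:00"))
difftime(t[2], t[1], units = "mins")  # 52 mins -> P2 falls inside P1's window
difftime(t[3], t[1], units = "mins")  # 76 mins -> P3 kept, a new window starts at 19:58
difftime(t[4], t[3], units = "mins")  # 64 mins -> P4 kept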

Therefore, after this filter, my dataset would look like this:

ID   Species       DateTime
P1   A             2015-03-16 18:42:00
P3   A             2015-03-16 19:58:00
P4   A             2015-03-16 21:02:00
P5   B             2015-03-16 21:18:00
P9   B             2015-03-16 23:43:00

Does anyone know how to perform this in R?

mto23
  • Related: [Subset observations that differ by at least 30 minutes time](https://stackoverflow.com/questions/36918158/subset-observations-that-differ-by-at-least-30-minutes-time); [Subset time series so that selected rows differs by a certain minimum time](https://stackoverflow.com/questions/41816629/subset-time-series-so-that-selected-rows-differs-by-a-certain-minimum-time); [How to filter rows based on difference in dates between rows in R?](https://stackoverflow.com/questions/39317354/how-to-filter-rows-based-on-difference-in-dates-between-rows-in-r?noredirect=1&lq=1) – Henrik Feb 27 '18 at 21:19

4 Answers


Here is a dplyr solution:

library(dplyr)

df %>%
    arrange(Species, DateTime) %>%
    group_by(Species) %>%
    mutate(
        DateTime = as.POSIXct(DateTime),
        # gap (difftime) to the previous picture within the species
        diff = abs(lag(DateTime) - DateTime),
        # the first picture of each species has no previous picture
        diff = ifelse(is.na(diff), 0, diff),
        # cumulative gap, integer-divided by 60
        cumdiff = cumsum(as.numeric(diff)) %/% 60,
        # non-zero whenever cumdiff has just increased
        x = abs(lag(cumdiff) - cumdiff)) %>%
    # keep the first picture per species and every picture where cumdiff changed
    filter(is.na(x) | x > 0) %>%
    select(ID, Species, DateTime) %>%
    ungroup() %>%
    as.data.frame()
#  ID Species            DateTime
#1 P1       A 2015-03-16 18:42:00
#2 P3       A 2015-03-16 19:58:00
#3 P4       A 2015-03-16 21:02:00
#4 P5       B 2015-03-16 21:18:00
#5 P9       B 2015-03-16 23:43:00

Sample data

df <- read.table(text = "ID   Species       DateTime
P1   A             '2015-03-16 18:42:00'
P2   A             '2015-03-16 19:34:00'
P3   A             '2015-03-16 19:58:00'
P4   A             '2015-03-16 21:02:00'
P5   B             '2015-03-16 21:18:00'
P6   A             '2015-03-16 21:19:00'
P7   A             '2015-03-16 21:33:00'
P8   B             '2015-03-16 21:35:00'
P9   B             '2015-03-16 23:43:00'", header = TRUE)
Maurits Evers
  • @Teresa *"It seems to have worked but, as with the other answers, I get a tibble that only shows the first 10 rows. 1) How can I see the rest of it, and 2) how can I use the tibble for further analyses?"* What do you mean??? *You* gave sample data with 10 rows; that was by *your* design. If `df` is your full `data.frame` the methods should work provided the rules still hold. A `tibble` is essentially the same as a `data.frame`. You can always convert it to the latter with `as.data.frame`. – Maurits Evers Feb 28 '18 at 08:52
  • I provided an example of the data frame, not the whole file. In the code that you provided, where should I include the 'as.data.frame' to convert it? Because I only need the records provided in the tibble – mto23 Feb 28 '18 at 09:33
  • @Teresa Ok, I think I understand. When you print a `tibble` it will only *show* the first 10 rows (and as many columns as fit on one screen); however it will still *contain* your full data. I've edited my solution to include a conversion to `data.frame`. Please take a look. – Maurits Evers Feb 28 '18 at 11:19
  • I'm not sure if I did it right because it didn't seem to be working (the "new" database didn't consider the filter). But it does work if I do " df1 <- df %>% ... ". So, problem solved! – mto23 Feb 28 '18 at 15:28
  • However, I was looking more carefully at the resulting database from the code that you provided and, although some pics are excluded, others still dependent are considered. I believe it is only excluding the next dependent picture, and not the set of pictures dependent on the first one. Do you know a way to fix it in the code? Thank you! – mto23 Feb 28 '18 at 15:42
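
One way to enforce the rolling 1 h window described in the question (so that the window restarts at each *kept* picture, not merely at the next row) is to carry the last kept timestamp along each group with purrr::accumulate. A minimal sketch, assuming DateTime converts cleanly with as.POSIXct:

library(dplyr)
library(purrr)

df %>%
    mutate(DateTime = as.POSIXct(DateTime)) %>%
    arrange(Species, DateTime) %>%
    group_by(Species) %>%
    # anchor = time (in seconds) of the last picture kept so far;
    # it only moves forward when the current picture is >= 3600 s after it
    mutate(anchor = accumulate(as.numeric(DateTime),
                               ~ if (.y - .x >= 3600) .y else .x)) %>%
    filter(as.numeric(DateTime) == anchor) %>%
    select(ID, Species, DateTime) %>%
    ungroup()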

There may be a more elegant way to do it, but this works:

library(dplyr)

# Returns a logical vector: TRUE for each timestamp that is at least
# one hour (60 * 60 seconds) after the last timestamp that was kept.
# Assumes the timestamps within each group are sorted in increasing order.
isHourApart <- function(dt) {
    last_kept <- 0
    keeps <- c()
    for (d in dt) {
        if (d >= last_kept + 60 * 60) {
            last_kept <- d
            keeps <- c(keeps, TRUE)
        } else {
            keeps <- c(keeps, FALSE)
        }
    }
    keeps
}


df %>% 
    group_by(Species) %>% 
    filter(isHourApart(DateTime))

# A tibble: 5 x 3
# Groups:   Species [2]
  ID    Species DateTime           
  <chr> <fct>   <dttm>             
1 P1    A       2015-03-16 18:42:00
2 P3    A       2015-03-16 19:58:00
3 P4    A       2015-03-16 21:02:00
4 P5    B       2015-03-16 21:18:00
5 P9    B       2015-03-16 23:43:00

Note that this assumes the DateTime column is already of class POSIXct.
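
If DateTime still arrives as character (for example straight from read.table), a conversion along these lines would be needed first (a sketch; the format string is an assumption based on the timestamps shown in the question):

# convert the character timestamps to POSIXct before filtering
df$DateTime <- as.POSIXct(df$DateTime, format = "%Y-%m-%d %H:%M:%S")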

C. Braun
  • I just included `arrange(Specie, DateTimeOriginal) %>%` before `group_by(Species) %>%`; otherwise, it was not deleting all records. Besides that, it works very well! – mto23 Feb 28 '18 at 16:37

Here's one way of doing it using data.table:

library(data.table)
library(lubridate)

df1 <- read.table(text = "ID   Species       DateTime
P1   A             '2015-03-16 18:42:00'
P3   A             '2015-03-16 19:58:00'
P4   A             '2015-03-16 21:02:00'
P5   B             '2015-03-16 21:18:00'
P9   B             '2015-03-16 23:43:00'",
                  header = TRUE, stringsAsFactors = FALSE)

setDT(df1)
df1[, DateTime := ymd_hms(DateTime)]      # parse the timestamps
df1[, date_range := DateTime + 60 * 60]   # end of each picture's 1-hour window
df2 <- copy(df1)
df2[, date := DateTime]                   # second copy with the timestamp renamed
df2[, DateTime := NULL]
# non-equi join: match pictures whose time falls within [DateTime, date_range]
df <- df2[df1, .(ID, Species, date = x.date, DateTime, date_range),
          on = .(ID, Species, date >= DateTime, date <= date_range),
          nomatch = 0L, allow.cartesian = TRUE]
df[, c("date", "date_range") := NULL]     # drop the helper columns

   ID Species            DateTime
1: P1       A 2015-03-16 18:42:00
2: P3       A 2015-03-16 19:58:00
3: P4       A 2015-03-16 21:02:00
4: P5       B 2015-03-16 21:18:00
5: P9       B 2015-03-16 23:43:00
sm925

We can simply create a new column with 60-minute intervals and then keep the first occurrence for each Species.

library(dplyr)

df %>%
  mutate(by60 = cut(DateTime, "60 min")) %>%
  group_by(Species, by60) %>%
  slice(1)

Output 1

# A tibble: 5 x 4
# Groups:   Species, by60 [5]
  ID    Species DateTime            by60               
  <chr> <chr>   <dttm>              <fct>              
1 P1    A       2015-03-16 18:42:00 2015-03-16 18:42:00
2 P3    A       2015-03-16 19:58:00 2015-03-16 19:42:00
3 P4    A       2015-03-16 21:02:00 2015-03-16 20:42:00
4 P5    B       2015-03-16 21:18:00 2015-03-16 20:42:00
5 P9    B       2015-03-16 23:43:00 2015-03-16 23:42:00

If we'd like to drop that dummy column:

df %>%
  mutate(by60 = cut(DateTime, "60 min")) %>%
  group_by(Species, by60) %>%
  slice(1) %>% 
  ungroup() %>% 
  select(-by60)

Output 2

# A tibble: 5 x 3
  ID    Species DateTime           
  <chr> <chr>   <dttm>             
1 P1    A       2015-03-16 18:42:00
2 P3    A       2015-03-16 19:58:00
3 P4    A       2015-03-16 21:02:00
4 P5    B       2015-03-16 21:18:00
5 P9    B       2015-03-16 23:43:00
mpalanco