
I need to identify series of events (maximum 3 events per series) that occurred within 60 seconds.

Here is the IN data:

IN<-read.table(header = FALSE, text = "
2018-06-01_04:29:47
2018-06-01_05:44:41
2018-06-01_05:44:43
2018-06-01_05:44:45
2018-06-01_05:57:54
2018-06-01_05:57:56
2018-06-01_05:57:58
2018-06-01_08:10:35
2018-06-01_08:41:20
2018-06-01_08:41:22
2018-06-01_08:41:24
2018-06-01_08:52:01
2018-06-01_09:02:13
2018-06-01_09:22:45", quote="\n",col.names="time")

IN$time<-as.POSIXct(IN$time, "%Y-%m-%d_%H:%M:%S",tz="")

and here is the desired output:

OUT<-read.table(header = FALSE, text = "
2018-06-01_04:29:47        1
2018-06-01_05:44:41        1
2018-06-01_05:44:43        2
2018-06-01_05:44:45        3
2018-06-01_05:57:54        1
2018-06-01_05:57:56        2
2018-06-01_05:57:58        3
2018-06-01_08:10:35        1
2018-06-01_08:41:20        1
2018-06-01_08:41:22        2
2018-06-01_08:41:24        3
2018-06-01_08:52:01        1
2018-06-01_09:02:13        1
2018-06-01_09:22:45        1
",quote="\n",col.names=c("time","response"))

I have searched for similar questions, but without success. I guess that the function diff is the first step towards solving this problem:

response<-as.numeric(diff(IN$time)>60)

but then I am not sure how to proceed from there to get the desired output.
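For instance, I imagine the next step might be something like this (just a rough sketch, I am not sure it is the right direction):

gap <- c(Inf, diff(as.numeric(IN$time)))  # seconds since the previous event
grp <- cumsum(gap > 60)                   # a new group starts when the gap exceeds 60 seconds

but I still do not see how to turn grp into the counts shown in OUT.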

Any help will be appreciated.

stefano

2 Answers


Here's a solution using the dplyr, magrittr, and lubridate packages.

library(dplyr)
library(magrittr)
library(lubridate)

IN<-read.table(header = FALSE, text = "2018-06-01_04:29:47
               2018-06-01_05:44:41
               2018-06-01_05:44:43
               2018-06-01_05:44:45
               2018-06-01_05:57:54
               2018-06-01_05:57:56
               2018-06-01_05:57:58
               2018-06-01_08:10:35
               2018-06-01_08:41:20
               2018-06-01_08:41:22
               2018-06-01_08:41:24
               2018-06-01_08:52:01
               2018-06-01_09:02:13
               2018-06-01_09:22:45", quote="\n",col.names="time")

IN$time<-as.POSIXct(IN$time, "%Y-%m-%d_%H:%M:%S",tz="")

I've removed the blank first line of the input data frame, as it caused problems. The following function filters the data frame to those elements within 60 seconds before the given ref_time and counts the number of rows using nrow.

event_count <- function(ref_time){
  IN %>% filter(time %within% interval(ref_time - 60, ref_time)) %>% nrow
}
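For instance, calling it on a single timestamp, e.g. the fourth one (which closes a cluster of three events within a minute), should give:

event_count(IN$time[4])
# [1] 3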

Here, I apply the function in a row-wise fashion, record the counts, and sort by time (probably unnecessary). The results are piped back into the input data frame using the compound assignment pipe (`%<>%`) from magrittr.

IN %<>% 
  rowwise() %>% 
  mutate(counts = event_count(time)) %>% 
  arrange(time)

Finally, the results.

# A tibble: 14 x 2
#    time                counts
#    <dttm>               <int>
# 1  2018-06-01 04:29:47      1
# 2  2018-06-01 05:44:41      1
# 3  2018-06-01 05:44:43      2
# 4  2018-06-01 05:44:45      3
# 5  2018-06-01 05:57:54      1
# 6  2018-06-01 05:57:56      2
# 7  2018-06-01 05:57:58      3
# 8  2018-06-01 08:10:35      1
# 9  2018-06-01 08:41:20      1
# 10 2018-06-01 08:41:22      2
# 11 2018-06-01 08:41:24      3
# 12 2018-06-01 08:52:01      1
# 13 2018-06-01 09:02:13      1
# 14 2018-06-01 09:22:45      1

I think what @PoGibas is alluding to is that, for some reason, there are two entries with the time 2018-06-01 05:57:54 in the input data frame. I'm not sure where the second one comes from...


EDIT: It's the leading newline in the read.table text that messes it up.

EDIT²: This returns a maximum of 3...

event_count <- function(ref_time){
  min(IN %>% filter(time %within% interval(ref_time - 60, ref_time)) %>% nrow, 3)
}
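Note that `min()` only caps the count: a fourth event falling inside the same 60-second window would also be reported as 3 rather than restarting at 1.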
Dan
  • I'm not sure this does it, as the count isn't reset when you go from line 4 to line 5 – zack Jun 06 '18 at 17:23
  • 1
    @zack Because there are two counts at `2018-06-01 05:57:54`, as I mentioned. When I run the given code two appear for some reason, although it clearly isn't in the data frame. Bizarre. Try running the `read.table` code... It's not just me, right?! – Dan Jun 06 '18 at 17:24
  • Ah, strange, but I see what you're talking about. I've never seen `rowwise()` before. Nice. As a note, OP did say "maximum 3 events", so it may not completely answer the question, but it's certainly movement in the right direction – zack Jun 06 '18 at 17:29
  • I got this error when running the code of @Lyngbakr: Error in c(1527820187, 1527824681, 1527824683, 1527824685, 1527825474, : error in evaluating the argument 'b' in selecting a method for function '%within%': Error in as.POSIXct.numeric(x, tz = tz) : 'origin' must be supplied – stefano Jun 06 '18 at 17:51
  • @stefano Weird. I just reran it with a fresh session (and relevant library calls at the beginning) and it worked fine. Do you have another package loaded with an `interval` function masking the one from `lubridate`? – Dan Jun 06 '18 at 17:54
  • @Lyngbakr, it is also weird because I have no idea where the first record in line 1 of the OUT comes from when running the code. No idea, really! – stefano Jun 06 '18 at 18:01
  • @Lyngbakr, no other package with an interval function is loaded. – stefano Jun 06 '18 at 18:06
  • @stefano Can you try with a fresh, clean session and just the code above (+ the library calls)? Unfortunately, I can't reproduce the problem. – Dan Jun 06 '18 at 18:09
  • 1
    @Lyngbakr. Solved, it was necessary to update the dplyr package. Very nice and elegant solution. – stefano Jun 06 '18 at 18:25

Here's a data frame with some edge cases:

IN<-read.table(header = FALSE, text = "2018-06-01_04:29:47
           2018-06-01_05:44:41
           2018-06-01_05:44:43
           2018-06-01_05:44:45
           2018-06-01_05:44:47
           2018-06-01_05:57:54
           2018-06-01_05:57:56
           2018-06-01_05:57:58
           2018-06-01_05:58:56
           2018-06-01_08:10:35
           2018-06-01_08:41:20
           2018-06-01_08:41:22
           2018-06-01_08:41:24
           2018-06-01_08:52:01
           2018-06-01_09:02:13
           2018-06-01_09:22:45", quote="\n",col.names="time")

IN$time<-as.POSIXct(IN$time, "%Y-%m-%d_%H:%M:%S",tz="")

IN
                  time
1  2018-06-01 04:29:47
2  2018-06-01 05:44:41
3  2018-06-01 05:44:43
4  2018-06-01 05:44:45
5  2018-06-01 05:44:47
6  2018-06-01 05:57:54
7  2018-06-01 05:57:56
8  2018-06-01 05:57:58
9  2018-06-01 05:58:56
10 2018-06-01 08:10:35
11 2018-06-01 08:41:20
12 2018-06-01 08:41:22
13 2018-06-01 08:41:24
14 2018-06-01 08:52:01
15 2018-06-01 09:02:13
16 2018-06-01 09:22:45

You'll notice row 9 (05:58:56) is within a minute of the last event of its group but more than a minute after the group's first (reference) event. Row 5 is also the 4th member of what would be a single group if no maximum of 3 were imposed.

Here's my solution using dplyr. I think it works generally speaking:

library(dplyr)

res <- IN %>% mutate(diffs = as.numeric(time - lag(time)),        # seconds since the previous event
                     helper1 = case_when(is.na(diffs) ~ 1,        # first row starts a group
                                         diffs <= 60 ~ 0,         # within 60 s of the previous event
                                         TRUE ~ 1),               # gap > 60 s starts a new group
                     grouper1 = cumsum(helper1)) %>%              # running id of the gap-based groups
  group_by(grouper1) %>%
  mutate(helper2 = cumsum(diffs) - first(diffs),                  # seconds elapsed since the group's first event
         helper3 = helper2 %/% 60,                                # complete 60-s windows elapsed in the group
         helper4 = helper1 + if_else(is.na(helper3), 0, helper3)) %>%
  ungroup() %>%
  mutate(grouper2 = cumsum(helper4)) %>%                          # id that also advances with each new 60-s window
  group_by(grouper2) %>%
  mutate(rn0 = row_number() - 1,
         grouper3 = rn0 %/% 3) %>%                                # chunks of at most 3 rows per window
  group_by(grouper2, grouper3) %>%
  mutate(count = row_number()) %>%                                # position (1-3) within each chunk
  ungroup() %>%
  select(time, count)

The result:

> res
# A tibble: 16 x 2
   time                count
   <dttm>              <int>
 1 2018-06-01 04:29:47     1
 2 2018-06-01 05:44:41     1
 3 2018-06-01 05:44:43     2
 4 2018-06-01 05:44:45     3
 5 2018-06-01 05:44:47     1
 6 2018-06-01 05:57:54     1
 7 2018-06-01 05:57:56     2
 8 2018-06-01 05:57:58     3
 9 2018-06-01 05:58:56     1
10 2018-06-01 08:10:35     1
11 2018-06-01 08:41:20     1
12 2018-06-01 08:41:22     2
13 2018-06-01 08:41:24     3
14 2018-06-01 08:52:01     1
15 2018-06-01 09:02:13     1
16 2018-06-01 09:22:45     1

I think I structured the dplyr calls in a way you can follow, but if you have questions, feel free to post them in the comments.

zack
  • I got this error when running your code: Error in mutate_impl(.data, dots) : could not find function "case_when". No idea why I am getting this error. – stefano Jun 06 '18 at 18:08
  • Is `dplyr` loaded? Just making sure (i.e., `library(dplyr)`). If it is, is it possible it's not up to date? `case_when` is from the `dplyr` package – zack Jun 06 '18 at 18:10
  • Yes, it's loaded. Maybe the masked objects are the culprits? > library(dplyr) Attaching package: ‘dplyr’ The following object is masked from ‘package:stats’: filter The following objects are masked from ‘package:base’: intersect, setdiff, setequal, union – stefano Jun 06 '18 at 18:13
  • hmmm, try replacing `case_when` with `dplyr::case_when` – zack Jun 06 '18 at 18:13