0

I have a dataframe with multiple columns where one of the columns is of class POSIXct. I would like to remove rows from my dataframe where the row's date/time (as determined from the POSIXct column) has not been preceded by a date/time within the past 24 hours, NOT including the previous 3 hours.

In Excel, I can do this rather easily by creating a new column like this:

=IF(COUNTIFS(datetimecolumn, "<" & currentdatetime, datetimecolumn, ">" & (currentdatetime-1), datetimecolumn, "<" & (currentdatetime-3/24)) > 0, 1, 0)

and then removing accordingly.

I can see doing a "for-loop" with "if-statements" in R and accomplishing the same task, but I'm wondering if there is a more parsimonious method using, say, data.table or dplyr. Here is an example of what my data looks like with my Excel solution at the far right, where the 0's are keepers, and the 1's are to be removed.

datetime       test
7/24/2012 12:15 0 #First point, so no issues
7/24/2012 15:00 0 #Even though this point is within 24 hours of the previous point, it is less than 3 hours, so it's OK
7/24/2012 15:15 0 #Ditto for this point
7/24/2012 15:30 1 #Now this point is out of the three hour window, so it's bad
7/24/2012 16:00 1 #Ditto for this point
7/24/2012 17:00 1 #Ditto for this point
7/24/2012 17:30 1 #Ditto for this point
7/28/2012 20:15 0 #This point has no previous points within 24 hours, so OK
7/29/2012 6:30  1 #This point has a previous point within 24 hours that is also not in a previous 3 hour window, so it's bad
7/30/2012 16:30 0 #This point has no previous points within 24 hours, so OK
7/30/2012 16:45 0
7/30/2012 17:00 0
7/30/2012 17:15 0
7/30/2012 17:30 0
7/30/2012 17:45 0
7/30/2012 18:00 0
7/30/2012 18:15 0
7/31/2012 16:45 1
8/2/2012 20:15  0
8/3/2012 16:00  1
8/4/2012 17:45  0
8/4/2012 18:00  0
8/4/2012 18:30  0
8/4/2012 19:15  0
8/4/2012 19:30  0
8/4/2012 19:45  0
8/4/2012 20:30  0
8/5/2012 9:15   1
8/5/2012 9:30   1

Any help is greatly appreciated. Thank you!


Data, kindly provided by @jeremycg:

data = structure(list(datetime = structure(c(1343146500, 1343156400, 
1343157300, 1343158200, 1343160000, 1343163600, 1343165400, 1343520900, 
1343557800, 1343680200, 1343681100, 1343682000, 1343682900, 1343683800, 
1343684700, 1343685600, 1343686500, 1343767500, 1343952900, 1344024000, 
1344116700, 1344117600, 1344119400, 1344122100, 1344123000, 1344123900, 
1344126600, 1344172500, 1344173400), class = c("POSIXct", "POSIXt"
), tzone = ""), test = c(0L, 0L, 0L, 1L, 1L, 1L, 1L, 0L, 1L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 1L, 1L)), .Names = c("datetime", "test"), row.names = c(NA, 
-29L), class = "data.frame")
Frank
  • 66,179
  • 8
  • 96
  • 180
  • I do not understand the "NOT including the previous 3 hours" part of the condition. – Roland Oct 09 '15 at 17:10
  • Maybe `dput` your data so we can test, and comment on the rows in your code block above explaining why test = 0 or 1, where appropriate to illustrate. – Frank Oct 09 '15 at 17:12
  • I think it's "Take the first data point, and any within 3 hours of it, then no data points until the next one at least 24 hours after the last taken time point." Is this correct? – jeremycg Oct 09 '15 at 17:13
  • convert your first column to dates using as.Date(), it appears from your excel that you're using something called 'currenttime' this is Sys.time() in R which means that you can create your function. You can then sapply() your function to your data.frame. In fact, you don't even need the apply functions or to create a function, if you put the logical expression in df[logical expression, ] it will return the row if true and leave it out if blank. We're all struggling to work out your logical expression though... – DarrenRhodes Oct 09 '15 at 17:14
  • Hey y'all, thanks for your comments. @jeremycg has it right, Roland. Frank, the code block that I pasted uses the Excel solution that I pasted, where in the Excel solution "datetimecolumn" is "datetime" and "currentdatetime" is the datetime at each row. – user2943039 Oct 09 '15 at 17:19
  • If @jeremycg has it right than I don't understand how the given solution matches this condition. Maybe I'm just too tired. – Roland Oct 09 '15 at 17:25
  • @Roland - Actually I see that jeremycg is not exactly right. The data is sorted chronologically. The first condition that I have is that I only want rows where there are no preceding rows with 24 hours of the current row. An example of this is the "1" in "test" on 7/29/2012 6:30, because there is a row within the previous 24 hours (7/28/2012 20:15). The second condition is that it's OK to have rows in the preceding 24 hours, as long as they within 3 hours of the current row. An example of this are the "0's" on 7/24/2012 15:00 and 7/24/2012 15:15. – user2943039 Oct 09 '15 at 18:20
  • @Frank I've added comments to help explain why test = 0 or 1. To reiterate, the "test" column uses my Excel code solution. One confusing thing about this is that 1s are bad and 0s are good. – user2943039 Oct 09 '15 at 18:44

1 Answers1

1

I think this is what you want. First, convert your data to proper date format:

data$datetime <- as.POSIXct(data$datetime, format = "%m/%d/%Y %R")

Then we make a column which finds each time point without anything within 24 hours then takes the cumsum to group_by (initialgroup). Then we find each of the members of these groups that are within 3 hours of the start.

I think your use of 0 to keep and 1 to exclude is causing a little confusion, as R has the opposite as its default (ie as.numeric(TRUE) is 1), but I'll keep it your way.

library(dplyr)

data %>% mutate(initialgroup = cumsum(c(24*60, diff(datetime)) >= 24*60)) %>%
         group_by(initialgroup) %>%
         mutate(ingroup = +((datetime - datetime[1]) > 180*60)) 

Which gives:

              datetime test initialgroup ingroup
1  2012-07-24 12:15:00    0            1       0
2  2012-07-24 15:00:00    0            1       0
3  2012-07-24 15:15:00    0            1       0
4  2012-07-24 15:30:00    1            1       1
5  2012-07-24 16:00:00    1            1       1
6  2012-07-24 17:00:00    1            1       1
7  2012-07-24 17:30:00    1            1       1
8  2012-07-28 20:15:00    0            2       0
9  2012-07-29 06:30:00    1            2       1
10 2012-07-30 16:30:00    0            3       0
11 2012-07-30 16:45:00    0            3       0
12 2012-07-30 17:00:00    0            3       0
13 2012-07-30 17:15:00    0            3       0
14 2012-07-30 17:30:00    0            3       0
15 2012-07-30 17:45:00    0            3       0
16 2012-07-30 18:00:00    0            3       0
17 2012-07-30 18:15:00    0            3       0
18 2012-07-31 16:45:00    1            3       1
19 2012-08-02 20:15:00    0            4       0
20 2012-08-03 16:00:00    1            4       1
21 2012-08-04 17:45:00    0            5       0
22 2012-08-04 18:00:00    0            5       0
23 2012-08-04 18:30:00    0            5       0
24 2012-08-04 19:15:00    0            5       0
25 2012-08-04 19:30:00    0            5       0
26 2012-08-04 19:45:00    0            5       0
27 2012-08-04 20:30:00    0            5       0
28 2012-08-05 09:15:00    1            5       1
29 2012-08-05 09:30:00    1            5       1

data used (after data time conversion):

structure(list(datetime = structure(c(1343146500, 1343156400, 
1343157300, 1343158200, 1343160000, 1343163600, 1343165400, 1343520900, 
1343557800, 1343680200, 1343681100, 1343682000, 1343682900, 1343683800, 
1343684700, 1343685600, 1343686500, 1343767500, 1343952900, 1344024000, 
1344116700, 1344117600, 1344119400, 1344122100, 1344123000, 1344123900, 
1344126600, 1344172500, 1344173400), class = c("POSIXct", "POSIXt"
), tzone = ""), test = c(0L, 0L, 0L, 1L, 1L, 1L, 1L, 0L, 1L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 1L, 1L)), .Names = c("datetime", "test"), row.names = c(NA, 
-29L), class = "data.frame")
jeremycg
  • 24,657
  • 5
  • 63
  • 74
  • Thank you, I will test this and get back to you if it works! – user2943039 Oct 09 '15 at 18:22
  • Better to use `as.integer` than `as.numeric` if converting logical to 0/1. Btw, I copied your dput into the OP. – Frank Oct 09 '15 at 18:52
  • Thanks, I'll use the `+` you told me a while back I think – jeremycg Oct 09 '15 at 18:53
  • Ok cool. I'm all for unary `+`, but `as.integer` is no worse, practically, and won't attract the ire of anti-code-golfers. – Frank Oct 09 '15 at 18:54
  • @jeremycg - this code is not working for me. I am getting an error: Error in UseMethod("mutate_") : no applicable method for 'mutate_' applied to an object of class "function" – user2943039 Oct 10 '15 at 02:00
  • is your data called `data` or something else? – jeremycg Oct 10 '15 at 02:09
  • @jeremycg,thank you so much. I got it working. I really appreciate learning from your code, specifically the %>% operator and also the clever use of cumsum(diff())!! – user2943039 Oct 12 '15 at 13:43