My question is similar to dplyr: grouping and summarizing/mutating data with rolling time windows and I have used this for reference but have not been successful in manipulating it enough for what I need to do.
I have data that looks something like this:
a <- data.table("TYPE" = c("A", "A", "B", "B",
"C", "C", "C", "C",
"D", "D", "D", "D"),
"DATE" = c("4/20/2018 11:47",
"4/25/2018 7:21",
"4/15/2018 6:11",
"4/19/2018 4:22",
"4/15/2018 17:46",
"4/16/2018 11:59",
"4/20/2018 7:50",
"4/26/2018 2:55",
"4/27/2018 11:46",
"4/27/2018 13:03",
"4/20/2018 7:31",
"4/22/2018 9:45"),
"CLASS" = c(1, 2, 3, 4,
1, 2, 3, 4,
1, 2, 3, 4))
From this I ordered the data first by TYPE
and then by DATE
and created a column that just contains the date and ignores the time from the DATE
column:
a <- a[order(TYPE, DATE), ]
a[, YMD := date(a$DATE)]
Now I am trying to use the TYPE
column and YMD
column to produce a new column. Here is the criteria I am trying to meet:
1) Maintain all columns from the original data set
2) Create a new column called say EVENTS
3) For each TYPE
if it occurs more than n
times within 30 days then put Y
in the EVENTS
column for each TYPE
and YMD
that made the group qualify and N
otherwise. (Note this is for n
unique dates, so it must have n
unique days within 30 days to qualify).
This would be the expected output if n = 4
:
This is as close of an example that I have, but it does not account for unique days and it does not preserve all of the columns in the table:
a %>% mutate(DATE = as.POSIXct(DATE, format = "%m/%d/%Y %H:%M")) %>%
inner_join(.,., by="TYPE") %>%
group_by(TYPE, DATE.x) %>%
summarise(FLAG = as.integer(sum(abs((DATE.x-DATE.y)/(24*60*60))<=30)>=4))
Any suggestions are appreciated.
Update
Both of the answers below worked for my original example data, however, if we add a few more instances of D
then they both mark all of D
as 1
instead of marking the first 4 instances 0
and the last 4 instances 1
this is where the "rolling window" comes into play.
Updated data set:
a <- data.table("TYPE" = c("A", "A", "B", "B",
"C", "C", "C", "C",
"D", "D", "D", "D",
"D", "D", "D", "D"),
"DATE" = c("4/20/2018 11:47",
"4/25/2018 7:21",
"4/15/2018 6:11",
"4/19/2018 4:22",
"4/15/2018 17:46",
"4/16/2018 11:59",
"4/20/2018 7:50",
"4/26/2018 2:55",
"4/27/2018 11:46",
"4/27/2018 13:03",
"4/20/2018 7:31",
"4/22/2018 9:45",
"6/01/2018 9:07",
"6/03/2018 12:34",
"6/07/2018 1:57",
"6/10/2018 2:22"),
"CLASS" = c(1, 2, 3, 4,
1, 2, 3, 4,
1, 2, 3, 4,
1, 2, 3, 4))
The new update expected output would be: