2

My question is similar to dplyr: grouping and summarizing/mutating data with rolling time windows and I have used this for reference but have not been successful in manipulating it enough for what I need to do.

I have data that looks something like this:

a <- data.table("TYPE" = c("A", "A", "B", "B",
                       "C", "C", "C", "C",
                       "D", "D", "D", "D"), 
            "DATE" = c("4/20/2018 11:47",
                       "4/25/2018 7:21",
                       "4/15/2018 6:11",
                       "4/19/2018 4:22",
                       "4/15/2018 17:46",
                       "4/16/2018 11:59",
                       "4/20/2018 7:50",
                       "4/26/2018 2:55",
                       "4/27/2018 11:46",
                       "4/27/2018 13:03",
                       "4/20/2018 7:31",
                       "4/22/2018 9:45"),
            "CLASS" = c(1, 2, 3, 4,
                        1, 2, 3, 4,
                        1, 2, 3, 4))

From this I ordered the data first by TYPE and then by DATE and created a column that just contains the date and ignores the time from the DATE column:

a <- a[order(TYPE, DATE), ]
a[, YMD := date(a$DATE)]

Now I am trying to use the TYPE column and YMD column to produce a new column. Here is the criteria I am trying to meet:
1) Maintain all columns from the original data set
2) Create a new column called say EVENTS
3) For each TYPE if it occurs more than n times within 30 days then put Y in the EVENTS column for each TYPE and YMD that made the group qualify and N otherwise. (Note this is for n unique dates, so it must have n unique days within 30 days to qualify).

This would be the expected output if n = 4:

Expected Output

This is as close of an example that I have, but it does not account for unique days and it does not preserve all of the columns in the table:

a %>% mutate(DATE = as.POSIXct(DATE, format = "%m/%d/%Y %H:%M")) %>%
  inner_join(.,., by="TYPE") %>%
  group_by(TYPE, DATE.x) %>%
  summarise(FLAG = as.integer(sum(abs((DATE.x-DATE.y)/(24*60*60))<=30)>=4))

Any suggestions are appreciated.

Update

Both of the answers below worked for my original example data, however, if we add a few more instances of D then they both mark all of D as 1 instead of marking the first 4 instances 0 and the last 4 instances 1 this is where the "rolling window" comes into play.

Updated data set:

a <- data.table("TYPE" = c("A", "A", "B", "B",
                       "C", "C", "C", "C",
                       "D", "D", "D", "D",
                       "D", "D", "D", "D"), 
            "DATE" = c("4/20/2018 11:47",
                       "4/25/2018 7:21",
                       "4/15/2018 6:11",
                       "4/19/2018 4:22",
                       "4/15/2018 17:46",
                       "4/16/2018 11:59",
                       "4/20/2018 7:50",
                       "4/26/2018 2:55",
                       "4/27/2018 11:46",
                       "4/27/2018 13:03",
                       "4/20/2018 7:31",
                       "4/22/2018 9:45",
                       "6/01/2018 9:07",
                       "6/03/2018 12:34",
                       "6/07/2018 1:57",
                       "6/10/2018 2:22"),
            "CLASS" = c(1, 2, 3, 4,
                        1, 2, 3, 4,
                        1, 2, 3, 4,
                        1, 2, 3, 4))

The new update expected output would be:

Updated Expected Output

Bear
  • 662
  • 1
  • 5
  • 20

2 Answers2

1

Here is a solution with dplyr:

Update based on OP edit

library(dplyr)
library(lubridate)
a <- data.frame("TYPE" = c("A", "A", "B", "B",
                           "C", "C", "C", "C",
                           "D", "D", "D", "D",
                           "D", "D", "D", "D"), 
                "DATE" = c("4/20/2018 11:47",
                           "4/25/2018 7:21",
                           "4/15/2018 6:11",
                           "4/19/2018 4:22",
                           "4/15/2018 17:46",
                           "4/16/2018 11:59",
                           "4/20/2018 7:50",
                           "4/26/2018 2:55",
                           "4/27/2018 11:46",
                           "4/27/2018 13:03",
                           "4/20/2018 7:31",
                           "4/22/2018 9:45",
                           "6/01/2018 9:07",
                           "6/03/2018 12:34",
                           "6/07/2018 1:57",
                           "6/10/2018 2:22"),
                "CLASS" = c(1, 2, 3, 4,
                            1, 2, 3, 4,
                            1, 2, 3, 4,
                            1, 2, 3, 4))

# a function to flag rows that are 4th or more within window w
count_window <- function(df, date, w, type){
  min_date <- date - w
  df2 <- df %>% filter(TYPE == type, YMD >= min_date, YMD <= date)
  out <- n_distinct(df2$YMD)
  res <- ifelse(out >= 4, 1, 0)
  return(res)
}

v_count_window <- Vectorize(count_window, vectorize.args = c("date","type"))

res <- a %>% mutate(DATE = as.POSIXct(DATE, format = "%m/%d/%Y %H:%M")) %>%
  mutate(YMD = date(DATE)) %>% 
  arrange(TYPE, YMD) %>% 
  #group_by(TYPE) %>% 
  mutate(min_date = YMD - 30,
         count = v_count_window(., YMD, 30, TYPE)) %>% 
  group_by(TYPE) %>% 
  mutate(FLAG = case_when(
    any(count == 1) & YMD >= min_date[match(1,count)] ~ 1,
    TRUE ~ 0
  ))%>% 
  select(nms,FLAG)

I couldn't figure out how to use the group in a custom function so I hard coded the filtering by type into the function.

see24
  • 1,097
  • 10
  • 21
  • Is there a shortcut for selecting all of the columns from the original data set `a`? (So you do not have to type them all out)? – Bear Jul 18 '18 at 18:13
  • 1
    You could do something like `nms <- colnames()` at the beginning and then change the select at the end to `select(nms)` – see24 Jul 18 '18 at 18:17
  • Not to be petty, but if it worked could you accept my answer? Unless you are hoping for a data.table version? – see24 Jul 19 '18 at 15:58
1

Using data.table would be like this:

a[,DATE:=as.Date(a$DATE,format="%m/%d/%Y %H:%M")]
a <- a[order(TYPE, DATE), ]

fun1 <- function(x,n){ #Creating a function for any n
x[,.(DATE,CLASS, EVENTS=if((max(DATE)-min(DATE))<=30 #first condition
                    & (length(unique(DATE)))>=n) #second condition
                    1 else 0),by=TYPE]
}

fun1(a,4)
         TYPE       DATE CLASS EVENTS
 1:    A 2018-04-20     1      0
 2:    A 2018-04-25     2      0
 3:    B 2018-04-15     3      0
 4:    B 2018-04-19     4      0
 5:    C 2018-04-15     1      1
 6:    C 2018-04-16     2      1
 7:    C 2018-04-20     3      1
 8:    C 2018-04-26     4      1
 9:    D 2018-04-20     3      0
10:    D 2018-04-22     4      0
11:    D 2018-04-27     1      0
12:    D 2018-04-27     2      0
Chriss Paul
  • 1,101
  • 6
  • 19
  • @KAS based on your conditions, you don't have for category D more than 4 unique occurrences "within 30 days" (the range they happen in category D is 51 days) – Chriss Paul Jul 18 '18 at 20:38
  • 6/1/2018, 6/03/2018, 6/07/2018, 6/10/2018 qualifies as 1 set in a rolling 30-day window because there are 4 different dates that all occur within 30 days. – Bear Jul 18 '18 at 20:44
  • Basically it needs to start at the first date and check "Is there 4 dates within 30 days from this date? Yes or No" then move to next date. So, rolling date window of 30 days. – Bear Jul 18 '18 at 20:53