Find overlaps in data table

Question

I have some data with IDs, dates and integer values for associated ID and start date combinations, there are multiple dates per ID.

I would like to create a column indicating:

1) Telling me if an ID has a sum >= 14 from the integers, or a count of 4 separate integers in a 12 month period.

There is a similar question here, but the categories of mine are a bit more complicated: Create new column based on condition that exists within a rolling date

Any help greatly appreciated!

Here is a dput of some data:

structure(list(ID = c("90939293", "90963328", "90092983", 
"90032926", "90944838", "90092983", "90062392", "90224939", "90202398", 
"90926203", "90936043", "90329263", "90944838", "90232033", "90980903", 
"90924463", "90299292", "90933383", "90209349", "90092983", "90022988", 
"90022293", "90933383", "90092983", "90299240", "90963033", "90004923", 
"90292998", "90986096", "90980903", "90336692", "90933383", "90022988", 
"90069992", "90062392", "90209248", "90924463", "90092983", "90933383", 
"90022293", "90062392", "90004923", "90233269", "90329263", "90229202", 
"90309943", "90299292", "90036820", "90329263", "90232033", "90329263", 
"90336692", "90963033", "90224939", "90924463", "90069992", "90092983", 
"90934923", "90926203", "90222333", "90092983", "90299292", "90202398", 
"90004923", "90233269", "90926203", "90222333", "90224939", "90232033", 
"90933383", "90022293", "90022988", "90934923", "90069992", "90329263", 
"90209349", "90022293", "90309943", "90299240", "90022293", "90336692", 
"90020334", "90933383", "90290384", "90224939", "90980903", "90299240", 
"90299292", "90202398", "90022346"), Date = structure(c(15972, 
16009, 16010, 16010, 16007, 16010, 16006, 16010, 16007, 16008, 
15997, 16007, 16007, 16002, 16008, 16006, 16006, 16006, 16009, 
16010, 16006, 16006, 16006, 16010, 15995, 16008, 16008, 16010, 
16009, 16008, 16010, 16006, 16006, 16009, 16006, 16006, 16006, 
16010, 16006, 16006, 16006, 16008, 16009, 16007, 16010, 16007, 
16006, 16009, 16007, 16002, 16007, 16010, 16008, 16010, 16006, 
16009, 16010, 15936, 16008, 16008, 16010, 16006, 16007, 16008, 
16009, 16008, 16008, 16010, 16002, 16006, 16006, 16006, 15936, 
16009, 16007, 16009, 16006, 16007, 15995, 16006, 16010, 16006, 
16006, 16010, 16010, 16008, 15995, 16006, 16007, 16008), class = "Date"), 
    Integer = c(39, 2, 1, 1, 4, 1, 5, 1, 4, 3, 14, 4, 4, 9, 
    3, 5, 5, 5, 2, 1, 5, 5, 5, 1, 16, 3, 3, 1, 2, 3, 1, 5, 5, 
    2, 5, 5, 5, 1, 5, 5, 5, 3, 2, 4, 1, 4, 5, 2, 4, 9, 4, 1, 
    3, 1, 5, 2, 1, 75, 3, 3, 1, 5, 4, 3, 2, 3, 3, 1, 9, 5, 5, 
    5, 75, 2, 4, 2, 5, 4, 16, 5, 1, 5, 5, 1, 1, 3, 16, 5, 4, 
    3)), .Names = c("ID", "Date", "Integer"
), row.names = c("200086", "200066", "200050", "200064", "200078", 
"200050.1", "200069", "200082", "200083", "200053", "200056", 
"200055", "200078.1", "200079", "200051", "200089", "200052", 
"200057", "200061", "200050.2", "200060", "200080", "200057.1", 
"200050.3", "200068", "200071", "200070", "200059", "200062", 
"200051.1", "200067", "200057.2", "200060.1", "200072", "200069.1", 
"200073", "200089.1", "200050.4", "200057.3", "200080.1", "200069.2", 
"200070.1", "200081", "200054", "200063", "200075", "200052.1", 
"200074", "200054.1", "200079.1", "200055.1", "200067.1", "200071.1", 
"200082.1", "200089.2", "200072.1", "200050.5", "200084", "200053.1", 
"200088", "200050.6", "200052.2", "200083.1", "200070.2", "200081.1", 
"200053.2", "200088.1", "200082.2", "200079.2", "200057.4", "200080.2", 
"200060.2", "200084.1", "200072.2", "200055.2", "200061.1", "200080.3", 
"200075.1", "200068.1", "200080.4", "200067.2", "200065", "200057.5", 
"200090", "200082.3", "200051.2", "200068.2", "200052.3", "200083.2", 
"200076"), class = "data.frame")

"there are multiple dates per ID" - `any(duplicated(df$X1))` disagrees with you for your sample data. Your IDs (first column, I assume, they're just called `X1` in your example) are unique. Or did you mean multiple IDs for some dates? Either way, make a **small** example rather than 100 lines. — Spacedman, Oct 25 '16 at 14:28
This isn't clear: "Telling me if an ID has a sum of 14 integers or 4 separate integers in a 12 month period". What does "a sum of 14 integers" mean? 1+2+3+4+1+2+3+4+1+2+3+4+7+99 is a sum of 14 integers. You don't mean that do you? — Spacedman, Oct 25 '16 at 14:32
I think you are probably asking too many questions here, and SO discourages partial answers, so unless one person works their way through all your problems you won't get any answers. Suggest you delete this post and create several - the first will be how to find which IDs have sums of their `Integer` column values equal to 14. — Spacedman, Oct 26 '16 at 11:49
@Spacedman Hi, I have updated the question to reflect your comments — user124123, Oct 26 '16 at 14:46
that "count of 4 separate integers" needs clarification. Do you mean that from all the rows of a given ID, it is possible to find two dates exactly one year apart that contain the dates from exactly four rows with different integer values? Or any integer values? — Spacedman, Oct 26 '16 at 15:11
Any integer values. So I guess an interpretation is does an ID appear 4 times within a 12 month period. — user124123, Oct 26 '16 at 15:23
I guess you've used the site long enough to be familiar with the request for desired output...? — Frank, Oct 28 '16 at 15:40
This example can't be right because there are no instances of multiple unique dates for a single ID. If it's important enough for a bounty, why not take the time to write the question and example clearly so that we can help you. — C8H10N4O2, Oct 28 '16 at 17:24

score 2 · Answer 1 · edited May 23 '17 at 12:33

Here's a stab at what you're asking for. Now, finding IDs with the sum of their Integer greater than 14 is as easy as grouping by ID and checking if the sum of the Integer column for each ID is >= 14, or in dplyr: df %>% group_by(ID) %>% mutate(conditional = sum(Integer) >= 14). Finding an ID with (at least?) 4 in a 12 month period is obviously harder. My solution follows this answer in working out the windowing count.

There's just one caveat: since roll_sum works by rolling over the number of rows, the solution I use relies on there being one row only per day per ID. In your sample dataframe there's actually multiple entries for the same ID date, but they seem to be duplicates, so I removed them. In case they are not, and the duplicate values need to be computed for the condition of sum(Integer) >= 14, they can, instead of being removed, be summed up beforehand (e.g.: df %>% group_by(ID, Date) %>% summarize(Integer = sum(Integer))) so that there's only one entry per ID per Date.

library(dplyr)
library(tidyr)
library(RcppRoll)

df_tmp <- df
df <- df_tmp  %>% 
  group_by(ID, Date) %>% 
  filter(n() == 1) %>% # this line removes duplicate columns 
  ungroup() %>%
  complete(ID, 
           Date=seq(from=min(Date)-365,to=max(Date), by=1), 
           fill=list(Integer=0)) %>% # we use complete to add in a row for all IDs for every single date since a year before the first obs.
  arrange(ID, Date) %>%
  group_by(ID) %>% 
  mutate(roll_count = roll_sum(x = Integer != 0, n = 365, fill=0, align="right"), # this calculates the rolling sum using n = 365 as a stand-in for 12 months
         conditional = sum(Integer) >= 14 || roll_count >= 4 ) %>% 
  ungroup() %>%
  right_join(df, by = c("ID","Date", "Integer")) # right_join with the original data to remove dummy dates

Hope this helps!

score 2 · Accepted Answer · answered Oct 31 '16 at 08:32

with your dput as 'x' :

library(data.table)

setDT(x, key = "Date")

# test 1
x[, `:=` (
  test1 = sum(Integer) >= 14
), by = ID]

# test2
y = x[, .(
  count12 = uniqueN(Integer)
  ), by = .(start = Date, end = Date - 365)]

# combine
z = merge(x, y, by.x = "Date", by.y = "start")
z[, end := NULL]
z[, flag := test1 | count12 == 4]

Find overlaps in data table

2 Answers2