
This question is similar to one posted a few days ago: Collapse rows from 0 to 0.

The new twist, which distinguishes it from the previous question, is this: how do we collapse rows by ID for only those rows where the time difference is less than or equal to 60 seconds?

For example, using the same dataset

Incident.ID..                date           product
INCFI0000029582     2014-09-25 08:39:45     foo
INCFI0000029582     2014-09-25 08:39:45     foo
INCFI0000029582     2014-09-25 08:39:48     bar 
INCFI0000029582     2014-09-25 08:40:44     foo
INCFI0000029582     2014-10-10 23:04:00     foo
INCFI0000029587     2014-09-25 08:33:32     bar
INCFI0000029587     2014-09-25 08:34:41     bar
INCFI0000029587     2014-09-25 08:35:24     bar
INCFI0000029587     2014-10-10 23:04:00     foo


df <- structure(list(Incident.ID.. = c("INCFI0000029582", "INCFI0000029582","INCFI0000029582", 
"INCFI0000029582", "INCFI0000029582", "INCFI0000029587", "INCFI0000029587", 
"INCFI0000029587", "INCFI0000029587"), date = c("2014-09-25 08:39:45","2014-09-25 08:39:45", 
"2014-09-25 08:39:48", "2014-09-25 08:40:44", "2014-10-10 23:04:00", 
"2014-09-25 08:33:32", "2014-09-25 08:34:41", "2014-09-25 08:35:24", 
"2014-10-10 23:04:00"), product = 
c("foo","foo","bar","foo","foo","bar","bar","bar","foo")), 
class = "data.frame", row.names = c(NA, 
-9L))

This calculates the time difference by ID

library(dplyr)
library(lubridate)

df1 <- df %>%
  group_by(Incident.ID..) %>%
  arrange(ymd_hms(date)) %>%
  mutate(diff = c(0, diff(ymd_hms(date))))

This produces the new `diff` column shown below:

Incident.ID..   date                 product    diff
INCFI0000029582 2014-09-25 08:39:45  foo        0
INCFI0000029582 2014-09-25 08:39:45  foo        0
INCFI0000029582 2014-09-25 08:39:48  bar        3
INCFI0000029582 2014-09-25 08:40:44  foo        56
INCFI0000029582 2014-10-10 23:04:00  foo        1347796
INCFI0000029587 2014-09-25 08:33:32  bar        0
INCFI0000029587 2014-09-25 08:34:41  bar        69
INCFI0000029587 2014-09-25 08:35:24  bar        43
INCFI0000029587 2014-10-10 23:04:00  foo        1348116

Now, collapsing rows by `Incident.ID..` only where the time difference is less than or equal to 60 (i.e. `diff <= 60`) should result in a final dataset like this:

 Incident.ID..     DateMin              DateMax              product      diff_collapse
 INCFI0000029582   2014-09-25 08:39:45  2014-09-25 08:40:44  foo,bar,foo  0,0,3,56
 INCFI0000029582   2014-09-25 08:40:44  2014-10-10 23:04:00  foo          1347796
 INCFI0000029587   2014-09-25 08:33:32  2014-09-25 08:34:41  bar          0
 INCFI0000029587   2014-09-25 08:34:41  2014-09-25 08:35:24  bar,bar      69,43
 INCFI0000029587   2014-09-25 08:35:24  2014-10-10 23:04:00  foo          1348116

Looking for some help on how to create such a collapsed dataset. Thanks in advance.

  • In your result, I don't understand why `0` is included in the `diff_collapse` in row 1, but excluded in the next group (rows 3 and 4 are separate). – Gregor Thomas Feb 04 '19 at 20:17
  • 1
    Why is row 4 collapsed when there is a value above 60 (values are 69 and 43)? – tmfmnk Feb 04 '19 at 20:27

2 Answers


You need a grouping column that meets your needs:

... %>%
  mutate(grp = ifelse(diff <= 60,
                      paste0(Incident.ID.., "origin"),
                      paste0(Incident.ID.., diff)))

This creates a grouper that is the same (within `Incident.ID..`) for rows where `diff` is less than or equal to 60, and is unique otherwise. (This assumes the `diff` values are unique; if you might have duplicated diffs greater than 60, use `row_number()` instead of `diff` in the `paste0()` to make sure it is unique.) Use that as the grouping column for your collapse code.
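For reference, here is a self-contained sketch of this approach end to end, using base `as.POSIXct()` in place of `ymd_hms()` to avoid the lubridate dependency. Note the grouper is not contiguity-based: within INCFI0000029587, the rows with diffs 0 and 43 land in the same "origin" group even though a diff of 69 sits between them, which relates to the commenters' questions about the expected output.

```r
library(dplyr)

df <- structure(list(
  Incident.ID.. = c("INCFI0000029582", "INCFI0000029582", "INCFI0000029582",
                    "INCFI0000029582", "INCFI0000029582", "INCFI0000029587",
                    "INCFI0000029587", "INCFI0000029587", "INCFI0000029587"),
  date = c("2014-09-25 08:39:45", "2014-09-25 08:39:45", "2014-09-25 08:39:48",
           "2014-09-25 08:40:44", "2014-10-10 23:04:00", "2014-09-25 08:33:32",
           "2014-09-25 08:34:41", "2014-09-25 08:35:24", "2014-10-10 23:04:00"),
  product = c("foo", "foo", "bar", "foo", "foo", "bar", "bar", "bar", "foo")),
  class = "data.frame", row.names = c(NA, -9L))

res <- df %>%
  arrange(Incident.ID.., date) %>%
  group_by(Incident.ID..) %>%
  # difference in seconds to the previous row within each incident
  mutate(diff = c(0, diff(as.numeric(as.POSIXct(date, tz = "UTC"))))) %>%
  # shared grouper for diffs <= 60, unique grouper otherwise
  mutate(grp = ifelse(diff <= 60,
                      paste0(Incident.ID.., "origin"),
                      paste0(Incident.ID.., diff))) %>%
  group_by(Incident.ID.., grp) %>%
  summarise(DateMin = min(date),
            DateMax = max(date),
            product = toString(product),
            diff_collapse = toString(diff),
            .groups = "drop")

res
```

On this data the result has five collapsed rows, with all four `diff <= 60` rows of INCFI0000029582 merged into one (`diff_collapse` of `0, 0, 3, 56`).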


I'd suggest making a new grouping variable. I got the desired result like so:

df1 <- df %>%
  group_by(Incident.ID..) %>%
  arrange(ymd_hms(date)) %>%
  mutate(diff = c(0, diff(ymd_hms(date)))) %>%
  ungroup() %>%
  arrange(Incident.ID.., date) %>%
  mutate(group = cumsum(diff > 60 | diff == 0)) %>%
  group_by(group) %>%
  summarise(DateMin = min(date), 
            DateMax = max(date), 
            diff_collapse = toString(diff),
            product = toString(product))

I'm basically deciding at which row a new group should start via the condition `diff > 60 | diff == 0`: `diff > 60` because that's the collapse condition, and `diff == 0` because that's when a new incident starts. You could also write `Incident.ID.. != lag(Incident.ID..)`. Wrapping the condition in `cumsum()` increments the group counter every time a new group starts.

It's important to `ungroup()` first; otherwise the `cumsum()` runs separately within each group.
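A self-contained variant of this pipeline is sketched below, assuming dplyr and base `as.POSIXct()` in place of `ymd_hms()`. It swaps the `diff == 0` test for the explicit `Incident.ID.. != lag(Incident.ID..)` test mentioned above, so that a duplicate timestamp mid-incident (which also yields `diff == 0`) does not start a new group:

```r
library(dplyr)

df <- structure(list(
  Incident.ID.. = c("INCFI0000029582", "INCFI0000029582", "INCFI0000029582",
                    "INCFI0000029582", "INCFI0000029582", "INCFI0000029587",
                    "INCFI0000029587", "INCFI0000029587", "INCFI0000029587"),
  date = c("2014-09-25 08:39:45", "2014-09-25 08:39:45", "2014-09-25 08:39:48",
           "2014-09-25 08:40:44", "2014-10-10 23:04:00", "2014-09-25 08:33:32",
           "2014-09-25 08:34:41", "2014-09-25 08:35:24", "2014-10-10 23:04:00"),
  product = c("foo", "foo", "bar", "foo", "foo", "bar", "bar", "bar", "foo")),
  class = "data.frame", row.names = c(NA, -9L))

res <- df %>%
  mutate(date = as.POSIXct(date, tz = "UTC")) %>%
  arrange(Incident.ID.., date) %>%
  group_by(Incident.ID..) %>%
  # per-incident gap to the previous row, in seconds
  mutate(diff = c(0, diff(as.numeric(date)))) %>%
  ungroup() %>%
  # new group when the gap exceeds 60s or the incident changes;
  # default = "" makes the very first row start group 1
  mutate(group = cumsum(diff > 60 |
                        Incident.ID.. != lag(Incident.ID.., default = ""))) %>%
  group_by(Incident.ID.., group) %>%
  summarise(DateMin = min(date),
            DateMax = max(date),
            product = toString(product),
            diff_collapse = toString(diff),
            .groups = "drop")

res
```

With this condition the duplicate row of INCFI0000029582 stays in the first collapsed group, and the 69/43 rows of INCFI0000029587 collapse together, giving five output rows.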

  • That means the `cumsum` starts counting anew in every `Incident.ID..`. You could then solve that by including `Incident.ID..` in the `group_by`. – A. Stam Feb 04 '19 at 20:24
  • Ah, nevermind, I see. I still had my answer in mind which is using `paste` to make sure the different incidents don't get grouped. – Gregor Thomas Feb 04 '19 at 20:31
  • @A.Stam, thanks a ton for the proposed solution. However there is a minor glitch. This approach leaves out some rows that are within the time difference (delta) < 60. I have updated the dataset with one extra row where the time difference is 0 (basically duplicate, **2nd row in the dataset**) and the code is not including this case in the group or in other words not including this in the collapsed version. Sorry for a confusing explanation. Please run your code above and you will see what is missing compared to the expected output. – Sundown Brownbear Feb 12 '19 at 05:11