2

I am trying to group events based on their time of occurrence. To achieve this, I simply calculate a diff over the timestamps and want to essentially start a new group if the diff is larger than a certain value. I would have tried like the code below. However, this is not working since the dialog variable is not available during the mutate it is created by.

library(tidyverse)

df <- data.frame(time = c(1,2,3,4,5,510,511,512,513), id = c(1,2,3,4,5,6,7,8,9))

> df
  time id
1    1  1
2    2  2
3    3  3
4    4  4
5    5  5
6  510  6
7  511  7
8  512  8
9  513  9

df <- df %>% 
  mutate(t_diff = c(NA, diff(time))) %>% 
  # This generates an error as dialog is not available as a variable at this point
  mutate(dialog = ifelse(is.na(t_diff), id, ifelse(t_diff >= 500, id, lag(dialog, 1))))

# This is the desired result

> df
  time id t_diff dialog
1    1  1     NA      1
2    2  2      1      1
3    3  3      1      1
4    4  4      1      1
5    5  5      1      1
6  510  6    505      6
7  511  7      1      6
8  512  8      1      6
9  513  9      1      6

In words, I want to add a column that points to the first element of each group. Thereby, the groups are distinguished at points at which the diff to the previous element is larger than 500.

Unfortunately, I have not found a clever workaround to achieve this in an efficient way using dplyr. Obviously, iterating over the data.frame with a loop would work, but would be very inefficient.

Is there a way to achieve this in dplyr?

sge
  • 7,330
  • 1
  • 15
  • 18
  • 1
    `df %>% mutate(group = cumsum(coalesce(t_diff, 0) > 500)) %>% group_by(group) %>% mutate(dialog = first(id))`? – Axeman Jun 17 '19 at 20:09
  • In base R `transform(df, dialog = ave(id, cumsum(c(0, diff(df$time)) > 500), FUN = function(x) x[1]))` perhaps – markus Jun 17 '19 at 20:10
  • @Axeman This works brilliantly! Thank you! – sge Jun 17 '19 at 21:29

0 Answers0