Add conditional group identifier using rollup functions

Question

I have a data frame that has sub sequences (groups of rows) and the condition to identify these sub sequences is to watch for a surge in the column diff. This is what the data looks like :

> dput(test)
structure(list(vid = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), 
    .Label = "2a38ebc2-dd97-43c8-9726-59c247854df5", class = "factor"), 
    events = structure(c(3L, 2L, 4L, 1L, 3L, 2L, 4L, 1L, 3L, 
    2L, 4L, 1L, 3L, 2L, 4L, 1L, 3L, 2L, 4L, 1L), .Label = c("click", 
    "mousedown", "mousemove", "mouseup"), class = "factor"), 
    deltas = structure(6:25, .Label = c("154875", "154878", "154880", 
    "155866", "155870", "38479", "38488", "38492", "38775", "45595", 
    "45602", "45606", "45987", "50280", "50285", "50288", "50646", 
    "54995", "55001", "55005", "55317", "59528", "59533", "59537", 
    "59921", "63392", "63403", "63408", "63822", "66706", "66710", 
    "66716", "67002", "73750", "73755", "73759", "74158", "77999", 
    "78003", "78006", "78076", "81360", "81367", "81371", "82381", 
    "93365", "93370", "93374", "93872"), class = "factor"), 
    serial = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 
    19, 20), diff = c(0, 9, 4, 283, 6820, 7, 4, 381, 4293, 5, 3, 358, 4349, 6, 4,
    312, 4211, 5, 4, 384)), 
    .Names = c("vid", "events", "deltas", "serial", "diff"),
    row.names = c(NA, 20L), class = "data.frame")

I am trying to add a column that will indicate when a new sub sequence is identified and assign the entire sub sequence a unique id. I'll demonstrate the criterion for the grouping with the following example:
The diff value of row 5 is 6829 which is 10 times higher than the max value until that row (283). The result should be something like this df:

structure(list(vid = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), 
    .Label = "2a38ebc2-dd97-43c8-9726-59c247854df5", class = "factor"), 
    events = structure(c(3L, 2L, 4L, 1L, 3L, 2L, 4L, 1L, 3L, 
    2L, 4L, 1L, 3L, 2L, 4L, 1L, 3L, 2L, 4L, 1L), .Label = c("click", 
    "mousedown", "mousemove", "mouseup"), class = "factor"), 
    deltas = structure(6:25, .Label = c("154875", "154878", "154880", 
    "155866", "155870", "38479", "38488", "38492", "38775", "45595", 
    "45602", "45606", "45987", "50280", "50285", "50288", "50646", 
    "54995", "55001", "55005", "55317", "59528", "59533", "59537", 
    "59921", "63392", "63403", "63408", "63822", "66706", "66710", 
    "66716", "67002", "73750", "73755", "73759", "74158", "77999", 
    "78003", "78006", "78076", "81360", "81367", "81371", "82381", 
    "93365", "93370", "93374", "93872"), class = "factor"), serial = c(1, 
    2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 
    19, 20), 
    diff = c(0, 9, 4, 283, 6820, 7, 4, 381, 4293, 5, 
    3, 358, 4349, 6, 4, 312, 4211, 5, 4, 384), 
    group = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5)), 
    .Names =  c("vid", "events", "deltas", "serial", "diff", "group"), 
    row.names = c(NA, 20L), class = "data.frame")

Any help greatly appreciated

How about `df$group <- cumsum(df$diff > 500) + 1` (your whatever criteria you specify). — Gopala, Apr 17 '16 at 15:10
It works! but I don't understand why :-) the cumsum is only getting larger as R process row further down the df ?! don't see how this works, but it did — Nir Regev, Apr 17 '16 at 15:41

score 0 · Answer 1 · answered Apr 17 '16 at 15:43

0

courtesy of the user Gopala : How about df$group <- cumsum(df$diff > 500) + 1 (your whatever criteria you specify). – Gopala 31 mins ago

answered Apr 17 '16 at 15:43

Nir Regev

135
1
2
7

score 0 · Accepted Answer · answered Apr 17 '16 at 16:53

Let me give you a bit more detail on why it works and how it works.

First, let us just add a column without the cumsum part:

df$tag <- df$diff > 500
head(df)
                                   vid    events deltas serial diff   tag
1 2a38ebc2-dd97-43c8-9726-59c247854df5 mousemove  38479      1    0 FALSE
2 2a38ebc2-dd97-43c8-9726-59c247854df5 mousedown  38488      2    9 FALSE
3 2a38ebc2-dd97-43c8-9726-59c247854df5   mouseup  38492      3    4 FALSE
4 2a38ebc2-dd97-43c8-9726-59c247854df5     click  38775      4  283 FALSE
5 2a38ebc2-dd97-43c8-9726-59c247854df5 mousemove  45595      5 6820  TRUE
6 2a38ebc2-dd97-43c8-9726-59c247854df5 mousedown  45602      6    7 FALSE

As you can see, it simply creates a logical of TRUE/FALSE values in the tag column that says whether or not the difference is 'big enough' (based on selected threshold).

Now, when you do cumsum on that column and store it in group column, it will keep cumulatively adding. Every TRUE value will increment the cumulative sum by 1 and every FALSE value will keep the cumulative sum the same as it was before that row was hit.

So, this will give you the desired incrementing group values:

df$group <- cumsum(df$tag)
head(df)
                                   vid    events deltas serial diff   tag group
1 2a38ebc2-dd97-43c8-9726-59c247854df5 mousemove  38479      1    0 FALSE     0
2 2a38ebc2-dd97-43c8-9726-59c247854df5 mousedown  38488      2    9 FALSE     0
3 2a38ebc2-dd97-43c8-9726-59c247854df5   mouseup  38492      3    4 FALSE     0
4 2a38ebc2-dd97-43c8-9726-59c247854df5     click  38775      4  283 FALSE     0
5 2a38ebc2-dd97-43c8-9726-59c247854df5 mousemove  45595      5 6820  TRUE     1
6 2a38ebc2-dd97-43c8-9726-59c247854df5 mousedown  45602      6    7 FALSE     1

Notice that the group value starts at zero. Since cumulative sum of the first few FALSE values is zero. But, you may want your group identifiers to start with 1 instead. So, I added a 1 to the cumsum, but you can also do it as follows as an extra step.

df$group <- df$group + 1
head(df)
                                   vid    events deltas serial diff   tag group
1 2a38ebc2-dd97-43c8-9726-59c247854df5 mousemove  38479      1    0 FALSE     1
2 2a38ebc2-dd97-43c8-9726-59c247854df5 mousedown  38488      2    9 FALSE     1
3 2a38ebc2-dd97-43c8-9726-59c247854df5   mouseup  38492      3    4 FALSE     1
4 2a38ebc2-dd97-43c8-9726-59c247854df5     click  38775      4  283 FALSE     1
5 2a38ebc2-dd97-43c8-9726-59c247854df5 mousemove  45595      5 6820  TRUE     2
6 2a38ebc2-dd97-43c8-9726-59c247854df5 mousedown  45602      6    7 FALSE     2

Hope this helps.

Add conditional group identifier using rollup functions

2 Answers2