Create sequential subgroup_ID within each group_ID depending on a column

Question

I am struggling to find a solution to a very simple task that needs to run over 10 million records.

Assuming the following data set:

mydf <- structure(list(group_ID = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 4, 4, 
4, 4, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, 
7, 7, 7, 7, 7, 7, 7, 7, 7, 8, 8, 8, 8, 8, 9, 9, 9, 9, 9, 9, 9, 
9), element_index= c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 
12L, 13L, 14L, 15L, 16L, 17L, 18L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 
8L, 9L, 1L, 2L, 3L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 1L, 2L, 3L, 4L, 
5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 1L, 2L, 3L, 4L, 5L, 6L, 
7L, 8L, 9L, 10L, 11L, 12L, 1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L, 4L, 
5L, 6L, 7L, 8L), value= c(8045762L, 259L, 155L, 167L, 
110L, 175L, 135L, 0L, 0L, 0L, 0L, 150L, 0L, 0L, 115L, 0L, 0L, 
396L, 11175L, 0L, 0L, 0L, 261L, 0L, 170L, 0L, 576L, 5807L, 0L, 
280L, 48663L, 0L, 0L, 497L, 7298L, 0L, 441L, 160725L, 0L, 0L, 
0L, 0L, 335L, 0L, 0L, 0L, 0L, 0L, 0L, 356L, 35462L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 265L, 0L, 0L, 360L, 780L, 0L, 0L, 0L, 371L, 48394L, 
0L, 0L, 0L, 341L, 0L, 0L, 386L)), .Names = c("group_ID", "element_index", 
"value"), class = "data.frame", row.names = c(NA, 75L))

Basically, the main concepts are:
1. the first element of each group_ID always has subgroup_ID == 1;
2. elements with value == 0 must not be considered when increasing the subgroup_ID;
3. the subgroup_ID starts from 1 at the second element with value != 0 and increases by 1 each time there is another value != 0;
4. elements with value == 0 are associated with the next element with value != 0. Observing the picture, this means that elements 2 and 3 are assigned to the subgroup_ID of element 4.

The expected result is the following:

subgroup_ID = c(1,1,2,3,4,5,6,7,7,7,7,7,8,8,8,9,9,9,1,1,1,1,1,2,2,3,3,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,1,1,1,1,1,1,1,1,1,2,2,2,1,1,1,1,1,1,1,1,1,1,2,2,2)
solution_df <- data.frame(mydf, subgroup_ID)

The objective of this question is to assign a subgroup_ID that divides each group into segments, where the rule to create the subgroup_ID is the following:
- the first element of each group_ID is always 1
- the subgroup_ID increase by 1 each time there is an element with value != 0

I hope the question is clear; please do not hesitate to ask for clarification.

I think your explanation doesn't match your graph and your desired output. At the end you mentioned "the subgroup_ID increase by 1 each time there is an element with value != 0", but before that you said "start from 1 at the second value != 0 and increase by 1 each time there is another value != 0", which matches your graph and desired output. – AntoniosK Dec 11 '17 at 14:58
Still a bit tricky, as sometimes `0` values increase the ID. For example, in the first `group_ID`, row 8 goes from 6 to 7 and then remains 7... – AntoniosK Dec 11 '17 at 15:58
You are right, this was a factor I introduced in the image but did not describe properly in the text. I need to clarify that elements with `value == 0` are associated with the `subgroup_ID` of the next element with `value != 0`. Considering the picture, this means that elements 2 and 3 belong to the `subgroup_ID` of element 4, and elements with `value == 0` between elements 4 and 5 belong to the `subgroup_ID` of element 5. – Seymour Dec 11 '17 at 16:08

G. Grothendieck · Answer 1 · 2017-12-11T15:56:45.610

Here we are assuming that the rule for any group is to replace the second non-zero element of value with 0 and then form the result by starting with 1 and incrementing by 1 each time we encounter a subsequent non-zero.

Since the first element of value in each group is always non-zero according to the comment, we can find the second non-zero by temporarily replacing the first element with zero and then searching for the first non-zero in what is left.

No packages are used.

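Seq <- function(x) {
     # replace(x, 1, 0) hides the first element, so the first hit of which()
     # is the position of the second non-zero; set that element to 0
     x[head(which(replace(x, 1, 0) != 0), 1)] <- 0
     cumsum(x != 0)   # running count of the remaining non-zeros
}
transform(mydf, subid = ave(value, group_ID, FUN = Seq))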

giving the same answer as shown in the question at the time (output for the original 12-row sample data):

   group_ID element_index value subid
1         1             1   123     1
2         1             2     0     1
3         1             3     0     1
4         1             4   456     1
5         1             5   214     2
6         2             1    20     1
7         2             2     0     1
8         2             3    30     1
9         3             1    10     1
10        3             2     0     1
11        3             3    10     1
12        3             4    20     2
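
To see how `Seq` locates the second non-zero element, its steps can be traced by hand on the first group of the 12-row sample above. This is only an illustrative sketch; the intermediate names `masked`, `second` and `ids` are not part of the answer.

```r
# Trace of the steps inside Seq() on group_ID == 1 of the 12-row sample
x <- c(123, 0, 0, 456, 214)

masked <- replace(x, 1, 0)             # hide the first element: 0 0 0 456 214
second <- head(which(masked != 0), 1)  # position of the second non-zero in x
x[second] <- 0                         # x is now 123 0 0 0 214
ids <- cumsum(x != 0)                  # running count of non-zeros

print(second)
print(ids)
```

If a group has no second non-zero element, `which()` returns an empty vector and the assignment leaves `x` unchanged, so the whole group gets ID 1.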

edited Dec 11 '17 at 15:56

answered Dec 11 '17 at 14:50

Thank you for your prompt reply; however, the result is different from `solution_df`. Please refer to the image I posted to explain the elements belonging to `group_ID == 1`. It shows that the first element always has `value != 0` and always gets `subgroup_ID == 1`. Then the sequential indexing starts to increase from the second element with `value != 0`. It is very important to bear in mind that the second element with `value != 0` has `subgroup_ID == 1` as well; then, **starting from here**, each subsequent element with `value != 0` increases the `subgroup_ID` by 1. – Seymour Dec 11 '17 at 14:58
I hope I clarified – Seymour Dec 11 '17 at 14:59
I do not understand why, but this solution associates the elements with `value == 0` with the previous point instead of the next point, as shown in the picture. Why? – Seymour Dec 11 '17 at 15:53
Considering the picture, assume there are three elements with `value == 0` between point 4 and point 5. This code associates those three elements with the `subgroup_ID` of point 4 instead of assigning them to the `subgroup_ID` of point 5. Did I clarify? – Seymour Dec 11 '17 at 15:58
I have clarified in the answer that my understanding is that what you want is to replace the second non-zero in value with 0 and then perform a cumulative sum which starts at 1 and increments by 1 each time there is a non-zero. – G. Grothendieck Dec 11 '17 at 16:09
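
For the other reading raised in these comments, where elements with `value == 0` join the subgroup of the next non-zero element, a base R variant might look like the sketch below. It is not the answerer's code: `next_seq` is a hypothetical helper name, and the rule it implements (ID = 1 + number of non-zeros strictly before the element, ignoring the first element) is inferred from the expected `subgroup_ID` in the question.

```r
# Sketch: zeros are assigned to the subgroup of the NEXT non-zero element.
# next_seq is a hypothetical name, not part of the original answer.
next_seq <- function(x) {
  nz <- x != 0
  nz[1] <- FALSE                       # the first element never opens a new subgroup
  # ID = 1 + number of counted non-zeros strictly before the current element
  1 + cumsum(c(FALSE, head(nz, -1)))
}

# value column for group_ID == 1 in the 75-row data from the question
value <- c(8045762, 259, 155, 167, 110, 175, 135, 0, 0, 0, 0,
           150, 0, 0, 115, 0, 0, 396)
res <- next_seq(value)
print(res)
```

It would be applied per group in the same way as `Seq`, for example `transform(mydf, subgroup_ID = ave(value, group_ID, FUN = next_seq))`.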

Roman · Answer 2 · 2017-12-11T15:29:32.437

You can also try a tidyverse solution:

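library(tidyverse)
mydf %>%
  group_by(group_ID) %>%
  mutate(value2=ifelse(row_number() == 1, 0, value)) %>%   # ignore the first element of each group
  mutate(subgroup_ID=lag(value2, default = 0) > 0) %>%     # TRUE when the previous element is > 0
  mutate(subgroup_ID=cumsum(subgroup_ID)+1) %>%            # running count of flags, starting at 1
  select(-value2)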
# A tibble: 12 x 4
# Groups:   group_ID [3]
   group_ID element_index value subgroup_ID
      <dbl>         <dbl> <dbl>       <dbl>
 1        1             1   123           1
 2        1             2     0           1
 3        1             3     0           1
 4        1             4   456           1
 5        1             5   214           2
 6        2             1    20           1
 7        2             2     0           1
 8        2             3    30           1
 9        3             1    10           1
10        3             2     0           1
11        3             3    10           1
12        3             4    20           2

Thank you for your reply @Jimbou. Unfortunately, this solution is not identical to `solution_df` either. Specifically: (A) element 4 of `group_ID` 1 should be `subgroup_ID == 1`; (B) element 3 of `group_ID` 2 should be `subgroup_ID == 1`; (C) element 3 of `group_ID` 3 should be `subgroup_ID == 1`. Please refer to my comment to Grothendieck. I will now try to edit the question as well. – Seymour Dec 11 '17 at 15:03

AntoniosK · Answer 3 · 2017-12-11T15:07:21.357

group_ID <- c(1,1,1,1,1,2,2,2,3,3,3,3)
element_index <- c(1, 2, 3, 4, 5, 1, 2, 3, 1, 2, 3, 4)  # elements are ordered within each group_ID
value <- c(123, 0, 0, 456, 214, 20, 0, 30, 10, 0, 10, 20)
mydf <- data.frame(group_ID, element_index, value)


library(dplyr)

mydf %>%
  group_by(group_ID) %>%
  mutate(v_upd = cumsum(ifelse(value * lag(value, default = 0) != 0, 1, 0)) + 1) %>%
  ungroup()

# # A tibble: 12 x 4
#   group_ID element_index value v_upd
#      <dbl>         <dbl> <dbl> <dbl>
# 1        1             1   123     1
# 2        1             2     0     1
# 3        1             3     0     1
# 4        1             4   456     1
# 5        1             5   214     2
# 6        2             1    20     1
# 7        2             2     0     1
# 8        2             3    30     1
# 9        3             1    10     1
# 10       3             2     0     1
# 11       3             3    10     1
# 12       3             4    20     2

In order to better understand the process check this (similar) one that stores each step as a variable:

mydf %>%
  group_by(group_ID) %>%                             # for each group ID
  mutate(lag1_value = lag(value, default = 0)) %>%   # get the previous value of "value"
  mutate(v = ifelse(value * lag1_value != 0, 1, 0),  # flag as 1 when both current and previous values are non-zero
         v_upd = cumsum(v)+1) %>%                    # get cumulative sum of flags and add 1
  ungroup()                                          # forget the grouping

# # A tibble: 12 x 6
#   group_ID element_index value lag1_value     v v_upd
#      <dbl>         <dbl> <dbl>      <dbl> <dbl> <dbl>
# 1        1             1   123          0     0     1
# 2        1             2     0        123     0     1
# 3        1             3     0          0     0     1
# 4        1             4   456          0     0     1
# 5        1             5   214        456     1     2
# 6        2             1    20          0     0     1
# 7        2             2     0         20     0     1
# 8        2             3    30          0     0     1
# 9        3             1    10          0     0     1
# 10       3             2     0         10     0     1
# 11       3             3    10          0     0     1
# 12       3             4    20         10     1     2

You have 10m records, so I suggest you try all the answers you get and use the fastest one :-) – AntoniosK Dec 11 '17 at 15:08
Unfortunately the solution does not work on the real data. Apparently it does not index the first group_ID correctly and then it breaks. I improved the quality of the sample data. – Seymour Dec 11 '17 at 15:33
Will have a look. But it looks like the dataset has 75 rows and you've posted 31 new group IDs... – AntoniosK Dec 11 '17 at 15:38
Yes, sorry, my connection broke; here is the solution for these 75 rows. – Seymour Dec 11 '17 at 15:43
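
The mismatch reported in these comments can be reproduced on a single group of the 75-row data. The sketch below re-expresses the flag `value * lag(value, default = 0) != 0` in base R; using `c(0, head(v, -1))` as a stand-in for dplyr's `lag()` is an assumption made only to keep the example package-free.

```r
# value column for group_ID == 2 in the 75-row data from the question
v <- c(11175, 0, 0, 0, 261, 0, 170, 0, 576)

prev  <- c(0, head(v, -1))              # previous value within the group, 0 for the first row
v_upd <- cumsum(v * prev != 0) + 1      # increments only when two non-zeros are adjacent

expected <- c(1, 1, 1, 1, 1, 2, 2, 3, 3)  # subgroup_ID given in the question for this group
print(v_upd)
print(v_upd == expected)
```

In this group every non-zero value is preceded by a zero, so the product is always 0 and `v_upd` never increases, whereas the expected IDs go up after each non-zero element following the first.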
