I am struggling to find a solution to a seemingly simple task that needs to run over 10 million records.
Assume the following data set:
mydf <- structure(list(group_ID = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 4, 4,
4, 4, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7,
7, 7, 7, 7, 7, 7, 7, 7, 7, 8, 8, 8, 8, 8, 9, 9, 9, 9, 9, 9, 9,
9), element_index= c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L,
12L, 13L, 14L, 15L, 16L, 17L, 18L, 1L, 2L, 3L, 4L, 5L, 6L, 7L,
8L, 9L, 1L, 2L, 3L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 1L, 2L, 3L, 4L,
5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 1L, 2L, 3L, 4L, 5L, 6L,
7L, 8L, 9L, 10L, 11L, 12L, 1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L, 4L,
5L, 6L, 7L, 8L), value= c(8045762L, 259L, 155L, 167L,
110L, 175L, 135L, 0L, 0L, 0L, 0L, 150L, 0L, 0L, 115L, 0L, 0L,
396L, 11175L, 0L, 0L, 0L, 261L, 0L, 170L, 0L, 576L, 5807L, 0L,
280L, 48663L, 0L, 0L, 497L, 7298L, 0L, 441L, 160725L, 0L, 0L,
0L, 0L, 335L, 0L, 0L, 0L, 0L, 0L, 0L, 356L, 35462L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 265L, 0L, 0L, 360L, 780L, 0L, 0L, 0L, 371L, 48394L,
0L, 0L, 0L, 341L, 0L, 0L, 386L)), .Names = c("group_ID", "element_index",
"value"), class = "data.frame", row.names = c(NA, 75L))
Basically, the main concepts are that:
1. the first element of each group_ID is always assigned subgroup_ID == 1;
2. elements with value == 0 must not be counted when increasing the subgroup_ID;
3. the subgroup_ID starts from 1 at the second element with value != 0 and increases by 1 each time there is another element with value != 0;
4. elements with value == 0 are assigned to the next element with value != 0. Observing the picture, this means that elements 2 and 3 are assigned the subgroup_ID of element 4.
The expected result is the following:
subgroup_ID = c(1,1,2,3,4,5,6,7,7,7,7,7,8,8,8,9,9,9,1,1,1,1,1,2,2,3,3,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,1,1,1,1,1,1,1,1,1,2,2,2,1,1,1,1,1,1,1,1,1,1,2,2,2)
solution_df <- data.frame(mydf, subgroup_ID)
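For what it is worth, the expected vector above seems to match a simple per-group rule: count the non-zero values strictly before each row, with a floor of 1. Below is a minimal base-R sketch of that reading on a toy subset (groups 2 and 3 of mydf). It assumes the rows are already ordered by element_index within each group; toy and assign_subgroup are made-up names used only for illustration.

```r
# Toy subset shaped like mydf (values of groups 2 and 3 from the data above).
toy <- data.frame(
  group_ID = c(2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3),
  value    = c(11175L, 0L, 0L, 0L, 261L, 0L, 170L, 0L, 576L, 5807L, 0L, 280L)
)

# Hypothetical helper: for the values of ONE group (in element order),
# count the non-zero values strictly before each row, then floor at 1
# so that the first element and the second non-zero element both get 1.
assign_subgroup <- function(v) {
  nonzero_before <- cumsum(c(0L, head(v != 0, -1)))
  pmax(1L, nonzero_before)
}

# Apply the helper within each group_ID.
toy$subgroup_ID <- ave(toy$value, toy$group_ID, FUN = assign_subgroup)
toy$subgroup_ID
# 1 1 1 1 1 2 2 3 3 1 1 1
```

On the full data this would be `ave(mydf$value, mydf$group_ID, FUN = assign_subgroup)`. For 10 million records a grouped data.table assignment (`:=` with `by = group_ID`) would probably scale better than ave, but I have not benchmarked it.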
The objective of this question is to assign a subgroup_ID that divides each group into segments, where the rule to create the subgroup_ID is the following:
- the first element of each group_ID is always 1
- the subgroup_ID increases by 1 each time there is an element with value != 0
I hope the question is clear; please do not hesitate to ask for clarification.