R lead lag function summarize within group and calculate percent

Question

Here is what my data frame looks like:

And here is the dput structure of it.

structure(list(tier_1 = c("Organic Search", "Organic Search", 
"Organic Search", "Organic Search", "Organic Search", "Organic Search", 
"Organic Search", "Organic Search", "Organic Search", "Organic Search", 
"Organic Social", "Organic Social", "Organic Social", "Organic Social", 
"Organic Social", "Organic Social", "Organic Social", "Paid Search", 
"Paid Search", "Paid Search", "Paid Search", "Paid Search", "Paid Search", 
"Paid Search", "Paid Search", "Paid Search", "Paid Social", "Paid Social", 
"Paid Social", "Paid Social", "Paid Social", "Paid Social", "Paid Social", 
"Paid Social", "Paid Social"), sequence_number = c(1L, 2L, 3L, 
4L, 5L, 6L, 7L, 8L, 9L, 10L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 1L, 
2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 
9L), count_of_sequence_numbers = c(1176L, 460L, 119L, 41L, 21L, 
5L, 8L, 6L, 2L, 1L, 133L, 52L, 11L, 2L, 2L, 1L, 1L, 7516L, 1090L, 
284L, 90L, 36L, 21L, 12L, 6L, 2L, 1979L, 674L, 99L, 30L, 11L, 
2L, 3L, 2L, 1L), percent = c(0.637744034707158, 0.249457700650759, 
0.0645336225596529, 0.022234273318872, 0.0113882863340564, 0.0027114967462039, 
0.00433839479392625, 0.00325379609544469, 0.00108459869848156, 
0.000542299349240781, 0.655172413793103, 0.25615763546798, 0.0541871921182266, 
0.00985221674876847, 0.00985221674876847, 0.00492610837438424, 
0.00492610837438424, 0.827662151745402, 0.120030833608633, 0.0312740887567449, 
0.00991080277502478, 0.00396432111000991, 0.00231252064750578, 
0.0013214403700033, 0.000660720185001652, 0.000220240061667217, 
0.704019921736037, 0.23977232301672, 0.0352187833511206, 0.0106723585912487, 
0.00391319815012451, 0.000711490572749911, 0.00106723585912487, 
0.000711490572749911, 0.000355745286374956)), row.names = c(NA, 
-35L), groups = structure(list(tier_1 = c("Organic Search", "Organic Social", 
"Paid Search", "Paid Social"), .rows = structure(list(1:10, 11:17, 
    18:26, 27:35), ptype = integer(0), class = c("vctrs_list_of", 
"vctrs_vctr", "list"))), row.names = c(NA, -4L), class = c("tbl_df", 
"tbl", "data.frame"), .drop = TRUE), class = c("grouped_df", 
"tbl_df", "tbl", "data.frame"))

df <- df %>% 
  group_by(tier_1, sequence_number) %>%
  summarize(count_of_sequence_numbers = length(sequence_number)) %>%
  mutate(percent = count_of_sequence_numbers / sum(count_of_sequence_numbers)) %>%
  filter(sequence_number <= 10)

I was able to come up with the percent column by using the code above, specifically the part about the count / sum(count).

I do have a problem, though, which is that the percent is incorrect. The values from count_of_sequence_numbers, when referring to sequence_number = 2 should be subtracted from the values from count_of_sequence_numbers when sequence_number = 1 (within the same category). And everything from count_of_sequence_numbers, when referring to sequence_number = 3 should be subtracted from both the count_of_sequence_numbers when sequence_number = 2 and when sequence_number = 3.

What I'm saying is, I really need a count of sequence numbers that, for sequence_number = 1, doesn't include 2-10, and when it's 2, doesn't include 3-10, etc. The 1176 value should really be 1176 - 460 - 119 - 41 - 21 - 5 -8 - 6 -2 -1. And the 460 value should be 460 - 119 - 41 - 21 - 5 -8 - 6 -2 -1. And then the percent should be calculated off that.

I tried a lead function but I just don't think this can be the efficient way. :/ That -1175 number in particular is making me nervous.

df <- df %>%
    group_by(tier_1) %>%
    arrange(tier_1, sequence_number) %>%
    mutate(diff = count_of_sequence_numbers - lead(count_of_sequence_numbers, default = first(count_of_sequence_numbers)))

If I change to lead(count_of_sequence_numbers, default = 0)) I get better behavior but it's still not quite what I'm trying to do, which is to subtract the value by the sum of all the others in the same group that have a greater sequence number.

Change to `lead(count_of_sequence_numbers, default = 0))` since the last item should be comparing against zero instead of the max count. — Jon Spring, Jun 12 '21 at 16:58
Thank you for the feedback. This behavior is definitely better but it's still not quite what I'm trying to do. I updated the post to show your answer in action — hachiko, Jun 12 '21 at 17:07

score 1 · Accepted Answer · answered Jun 12 '21 at 17:18

Is this the output you're looking for?

df %>%
  arrange(tier_1, -sequence_number) %>%
  group_by(tier_1) %>%   # already grouped this way, only including for clarity
  mutate(cuml = cumsum(lag(count_of_sequence_numbers, default = 0)),
         diff = count_of_sequence_numbers - cuml) %>%
  ungroup()


## A tibble: 35 x 6
#   tier_1         sequence_number count_of_sequence_numbers  percent  cuml  diff
#   <chr>                    <int>                     <int>    <dbl> <dbl> <dbl>
# 1 Organic Search              10                         1 0.000542     0     1
# 2 Organic Search               9                         2 0.00108      1     1
# 3 Organic Search               8                         6 0.00325      3     3
# 4 Organic Search               7                         8 0.00434      9    -1
# 5 Organic Search               6                         5 0.00271     17   -12
# 6 Organic Search               5                        21 0.0114      22    -1
# 7 Organic Search               4                        41 0.0222      43    -2
# 8 Organic Search               3                       119 0.0645      84    35
# 9 Organic Search               2                       460 0.249      203   257
#10 Organic Search               1                      1176 0.638      663   513
## … with 25 more rows

I think the math works really great thank you for this code. I'm super confused because I thought that sequence number could only be 10 if there was a corresponding sequence_number = 9 ... which can't be true, I don't think, if there are negative numbers here. I don't think that's your fault, though — hachiko, Jun 12 '21 at 17:32

R lead lag function summarize within group and calculate percent

1 Answers1