I'm working with a very large tibble and want to calculate the % of growth of those tables over time (first entry to last entry, not max to min). I would also ultimately want to store any tables with 0 change to their own list/tibble but remove them from the original output table.
an example of the dataset looks like this:
date tbl_name row_cnt
2/12/2019 first 247
6/5/2019 first 247
4/24/2019 second 3617138
6/5/2019 second 3680095
3/1/2019 third 62700321
6/5/2019 third 63509189
4/24/2019 fourth 2
6/5/2019 fourth 2
... ... ...
and the expected output of the table would be two tables that would appear as such:
tbl_name pct_change
second 1.74
third 1.29
... ...
tbl_name
first
fourth
...
so far I have been able to arrange the observations, group them, and filter the first and last instance of each group successfully with:
test_df <- df %>%
arrange(l.fully_qualf_tbl_nm) %>%
group_by(l.fully_qualf_tbl_nm) %>%
filter(row_number()==1 | row_number()==n()) %>%
mutate(pct_change = ((l.row_cnt/lag(l.row_cnt) - 1) * 100)) %>%
select(l.run_dt, l.fully_qualf_tbl_nm, pct_change) %>%
drop_na(pct_change)
but my calculation
mutate(pct_change = ((l.row_cnt/lag(l.row_cnt) - 1) * 100)) %>%
is not generating the right results. I pulled my pct-change calculation from another SO post that discusses %-change but i'm getting different numbers from my hand-calculations.
For example, I'm getting "second = 3.61" but a hand-calculation (as well as excel) gets 1.74. I'm also getting "third = 0.831" instead of 1.29 by-hand. My guess is that I'm not properly specifying that I only want the calculation done on each group (each pair of two rows). I'm wondering if I should be calculating lag separately or if I am just implementing lag() incorrectly?
next, I think the new table would be created with some manner of
if return value of filter(row_number()==1 | row_number()==n()) %>% == 0, append to list/table
but I honestly, have no clue how to do this. I'm wondering if I should just do a separate function and assign it to a new variable.