0

I'm working with a very large tibble and want to calculate the % of growth of those tables over time (first entry to last entry, not max to min). I would also ultimately want to store any tables with 0 change to their own list/tibble but remove them from the original output table.

an example of the dataset looks like this:

  date    tbl_name    row_cnt
2/12/2019  first       247
6/5/2019   first       247
4/24/2019  second    3617138
6/5/2019   second    3680095
3/1/2019   third    62700321
6/5/2019   third    63509189
4/24/2019  fourth       2
6/5/2019   fourth       2
...          ...       ...

and the expected output of the table would be two tables that would appear as such:

tbl_name   pct_change
second       1.74
third        1.29
...          ...


tbl_name
 first
 fourth
  ...

so far I have been able to arrange the observations, group them, and filter the first and last instance of each group successfully with:

test_df <- df %>% 
  arrange(l.fully_qualf_tbl_nm) %>% 
  group_by(l.fully_qualf_tbl_nm) %>%
  filter(row_number()==1 | row_number()==n()) %>%
  mutate(pct_change = ((l.row_cnt/lag(l.row_cnt) - 1) * 100)) %>% 
  select(l.run_dt, l.fully_qualf_tbl_nm, pct_change) %>% 
  drop_na(pct_change)

but my calculation

mutate(pct_change = ((l.row_cnt/lag(l.row_cnt) - 1) * 100)) %>%

is not generating the right results. I pulled my pct-change calculation from another SO post that discusses %-change but i'm getting different numbers from my hand-calculations.

For example, I'm getting "second = 3.61" but a hand-calculation (as well as excel) gets 1.74. I'm also getting "third = 0.831" instead of 1.29 by-hand. My guess is that I'm not properly specifying that I only want the calculation done on each group (each pair of two rows). I'm wondering if I should be calculating lag separately or if I am just implementing lag() incorrectly?

next, I think the new table would be created with some manner of

if return value of filter(row_number()==1 | row_number()==n()) %>% == 0, append to list/table

but I honestly, have no clue how to do this. I'm wondering if I should just do a separate function and assign it to a new variable.

Nick Bohl
  • 105
  • 3
  • 13
  • Can you explain more why you want two tables and what the difference is? – Jon Spring Jun 26 '19 at 17:08
  • The first output will show all table names in our dataset that grew throughout the tracked period of time and the percent of growth among those tables. The second output would simply print out the table names that didn't grow at all, just so all table names in the dataset are accounted for. – Nick Bohl Jun 26 '19 at 17:16
  • Please see answer below. I think the problem came from filtering your rows too early -- you can't calculate growth from period 1 to period 2 if you've removed period 1 already. – Jon Spring Jun 26 '19 at 17:27

1 Answers1

1
df <- read.table(
  header = T, 
  stringsAsFactors = F,
  text = " date    tbl_name    row_cnt
2/12/2019  first       247
6/5/2019   first       247
4/24/2019  second    3617138
6/5/2019   second    3680095
3/1/2019   third    62700321
6/5/2019   third    63509189
4/24/2019  fourth       2
6/5/2019   fourth       2")

# Wrapping in parentheses assigns the output to test_df and also prints it
(test_df <- df %>% 
    group_by(tbl_name) %>%
    mutate(pct_change = ((row_cnt/lag(row_cnt) - 1) * 100)) %>% 
    ungroup() %>%
    filter(!is.na(pct_change)) %>%  # Filter after pct_change calc, since we want to 
                                    # include change from 1:2  and from n-1:n
    select(tbl_name, row_cnt, pct_change))

# A tibble: 4 x 3
  tbl_name  row_cnt pct_change
  <chr>       <int>      <dbl>
1 first         247       0   
2 second    3680095       1.74
3 third    63509189       1.29
4 fourth          2       0  

To split into two tables, it seems one could do:

first_tbl <- test_df %>% filter(pct_change != 0) # or "pct_change > 0" for pos growth
second_tbl <- test_df %>% filter(pct_change == 0)
Jon Spring
  • 55,165
  • 4
  • 35
  • 53
  • Thank you. I believe this is the right answer. Something weird is happening earlier in my script now, as I'm getting extremely odd %-growth calcs (the real version of my 'second' observation is coming to -98.5% growth when it should be 0) but that is something I'll need to figure out on my own. thank you for your help. – Nick Bohl Jun 26 '19 at 18:19
  • Actually, it looks like filtering AFTER the pct_change calc was the problem. if you filter, mutate, then ungroup(), the code runs in the manner I wanted it to. I'm not sure why it behaves differently with my real and example dataset, but ultimately I have the data I need. Thank you. – Nick Bohl Jun 26 '19 at 18:30