0

I'm trying to modify a solution posted here Create cohort dropout rate table from raw data

I'd like to create a CUMULATIVE dropout rate table using these data.

DT<-data.table(
id =c (1,2,3,4,5,6,7,8,9,10,
     11,12,13,14,15,16,17,18,19,20,
     21,22,23,24,25,26,27,28,29,30,31,32,33,34,35),
year =c (2014,2014,2014,2014,2014,2014,2014,2014,2014,2014,
       2015,2015,2015,2015,2015,2015,2015,2015,2015,2015,
   2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016),
cohort =c(1,1,1,1,1,1,1,1,1,1,
        2,2,2,1,1,2,1,2,1,2,
        1,1,3,3,3,2,2,2,2,3,3,3,3,3,3))

So far, I've been able to get to this point

     library(tidyverse)

DT %>% 
  group_by(year) %>% 
  count(cohort) %>% 
  ungroup() %>% 
  spread(year, n) %>% 
  mutate(y2014_2015_dropouts = (`2014` - `2015`),
         y2015_2016_dropouts =  (`2015` - `2016`)) %>% 
  mutate(y2014_2015_cumulative =y2014_2015_dropouts/`2014`,
         y2015_2016_cumulative =y2015_2016_dropouts/`2014`+y2014_2015_cumulative)%>%


  replace_na(list(y2014_2015_dropouts = 0.0,
                  y2015_2016_dropouts = 0.0)) %>% 
  select(cohort, y2014_2015_dropouts, y2015_2016_dropouts, y2014_2015_cumulative,y2015_2016_cumulative )

A cumulative dropout rate table reflects the proportion of students within a class who dropped out of school across years.

     # A tibble: 3 x 5
  cohort y2014_2015_dropouts y2015_2016_dropouts y2014_2015_cumulative y2015_2016_cumulative
   <dbl>               <dbl>               <dbl>                 <dbl>                 <dbl>
1      1                   6                   2                   0.6                   0.8
2      2                   0                   2                  NA                    NA  
3      3                   0                   0                  NA                    NA  
>  

The last two columns of the tibble show that by the end of year 2014-2015, 60% of cohort 1 students dropped out; and by the end of year 2015-2016, 80% of cohort 1 students had dropped out.

I'd like to calculate the same for cohorts 2 and 3, but I don't know how to do it.

Chris
  • 353
  • 3
  • 9

2 Answers2

2

Here is an alternative data.table solution that keeps your data organized in a way that I find easier to deal with. Using your DT input data:

Organize and order by cohort and year:

DT2 <- DT[, .N, list(cohort, year)][order(cohort, year)]

Assign the year range:

DT2[, year := paste(lag(year), year, sep = "_"),]

Get dropouts per year

DT2[, dropouts := ifelse(!is.na(lag(N)), lag(N) - N, 0), , cohort, ]

Get the cumulative sum of proportion dropped out each year per cohort:

DT2[, cumul := cumsum(dropouts) / max(N), cohort]

Output:

> DT2
   cohort      year  N dropouts     cumul
1:      1   NA_2014 10        0 0.0000000
2:      1 2014_2015  4        6 0.6000000
3:      1 2015_2016  2        2 0.8000000
4:      2 2016_2015  6        0 0.0000000
5:      2 2015_2016  4        2 0.3333333
6:      3 2016_2016  9        0 0.0000000
Luke C
  • 10,081
  • 1
  • 14
  • 21
1

Because you spread your data by year early in your pipe and your 2014 columns have NA values for everything related to cohort 2, you need to coalesce the denominator in your calculation for y2015_2016_cumulative. If you replace the definition for that variable from the current

y2015_2016_cumulative =y2015_2016_dropouts/`2014`+y2014_2015_cumulative

to

y2015_2016_cumulative =y2015_2016_dropouts/coalesce(`2014`, `2015`) +
coalesce(y2014_2015_cumulative, 0)

you should be good to go. The coalesce function tries the first argument, but inputs the second argument if the first is NA. That being said, this current method isn't extremely scalable. You would have to add additional coalesce statements for every year you added. If you keep your data in the tidy format, you can keep a running list at the year-cohort level using

DT %>% 
group_by(year) %>% 
count(cohort) %>% 
ungroup() %>% 
group_by(cohort) %>% 
mutate(dropouts = lag(n) - n,
       dropout_rate = dropouts / max(n)) %>% 
replace_na(list(dropouts = 0, n = 0, dropout_rate = 0)) %>% 
mutate(cumulative_dropouts = cumsum(dropouts),
       cumulative_dropout_rate = cumulative_dropouts / max(n))
B. Sharp
  • 178
  • 8