How to split a dataframe into smaller ones using one unique variable?

Question

I am working with a teacher absences file. I need to flag any instance of 5+ consecutive absences.

I have a dataframe like the one below. How do I split this data into new dataframes by teacher name?

absences <- data.frame(
  staffid = c("123","456","789","101","121", "123", "123", "123", "123"), 
  name = c("Kara","Barbie","Sam","Jane","Chris", "Kara", "Kara", "Kara", "Kara"), 
  date = c(as.Date("2022-08-31"), as.Date("2022-09-01"), as.Date("2022-09-01"), 
           as.Date("2022-09-02"), as.Date("2022-09-07"), as.Date("2022-09-01"), 
           as.Date("2022-09-02"), as.Date("2022-09-06"), as.Date("2022-09-07")),
  schoolday = c(1, 2, 2, 3, 5, 2, 3, 4, 5))

I tried the code below:

absences_new <- absences %>% nest(.by = name)

which gave me a table with one row per name, each row holding a mini data frame.

Sorry, I am new to R and do not know what the mini-data frames in the output are called.

I also tried:

X <- split(absences, absences$name)

which gave me a result more like what I'm looking for, but it is in a format I don't know how to work with.

I also tried:

teachername <- (unique(absences$name))
splitdata <- split(absences, teachername)

but this gave me an error of "data length is not a multiple of split variable."

What I want for my output is something like what this would make:

Kara <- data.frame(
  staffid = c("123","123", "123", "123", "123"), 
  name= c("Kara", "Kara", "Kara", "Kara", "Kara"), 
  date = c(as.Date("2022-08-31"), as.Date("2022-09-01"),  
           as.Date("2022-09-02"), as.Date("2022-09-06"), 
           as.Date("2022-09-07")),
  schoolday = c(1, 2, 3, 4, 5))


Sam <- data.frame(
  staffid = c("789"), 
  name= c("Sam"), 
  date = c(as.Date("2022-09-01")),
  schoolday = c(2))

Then, my plan is to take these mini data frames and scan for any consecutive days.

Thank you!!

Already tried `split(absences, absences$name)`? If you really want to bloat the workspace with separate data frames look into `list2env` thereafter. — jay.sf, Mar 31 '23 at 21:13
The `split()` which you are attempting, and which is doable with the code that @jay.sf helpfully supplied for you, will not be necessary to complete your broader objective of identifying consecutive absences. You should provide an example dataset that actually has 5 consecutive absences. Or does `Kara` meet that criteria? — langtang, Mar 31 '23 at 21:14
Seems what you want is an intermediate step. What is the final output that you want? — Onyambu, Mar 31 '23 at 21:37
@JonSpring the code you gave will fail. Assume the teacher alternates such that after attending 2 days, he misses 4 days. Yours will flag this teacher as missing 5 consecutive days, but the teacher never missed 5 days consecutively. — Onyambu, Mar 31 '23 at 21:40
In addition to the other questions, how are you handling weekends? If a staff member is absent on a Friday and the following Monday, does the Monday count as a consecutive absence? — jdobres, Mar 31 '23 at 22:24
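For readers unsure how to work with the result of `split()`, here is a minimal sketch of what the first comment describes. It assumes the example data from the question; `by_teacher` and `teacher_env` are names made up for illustration.

```r
# Same example data as in the question
absences <- data.frame(
  staffid = c("123", "456", "789", "101", "121", "123", "123", "123", "123"),
  name = c("Kara", "Barbie", "Sam", "Jane", "Chris", "Kara", "Kara", "Kara", "Kara"),
  date = as.Date(c("2022-08-31", "2022-09-01", "2022-09-01", "2022-09-02",
                   "2022-09-07", "2022-09-01", "2022-09-02", "2022-09-06",
                   "2022-09-07")),
  schoolday = c(1, 2, 2, 3, 5, 2, 3, 4, 5)
)

# split() returns a named list with one data frame per teacher
by_teacher <- split(absences, absences$name)
names(by_teacher)        # the teacher names

# Each element is an ordinary data frame
by_teacher$Kara
by_teacher[["Sam"]]

# Apply a function to every teacher's data frame
sapply(by_teacher, nrow)

# Optional: copy each list element into its own variable.
# teacher_env is a hypothetical environment; passing globalenv() instead
# would create Kara, Sam, ... directly in the workspace.
teacher_env <- new.env()
list2env(by_teacher, envir = teacher_env)
```

Keeping the data frames in the list is usually easier than creating separate variables, because `lapply()` or `sapply()` can then process every teacher in one call.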

score 0 · Answer 1 · answered Mar 31 '23 at 21:33

If you are looking for consecutive integers in schoolday, you can write a function that returns TRUE if any run of consecutive integers has length n or more, like this:

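# TRUE if d contains a run of at least n consecutive integers (d must be sorted ascending)
consecutive_n <- function(d, n = 5) {
  # a new run starts wherever the step from the previous value is not 1
  run_id <- cumsum(c(1, diff(d) != 1))
  any(sapply(split(d, run_id), length) >= n)
}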
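To see why this works, it can help to run the pieces of `consecutive_n` one at a time. The sketch below uses a made-up vector of school-day numbers, not data from the question, and assumes the values are already sorted and unique.

```r
# Made-up school-day numbers, sorted ascending
d <- c(1, 2, 3, 5, 6, 9)

diff(d)            # 1 1 2 1 3: step between neighbouring days
diff(d) != 1       # FALSE FALSE TRUE FALSE TRUE: TRUE marks a break in the run

# Each break bumps the counter, so every run gets its own id
run_id <- cumsum(c(1, diff(d) != 1))             # 1 1 1 2 2 3

# Count how many days fall in each run
run_lengths <- sapply(split(d, run_id), length)  # 3 2 1

any(run_lengths >= 3)   # TRUE: days 1, 2, 3 form a run of three
any(run_lengths >= 5)   # FALSE: no run of five
```

Because the check relies on `diff()`, the input has to be in ascending order, which is why the data is sorted before the function is applied.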

Then apply it to each name:

library(dplyr)  # reframe() and .by require dplyr >= 1.1.0

absences %>% 
  arrange(date) %>%
  reframe(absent_5 = consecutive_n(schoolday), .by = name)

Output:

    name absent_5
1   Kara     TRUE
2 Barbie    FALSE
3    Sam    FALSE
4   Jane    FALSE
5  Chris    FALSE

score 0 · Answer 2 · answered Apr 01 '23 at 15:02

This code gets you datasets for each teacher:

library(tidyverse)

teacher_names <- absences %>% distinct(name) %>% pull()

getting_teacher_dfs <- function(teacher_name) {
  absences %>% 
    filter(name == teacher_name)
}

teacher_datasets <- map(teacher_names, getting_teacher_dfs)

You can then pull each dataset by position like this:

teacher_datasets[[1]]
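Pulling by position means remembering which index belongs to which teacher. One possible variation, sketched here on the assumption that dplyr and purrr are installed, is to name the list so each dataset can be pulled by teacher name.

```r
library(dplyr)
library(purrr)

# Same example data as in the question
absences <- data.frame(
  staffid = c("123", "456", "789", "101", "121", "123", "123", "123", "123"),
  name = c("Kara", "Barbie", "Sam", "Jane", "Chris", "Kara", "Kara", "Kara", "Kara"),
  date = as.Date(c("2022-08-31", "2022-09-01", "2022-09-01", "2022-09-02",
                   "2022-09-07", "2022-09-01", "2022-09-02", "2022-09-06",
                   "2022-09-07")),
  schoolday = c(1, 2, 2, 3, 5, 2, 3, 4, 5)
)

teacher_names <- absences %>% distinct(name) %>% pull()

# set_names() names the vector with its own values, and map() keeps
# those names on the resulting list
teacher_datasets <- map(
  set_names(teacher_names),
  function(teacher_name) filter(absences, name == teacher_name)
)

teacher_datasets[["Kara"]]   # pull by name instead of by position
```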

Overall, though, you should probably look into dplyr's group_by() instead of creating this many new datasets.
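As a rough sketch of that grouped approach, the code below computes the longest run of consecutive school days per teacher without creating separate data frames. `longest_run` is a hypothetical helper written for this example, and it assumes `schoolday` numbers school days only, so that weekends and holidays do not break a run.

```r
library(dplyr)

# Same example data as in the question
absences <- data.frame(
  staffid = c("123", "456", "789", "101", "121", "123", "123", "123", "123"),
  name = c("Kara", "Barbie", "Sam", "Jane", "Chris", "Kara", "Kara", "Kara", "Kara"),
  date = as.Date(c("2022-08-31", "2022-09-01", "2022-09-01", "2022-09-02",
                   "2022-09-07", "2022-09-01", "2022-09-02", "2022-09-06",
                   "2022-09-07")),
  schoolday = c(1, 2, 2, 3, 5, 2, 3, 4, 5)
)

# Hypothetical helper: length of the longest run of consecutive integers in d
longest_run <- function(d) {
  d <- sort(unique(d))                    # order matters for diff()
  run_id <- cumsum(c(1, diff(d) != 1))    # new id whenever the run breaks
  max(rle(run_id)$lengths)                # size of the largest run
}

streaks <- absences %>%
  group_by(name) %>%
  summarise(longest_streak = longest_run(schoolday), .groups = "drop") %>%
  mutate(absent_5 = longest_streak >= 5)

streaks
```

Returning the streak length rather than only TRUE or FALSE makes it easy to change the threshold later without recomputing the runs.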
