In R, how do I make a long dataset that contains all the categories that a row applies to?

Question

I have a dataset of students, and I want to create a column that categorizes each student based on their and other students' data in their "subject" and "grade_letter" column because I eventually want to create a column that says "math_science_A", "science_B", etc. to know how many students got As, Bs, Cs, etc. based on the unique make ups.

I have a list of all potential combinations that a student can have, but when I try to iterate over R, my case_when only gives me the first category. I then tried to make multiple columns, so I could pivot longer later, but that also seemed to fail. Any help would be appreciated!

#Here's my sample dataset
library(tidyverse)
test <- tibble(student_id = c(1),
               subject = c("math"),
               grade_num = c(95),
               grade = c("a"))

#Here's a sample of what I want (full desired output would be too long to write; see text below code chunks)
sample_of_desired_output <- tibble(student_id = c(1),
               subject = c("math"),
               grade_num = c(95),
               grade = c("a"),
category = c("n_math_a", "n_science_a", "n_english_a", "n_math_science_a", "n_math_english_a"))

#####Here's what I tried
#Getting all combinations--this part works
all_subjects <- c("math", "science", "english")

i = 1
while(i <= length(all_subjects)) {
  
  if(i == 1) {
    all_subject_combos <- c()
  } else {
    some_combos <- combn(all_subjects, i,  FUN = function(x) paste(x, collapse = "_"), simplify = TRUE)
    all_subject_combos <- append(all_subject_combos, some_combos)
  }
  
  i <- i +1
}

all_grades <- c("A", "B", "C", "D", "F")

all_subjects_and_grades <- expand.grid(all_subject_combos, all_grades)

subjects_and_grades_combos <- all_subjects_and_grades %>%
  unite("names", c(Var1, Var2)) %>%
  mutate(names =  tolower(paste0("n_", str_replace_all(names, "\\|", "_")))) %>%
  pull(names)


#This part does not work but is what I tried:

i = 1
while(i <= length(subjects_and_grades_combos)) {
  does_not_work <- test %>%
    mutate("subject_grade_grouping_{{i}}" := case_when(
      str_detect(subjects_and_grades_combos[i], subject) ~ subjects_and_grades_combos[i]))
  i <- i +1
}
does_not_work

Note that in the final, there will be all potential combinations, so each student will have their data repeating many times. After I have this, I will be using this to calculate the number of students who fall into each category. So, in this example, math_a, math_science_a, math_english_a, and math_science_english_a would all have 1. Note that I do not need help calculating that part, and I am flexible with the order of the columns.

@akrun, thank you for pointing that out--I fixed it and my test dataset. — J.Sabree, Jan 25 '23 at 22:30
Woulddn't you have to update the test object in each iteration or `does_not_work`. Here, it gets updated, but it returns only the last iteration changes i.e. `test1 <- test` (outside the loop) and inside the loop, `test1 <- test1 %>% mutate("subject_grade_grouping_{{i}}" := ...` — akrun, Jan 25 '23 at 22:45
@akrun, yea so my thought with the while argument is that it would iterate over each iteration of subjects_and_grades_combos and make 15 subject_grade_grouping_ columns (5 grades X 3 subjects) and then I could pivot_longer(). But, for some reason I can't figure out, it just gives 1 subject_grade_grouping_ column — J.Sabree, Jan 25 '23 at 22:47
If you make the changes I showed, it would return `dim(test1)# [1] 1 24` (based on your original data test with one row) — akrun, Jan 25 '23 at 22:48
I'm confused, you want to count (for instance) how many students got an A in math, but also how many got an A in both math and science? — Jon Spring, Jan 25 '23 at 22:49

score 1 · Accepted Answer · answered Jan 25 '23 at 22:51

The while loops needs to update the original object or else it gets the last iteration update

library(dplyr)
library(tidyr)
test1 <- test
 i = 1
while(i <= length(subjects_and_grades_combos)) {
  test1 <- test1 %>%
     mutate("subject_grade_grouping_{{i}}" := case_when(
       str_detect(subjects_and_grades_combos[i], subject) ~ subjects_and_grades_combos[i]))
   i <- i +1
 }

test1 %>% 
  pivot_longer(cols = starts_with('subject_grade'),
   names_to = NULL, values_to = "category", values_drop_na = TRUE)

-output

# A tibble: 15 × 5
   student_id subject grade_num grade category                
        <dbl> <chr>       <dbl> <chr> <chr>                   
 1          1 math           95 a     n_math_science_a        
 2          1 math           95 a     n_math_english_a        
 3          1 math           95 a     n_math_science_english_a
 4          1 math           95 a     n_math_science_b        
 5          1 math           95 a     n_math_english_b        
 6          1 math           95 a     n_math_science_english_b
 7          1 math           95 a     n_math_science_c        
 8          1 math           95 a     n_math_english_c        
 9          1 math           95 a     n_math_science_english_c
10          1 math           95 a     n_math_science_d        
11          1 math           95 a     n_math_english_d        
12          1 math           95 a     n_math_science_english_d
13          1 math           95 a     n_math_science_f        
14          1 math           95 a     n_math_english_f        
15          1 math           95 a     n_math_science_english_f

thank you! I was close. So is the secret that you need to define your object outside of the while loop to keep it from overriding? — J.Sabree, Jan 25 '23 at 22:59
@J.Sabree You could also use the `test` data. I just created a new object outside so that you can also have the original data if needed — akrun, Jan 25 '23 at 23:00
true, but the key is not to create a new object within the while loop to prevent it from overriding each time. Thanks! — J.Sabree, Jan 25 '23 at 23:02
@J.Sabree You are right. The key is assignment to original object — akrun, Jan 25 '23 at 23:03

score 0 · Answer 2 · answered Jan 25 '23 at 22:50

If you're just looking for how many students got each grade for each class:

library(dplyr); library(tidyr)
test %>%
  count(subject, grade) %>%
  complete(subject = c("math", "science", "english"),
           grade = c("a", "b", "c", "d", "f"), fill = list(n=0)) %>%
  pivot_wider(names_from = c(subject, grade), values_from = n)

# A tibble: 1 × 15
  english_a english_b english_c english_d english_f math_a math_b math_c math_d math_f science_a scienc…¹ scien…² scien…³ scien…⁴
      <int>     <int>     <int>     <int>     <int>  <int>  <int>  <int>  <int>  <int>     <int>    <int>   <int>   <int>   <int>
1         0         0         0         0         0      1      0      0      0      0         0        0       0       0       0
# … with abbreviated variable names ¹science_b, ²science_c, ³science_d, ⁴science_f

It sounds like you also want combinations of some sort, but it's not clear to me what you mean by that. Are we looking for the number of students who got an A in both math and science, or an A at least one of math and science? The sample data and output doesn't give a sense of what that should look like.

thank you for this! Yes, this gives me all the students for the individual subjects, but I also want the combinations, so math_science_a, math_science_english_a. For instance, if a student got an a in math and science, then math_science_a would be 2, etc. — J.Sabree, Jan 25 '23 at 22:55

In R, how do I make a long dataset that contains all the categories that a row applies to?

2 Answers2