Count multiple columns with categorical data after being grouped by another column

Question

I have a large data set (49000 X 118) and what I would like to do is I want to group by one column then have the summary of multiple columns. The issue with my data is that the summary of each column has a different length.

Here is a simple example of my data

dat<- data.frame(test_number= as.factor(c("test1", "test1", "test1","test1","test1","test1", "test2","test2","test2", "test3","test3","test3","test3","test3","test3")), 
                 question1_response= as.factor(c("yes", NA, "no","not answered", "yes", "yes", NA, "no","yes","yes","yes","yes","yes","yes","yes")),
                 question2_response= as.factor(c("yes","yes","yes","yes","yes","yes","yes","yes","yes","yes","yes","yes","yes","yes","no")),
                 question3_response= as.factor(c("yes", NA, "no","yes", NA, "no","yes", NA, "no","yes", NA, "no","yes", NA, "no")))

I would like to group by test_number and get a summary of each response in columns 2:4

some of the codes I have tried:

summary1<- dat %>%
  group_by(test_number) %>%
  group_map(~summarize(.x, across(everything(), summary)))

lapply(dat[-1], 
       FUN = function(x) { group_by(test_number) %>% summary(factor(x)) })

dat %>%
  group_by(test_number) %>%
  lapply(dat[c(2:4)], FUN = function(x) summary(x))

I expect the result to be something like this (i did it in excel)

enter image description here

I replaced the unequal column length with NAs but i am not particular about the structure as long as i get the information.

Thank you

score 0 · Accepted Answer · answered Mar 13 '23 at 08:54

You can stack all the question responses into a single column, and transform to wide with values_fn = length for counting.

library(tidyr)

dat %>%
  pivot_longer(-test_number) %>%
  pivot_wider(id_cols = c(test_number, value), names_from = name,
              values_from = name, values_fn = length, values_fill = 0)

# # A tibble: 10 × 5
#    test_number value        question1_response question2_response question3_response
#    <fct>       <fct>                     <int>              <int>              <int>
#  1 test1       yes                           3                  6                  2
#  2 test1       NA                            1                  0                  2
#  3 test1       no                            1                  0                  2
#  4 test1       not answered                  1                  0                  0
#  5 test2       NA                            1                  0                  1
#  6 test2       yes                           1                  3                  1
#  7 test2       no                            1                  0                  1
#  8 test3       yes                           6                  5                  2
#  9 test3       NA                            0                  0                  2
# 10 test3       no                            0                  1                  2

Thank you very much! it works beautifully for my data. Aside from my main question, would you mind explaining what the word "name" refers to in your code? i always see it and i tried to google it before but no luck. Thank you! — Chi, Mar 13 '23 at 09:08
@Chi After you run `dat %>% pivot_longer(-test_number)`, you can see that all the question responses are stacked, and the names of those stacked columns are stored in a new column named `name`. — Darren Tsai, Mar 13 '23 at 09:14

score 0 · Answer 2 · edited Mar 13 '23 at 09:34

Is this what you need?

library(tidyverse)

dat %>%
  pivot_longer(matches("question")) %>%
  group_by(test_number, name, value) %>%
  summarise(n = n(), .groups = "drop") %>%
  pivot_wider(values_from = n)

# A tibble: 10 × 5
   test_number value        question1_response question2_response question3_response
   <fct>       <fct>                     <int>              <int>              <int>
 1 test1       no                            1                 NA                  2
 2 test1       not answered                  1                 NA                 NA
 3 test1       yes                           3                  6                  2
 4 test1       NA                            1                 NA                  2
 5 test2       no                            1                 NA                  1
 6 test2       yes                           1                  3                  1
 7 test2       NA                            1                 NA                  1
 8 test3       yes                           6                  5                  2
 9 test3       no                           NA                  1                  2
10 test3       NA                           NA                 NA                  2

Thank you Chris. In my main data each column has a completely different name (drug names) bu i just changed it to pivot_longer(-test_number) and it was what i am looking for! really appreciated — Chi, Mar 13 '23 at 09:14
`group_by(test_number, name, value) %>% summarise(n = n())` can be reduced to `count(test_number, name, value)` — Darren Tsai, Mar 13 '23 at 09:23

Count multiple columns with categorical data after being grouped by another column

2 Answers2