Using the Tidyverse to the count the frequency of a specific string in REDCap data across multiple columns

Question

I frequently receive data from REDCap surveys in which respondents are permitted to "Check" >1 response to a survey question. Each potential response is housed in its own column. I'd like to summarize the frequency with which each response option (column) has been checked. For example:

library(tidyverse)
set.seed(1234)
responses<-c("Checked", "Unchecked")
numobs<-10

my_example<-data.frame(id=1:10, 
                       Response_Option_A=sample(responses, numobs, replace=TRUE), 
                       Response_Option_B=sample(responses, numobs, replace=TRUE), 
                       Response_Option_C=sample(responses, numobs, replace=TRUE),
                       Response_Option_D=sample(responses, numobs, replace=TRUE),
                       stringsAsFactors = FALSE)

my_example
#>    id Response_Option_A Response_Option_B Response_Option_C Response_Option_D
#> 1   1         Unchecked         Unchecked         Unchecked           Checked
#> 2   2         Unchecked         Unchecked         Unchecked         Unchecked
#> 3   3         Unchecked         Unchecked         Unchecked           Checked
#> 4   4         Unchecked           Checked         Unchecked           Checked
#> 5   5           Checked         Unchecked         Unchecked           Checked
#> 6   6         Unchecked         Unchecked         Unchecked         Unchecked
#> 7   7           Checked         Unchecked           Checked           Checked
#> 8   8           Checked           Checked         Unchecked         Unchecked
#> 9   9           Checked         Unchecked         Unchecked         Unchecked
#> 10 10         Unchecked         Unchecked         Unchecked           Checked

My initial inclination was to try this, but it returns the total number of responses checked, rather than the number in each column.

my_example %>%
  select(starts_with("Response_Option_")) %>%
    summarise(checked=sum(.=="Checked"))
#>   checked
#> 1      13

^{Created on 2020-08-10 by the reprex package (v0.3.0)}

Help summarizing these responses efficiently is appreciated.

Eric · Accepted Answer · 2020-08-10T23:13:20.533

This is a tidyverse approach to show the response totals per column, not by row. I assume, by how you worded your question, that this is what you are looking for. Also included is the starts_with() function which was included in your question tags.

We can use pivot_longer() to pivot the response features from wide to long then use group_by to define the variables taking the existing table and converting it to a grouped table where the summarise() operation is used to create a new data frame with rows provided for each combination of grouping variables.

library(tidyverse)
set.seed(1234)
responses<-c("Checked", "Unchecked")
numobs<-10

my_example<-data.frame(id=1:10, 
                       Response_Option_A=sample(responses, numobs, replace=TRUE), 
                       Response_Option_B=sample(responses, numobs, replace=TRUE), 
                       Response_Option_C=sample(responses, numobs, replace=TRUE),
                       Response_Option_D=sample(responses, numobs, replace=TRUE),
                       stringsAsFactors = FALSE)

my_example %>% 
  pivot_longer(starts_with("Response_"), names_to = "Responses", 
               values_to = "value") %>% 
  group_by(Responses, value) %>%
  summarise(total_responses = n())


#> # A tibble: 8 x 3
#> # Groups:   Responses [4]
#>   Responses         value     total_responses
#>   <chr>             <chr>               <int>
#> 1 Response_Option_A Checked                 4
#> 2 Response_Option_A Unchecked               6
#> 3 Response_Option_B Checked                 2
#> 4 Response_Option_B Unchecked               8
#> 5 Response_Option_C Checked                 1
#> 6 Response_Option_C Unchecked               9
#> 7 Response_Option_D Checked                 6
#> 8 Response_Option_D Unchecked               4

^{Created on 2020-08-10 by the reprex package (v0.3.0)}

If you want just the Checked responses, you can add the following line of code after the summarise() operation:

filter(value == "Checked")

#> # A tibble: 4 x 3
#> # Groups:   Responses [4]
#>   Responses         value   total_responses
#>   <chr>             <chr>             <int>
#> 1 Response_Option_A Checked               4
#> 2 Response_Option_B Checked               2
#> 3 Response_Option_C Checked               1
#> 4 Response_Option_D Checked               6

score 1 · Answer 2 · answered Aug 10 '20 at 22:52

1

Check the tidyREDCap package. It has a bunch of functions to help process check all that apply variables coming from REDCap. The package is on CRAN and the website which is on github.io has the vignettes off the articles drop down at the top of the page.

answered Aug 10 '20 at 22:52

itsMeInMiami

2,324
1
13
34

1

Can you give an example in this answer? – Gabriel J. Odom Aug 11 '20 at 00:24

Ronak Shah · Answer 3 · 2020-08-11T11:12:54.407

1

You can use summarise with across :

library(dplyr)
my_example %>%
  summarise(across(starts_with("Response_Option_"), ~sum(. == 'Checked')))

#  Response_Option_A Response_Option_B Response_Option_C Response_Option_D
#1                 4                 2                 1                 6

In older version of dplyr you could use summarise_at :

my_example %>%
     summarise_at(vars(starts_with("Response_Option_")), ~sum(. == 'Checked'))

edited Aug 11 '20 at 11:12

answered Aug 11 '20 at 07:50

Ronak Shah

377,200
20
156
213

I needed to count the number of "Checked" values in each column, not each row. – user2230555 Aug 11 '20 at 11:08

Duck · Answer 4 · 2020-08-10T22:24:04.317

A very base R solution would be:

my_example$checked <- apply(my_example[,which(grepl('Response_Option_',names(my_example)))],1,
                            function(x) length(which(x=="Checked")))

Output:

   id Response_Option_A Response_Option_B Response_Option_C Response_Option_D checked
1   1         Unchecked         Unchecked         Unchecked           Checked       1
2   2         Unchecked         Unchecked         Unchecked         Unchecked       0
3   3         Unchecked         Unchecked         Unchecked           Checked       1
4   4         Unchecked           Checked         Unchecked           Checked       2
5   5           Checked         Unchecked         Unchecked           Checked       2
6   6         Unchecked         Unchecked         Unchecked         Unchecked       0
7   7           Checked         Unchecked           Checked           Checked       3
8   8           Checked           Checked         Unchecked         Unchecked       2
9   9           Checked         Unchecked         Unchecked         Unchecked       1
10 10         Unchecked         Unchecked         Unchecked           Checked       1

Also a very optimal way with credits for @r2evans:

my_example$checked <- rowSums(my_example[, grep("^Response_", colnames(my_example))] == "Checked")

Which produces same previous output and it more readable.

or just `rowSums(my_example[, grep("^Response_", colnames(my_example))] == "Checked")` (assigned to a column). There should be no need to use both `which` and `grepl`, grepl or grep by itself should suffice. — r2evans, Aug 10 '20 at 22:18
@r2evans It is a great suggestion, could I add to this answer to improve it? — Duck, Aug 10 '20 at 22:20
@r2evans Great, I have added the optimal way you mentioned. Many thanks for allowing it. — Duck, Aug 10 '20 at 22:25

Using the Tidyverse to the count the frequency of a specific string in REDCap data across multiple columns

4 Answers4