How to select groups with data that can be compared within multiple groups (using dplyr and rstatix)?

Question

I want to run t-tests by gender within groups. I have two group variables (group_1 and group_2) and multiple outcome variables (var1 and var2 - though in my dataset I have many variables).

#Packages
library(dplyr)
library(tidyr)   # gather() comes from tidyr, not reshape2
library(rstatix)
##Dataset
group_1 <-c(rep("Group X", 40), rep("Group Y", 40), 
                      rep("Group Z", 60), rep("Group Y", 20),
                      rep("Group Z", 50), rep("Group Y", 10))
group_2 <- c(rep("A", 100), rep("B", 20), rep("C", 50), rep("A", 20), rep("B", 30))
var1 <- rnorm(n=220, mean=0, sd=1)
var2 <- rnorm(n = 220, mean = 1, sd=1.3)
gender <- c(rep("M", 30), rep("F", 30), rep("M", 40) , rep("F", 50), rep("M", 20), 
            rep("F", 20), rep("M", 30))
data <- as.data.frame(cbind(group_1, group_2, var1, var2, gender))

##Groupings
table(data$group_1, data$group_2, data$gender)

#Long format
g_long <- gather(data, variable, value, var1:var2)
g_long$value <- as.numeric(g_long$value)

#T-tests for each variable within groups
g_test <- g_long %>%
  group_by(variable, group_1, group_2) %>%
  t_test(value ~ gender, p.adjust.method = "holm", paired=FALSE)

The above code gives me the error below:

Error: Problem with `mutate()` input `data`.
x not enough 'y' observations
i Input `data` is `map(.data$data, .f, ...)`.

This code does work with only one grouping variable, or if I remove the problematic category:

#this works
g_test <- g_long %>%
  group_by(variable, group_1) %>%
  t_test(value ~ gender, p.adjust.method = "holm", paired=FALSE)

#Manually remove category where I cannot calculate gender diff - this works
g_long1 <- g_long[!(g_long$group_1 == "Group Y" & g_long$group_2 == "B"),]

g_test <- g_long1 %>%
  group_by(variable, group_1, group_2) %>%
  t_test(value ~ gender, p.adjust.method = "holm", paired=FALSE)

There are no women in the Group Y & B category, so the code works if I remove that category manually. I tried something like the code below to detect and remove such categories automatically, but it doesn't help: it only drops gender subgroups that have fewer than 5 rows, so it cannot remove a category in which one gender is missing entirely.

g_long <- g_long %>% 
  group_by(group_1, group_2, variable, gender) %>% 
  filter(n() >= 5)

How can I automatically remove categories for which I cannot run t-tests? I have more than 3 categories for each group in my dataset, so manually selecting each group would be difficult.

akrun · Accepted Answer · 2021-08-06T17:59:25.780

We can use nest_by and create a list column with transmute, using a logical condition that checks the number of distinct (n_distinct) values of 'gender' in each group. The t-test is run only where more than one gender is present; otherwise NA is returned.

library(dplyr)
library(rstatix)
g_long %>% 
   nest_by(variable, group_1, group_2) %>%
   transmute(out = list(if(n_distinct(data$gender) > 1) data %>%
       t_test(value ~ gender, p.adjust.method = "holm", 
        paired=FALSE) else NA)) %>%
   ungroup

-output

# A tibble: 14 x 4
   variable group_1 group_2 out                   
   <chr>    <chr>   <chr>   <list>                
 1 var1     Group X A       <rstatix_test [1 × 8]>
 2 var1     Group Y A       <rstatix_test [1 × 8]>
 3 var1     Group Y B       <lgl [1]>             
 4 var1     Group Y C       <rstatix_test [1 × 8]>
 5 var1     Group Z A       <rstatix_test [1 × 8]>
 6 var1     Group Z B       <rstatix_test [1 × 8]>
 7 var1     Group Z C       <rstatix_test [1 × 8]>
 8 var2     Group X A       <rstatix_test [1 × 8]>
 9 var2     Group Y A       <rstatix_test [1 × 8]>
10 var2     Group Y B       <lgl [1]>             
11 var2     Group Y C       <rstatix_test [1 × 8]>
12 var2     Group Z A       <rstatix_test [1 × 8]>
13 var2     Group Z B       <rstatix_test [1 × 8]>
14 var2     Group Z C       <rstatix_test [1 × 8]>

To extract the list element use unnest

library(tidyr)
g_long %>% 
   nest_by(variable, group_1, group_2) %>%
   transmute(out = list(if(n_distinct(data$gender) > 1) data %>%
       t_test(value ~ gender, p.adjust.method = "holm", 
        paired=FALSE) else NA)) %>%
   ungroup %>% 
   unnest(out)
# A tibble: 14 x 12
   variable group_1 group_2 .y.   group1 group2    n1    n2 statistic    df      p out  
   <chr>    <chr>   <chr>   <chr> <chr>  <chr>  <int> <int>     <dbl> <dbl>  <dbl> <lgl>
 1 var1     Group X A       value F      M         10    30  -0.350    30.7  0.729 NA   
 2 var1     Group Y A       value F      M         20    20  -0.0286   37.7  0.977 NA   
 3 var1     Group Y B       <NA>  <NA>   <NA>      NA    NA  NA        NA   NA     NA   
 4 var1     Group Y C       value F      M         10    10   0.221    17.0  0.828 NA   
 5 var1     Group Z A       value F      M         20    20  -0.0811   38.0  0.936 NA   
 6 var1     Group Z B       value F      M         20    20  -1.03     34.7  0.309 NA   
 7 var1     Group Z C       value F      M         20    10  -1.17     20.3  0.256 NA   
 8 var2     Group X A       value F      M         10    30  -0.601    13.0  0.558 NA   
 9 var2     Group Y A       value F      M         20    20  -0.824    36.8  0.415 NA   
10 var2     Group Y B       <NA>  <NA>   <NA>      NA    NA  NA        NA   NA     NA   
11 var2     Group Y C       value F      M         10    10  -0.00521  17.6  0.996 NA   
12 var2     Group Z A       value F      M         20    20  -0.956    38.0  0.345 NA   
13 var2     Group Z B       value F      M         20    20   0.593    31.2  0.557 NA   
14 var2     Group Z C       value F      M         20    10  -1.57     17.0  0.136 NA
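
If the NA placeholder rows are not needed, one alternative is to drop the untestable categories before calling t_test. The sketch below is only an illustration on made-up data: toy_long is a hypothetical data frame with the same column names as g_long, and the extra requirement of at least 2 observations per gender is an assumption added here, not part of the original answer.

```r
library(dplyr)
library(rstatix)

set.seed(42)  # toy data only; the values are arbitrary

# Hypothetical long-format data: "Group Y" / "B" contains men only
toy_long <- data.frame(
  variable = "var1",
  group_1  = rep(c("Group X", "Group Y", "Group Z"), each = 8),
  group_2  = rep(c("A", "B", "A"), each = 8),
  gender   = c(rep(c("M", "F"), 4), rep("M", 8), rep(c("M", "F"), 4)),
  value    = rnorm(24)
)

g_test <- toy_long %>%
  group_by(variable, group_1, group_2) %>%
  # keep a category only if both genders are present,
  # each with at least 2 observations (assumed threshold)
  filter(n_distinct(gender) == 2, min(table(gender)) >= 2) %>%
  t_test(value ~ gender, paired = FALSE)

g_test
```

Because filter is applied to the grouped data, a whole category is kept or dropped at once, and the grouping carries over to t_test. p.adjust.method is omitted in this sketch because each category has only a single comparison (F vs M).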

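Regarding the error in the OP's post: t_test needs observations for both gender levels within each group, but the number of unique 'gender' values is 1 for group_1 'Group Y' and group_2 'B' (men only), which is why it fails with "not enough 'y' observations".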
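
To see which categories would trigger this, the number of distinct genders per category can be counted directly. This is a minimal sketch on hypothetical toy data (toy, gender_counts and untestable are names made up for illustration); with the question's data, g_long would take the place of toy.

```r
library(dplyr)

# Hypothetical toy data with the same column names as the question
toy <- data.frame(
  group_1 = rep(c("Group X", "Group Y"), each = 4),
  group_2 = rep(c("A", "B"), each = 4),
  gender  = c("M", "F", "M", "F", "M", "M", "M", "M"),
  value   = c(0.1, 0.4, -0.3, 0.8, 1.2, 0.5, -0.7, 0.2)
)

# Count distinct genders and rows in each category
gender_counts <- toy %>%
  group_by(group_1, group_2) %>%
  summarise(n_gender = n_distinct(gender), n_rows = n(), .groups = "drop")

# Categories with fewer than 2 genders cannot be compared by gender
untestable <- filter(gender_counts, n_gender < 2)
print(untestable)
```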
