How to perform statistical test using dplyr grouping and then make tibble with broom

Question

I have the following data frame:

library(tidyverse)

dat <- structure(list(charge.Group3 = c(0.167, 0.167, 0.1, 0.067, 0.033, 
0.033, 0.067, 0.133, 0.2, 0.067, 0.133, 0.114, 0.167, 0.033, 
0.1, 0.033, 0.133, 0.267, 0.133, 0.233, 0.1, 0.167, 0.067, 0.133, 
0.1, 0.133, 0.1, 0.133, 0.1, 0.067, 0.167, 0), hydrophobicity.Group3 = c(0.267, 
0.467, 0.067, 0.167, 0.267, 0.1, 0.367, 0.233, 0.367, 0.233, 
0.133, 0.205, 0.333, 0.267, 0.267, 0.067, 0.133, 0.3, 0.233, 
0.267, 0.5, 0.333, 0.2, 0.5, 0.5, 0.4, 0.033, 0.3, 0.233, 0.5, 
0.233, 0.033), class = c("Negative", "Negative", "Positive", 
"Positive", "Positive", "Positive", "Positive", "Negative", "Positive", 
"Positive", "Positive", "Positive", "Positive", "Positive", "Negative", 
"Positive", "Negative", "Negative", "Negative", "Negative", "Negative", 
"Negative", "Negative", "Negative", "Negative", "Negative", "Positive", 
"Positive", "Positive", "Negative", "Positive", "Negative")), class = c("tbl_df", 
"tbl", "data.frame"), row.names = c(NA, -32L))

dat
#> # A tibble: 32 x 3
#>    charge.Group3 hydrophobicity.Group3 class   
#>            <dbl>                 <dbl> <chr>   
#>  1         0.167                 0.267 Negative
#>  2         0.167                 0.467 Negative
#>  3         0.1                   0.067 Positive
#>  4         0.067                 0.167 Positive
#>  5         0.033                 0.267 Positive
#>  6         0.033                 0.1   Positive
#>  7         0.067                 0.367 Positive
#>  8         0.133                 0.233 Negative
#>  9         0.2                   0.367 Positive
#> 10         0.067                 0.233 Positive
#> # ... with 22 more rows

For each feature (charge.Group3 and hydrophobicity.Group3), I want to perform wilcox.test between the Negative and Positive classes, and finally get the p-values as a data frame or tibble:

features                      pvalue
charge.Group3                 0.1088  
hydrophobicity.Group3         0.03895
# computed by hand

Note that there are actually more than 2 features. How can I achieve that?

AntoniosK · Accepted Answer · 2018-08-26T13:25:46.933

You don't really need to use broom if you need only the p-value of the tests.

library(tidyverse)


dat %>% 
  gather(group, value, -class) %>%    # reshape data            
  nest(-group) %>%                    # for each group nest data
  mutate(pval = map_dbl(data, ~wilcox.test(value ~ class, data = .)$p.value)) %>%  # get p value for wilcoxon test
  select(-data)                       # remove data column


# # A tibble: 2 x 2
#   group                   pval
#   <chr>                  <dbl>
# 1 charge.Group3         0.109 
# 2 hydrophobicity.Group3 0.0390

Reshaping first will enable you to apply this process no matter how many columns you have, assuming that class is the only extra variable.
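
On tidyr 1.0.0 or later, gather() is superseded and nest(-group) emits a warning, so the same reshape-then-nest pipeline might be written roughly as below. This is only a sketch on made-up data: `toy`, `feat_a`, and `feat_b` are hypothetical names, not part of the question's data set.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Hypothetical toy data: two numeric features plus a two-level class column
set.seed(1)
toy <- tibble(
  feat_a = rnorm(20),
  feat_b = rnorm(20, mean = rep(c(0, 1), each = 10)),
  class  = rep(c("Negative", "Positive"), each = 10)
)

res <- toy %>%
  pivot_longer(-class, names_to = "group", values_to = "value") %>%  # reshape to long format
  nest(data = -group) %>%                                             # one row per feature
  mutate(pval = map_dbl(data, ~ wilcox.test(value ~ class, data = .x)$p.value)) %>%
  select(-data)                                                       # drop the nested data

res
```

Because every column except class is reshaped, adding more feature columns needs no change to the pipeline.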

Or you can even avoid map, as @Moody_Mudskipper suggested, by grouping and summarizing directly:

dat %>% 
  gather(group, value, -class) %>% 
  group_by(group) %>% 
  summarize(results = wilcox.test(value ~ class)$p.value)

If you really want to involve broom then you can do

library(broom)

dat %>% 
   gather(group, value, -class) %>%  
   nest(-group) %>%                  
   mutate(results = map(data, ~tidy(wilcox.test(value ~ class, data = .)))) %>%
   select(-data) %>%
   unnest(results)

# # A tibble: 2 x 5
# group                 statistic p.value method                                            alternative
#   <chr>                     <dbl>   <dbl> <chr>                                             <chr>      
# 1 charge.Group3              170.  0.109  Wilcoxon rank sum test with continuity correction two.sided  
# 2 hydrophobicity.Group3      183   0.0390 Wilcoxon rank sum test with continuity correction two.sided

This returns more columns, but you can keep only the p-value if you want.
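
To get from the broom output to the two-column layout the question asks for, one option is to select and rename after unnesting. A minimal sketch, assuming tidyr 1.0.0 or later for nest(data = ...) and using hypothetical toy data (`toy`, `feat_a`, `feat_b` are made up for illustration):

```r
library(dplyr)
library(tidyr)
library(purrr)
library(broom)

# Hypothetical toy data: two numeric features plus a two-level class column
set.seed(1)
toy <- tibble(
  feat_a = rnorm(20),
  feat_b = rnorm(20, mean = rep(c(0, 1), each = 10)),
  class  = rep(c("Negative", "Positive"), each = 10)
)

res <- toy %>%
  pivot_longer(-class, names_to = "group", values_to = "value") %>%
  nest(data = -group) %>%
  mutate(results = map(data, ~ tidy(wilcox.test(value ~ class, data = .x)))) %>%
  select(-data) %>%
  unnest(results) %>%                         # statistic, p.value, method, alternative
  select(features = group, pvalue = p.value)  # keep and rename only what is needed

res
```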

I think you can make it really nice and idiomatic by skipping the nesting, grouping is enough here : `dat %>% gather(group, value, -class) %>% group_by(group) %>% summarize(pval = wilcox.test(value ~ class)$p.value)` (upvoted in any case) — moodymudskipper, Aug 14 '18 at 10:05
Indeed. That's what I had in mind but for some reason today I used `summarize(results = wilcox.test(value ~ class, data = .)$p.value)` and it didn't work due to that `data = .` thing! :( Thanks for the reminder. — AntoniosK, Aug 14 '18 at 10:10

moodymudskipper · Answer 2 · 2018-08-14T09:54:36.873

Here's a way to do it with dplyr::summarize_at and tidyr::gather :

library(tidyverse)
dat %>%
  summarize_at(c("charge.Group3","hydrophobicity.Group3"),
               ~wilcox.test(.x ~ .y)$p.value, .$class) %>%
  gather(features, pvalue)

# # A tibble: 2 x 2
#                features pvalue
#                   <chr>  <dbl>
# 1         charge.Group3  0.109
# 2 hydrophobicity.Group3  0.039

To summarize all variables except class:

dat %>%
  summarize_at(vars(-class),
               ~wilcox.test(.x ~ .y)$p.value,
               .$class) %>%
  gather(features,pvalue)
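
In dplyr 1.0.0 and later, summarize_at() is superseded by across(). Assuming such a version, the same idea might look like the sketch below; the toy data and the names `toy`, `feat_a`, and `feat_b` are hypothetical and only there to make the example self-contained.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: two numeric features plus a two-level class column
set.seed(1)
toy <- tibble(
  feat_a = rnorm(20),
  feat_b = rnorm(20, mean = rep(c(0, 1), each = 10)),
  class  = rep(c("Negative", "Positive"), each = 10)
)

res <- toy %>%
  # apply the test to every column except class; class is found in the data itself
  summarise(across(-class, ~ wilcox.test(.x ~ class)$p.value)) %>%
  pivot_longer(everything(), names_to = "features", values_to = "pvalue")

res
```

Here class no longer has to be passed in as an extra argument (the .$class above), since the lambda can refer to other columns directly.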

edited Aug 14 '18 at 09:54

answered Aug 14 '18 at 09:32


Thanks. But how can I generalize your code? There are more than 2 features, so I cannot hard-code them in summarize_at. – littleworth Aug 14 '18 at 09:34
Yes. Everything except class. – littleworth Aug 14 '18 at 09:50
changed to formula notation as it's much more compact (inspired by @AntoniosK) – moodymudskipper Aug 14 '18 at 09:55
