Wilcoxon Signed Rank Test with huge data

Question

I've been trying to perform Wilcoxon Signed Rank Test on my huge dataset. below is a representative of my sample

sample          Abiotrophia          Abyssicoccus        Acaryochloris                        yield  day  season
R11P4_BS       0.454828660436137     8.71569632259294    0                                    low    60   2
R13P1_BS       0.013239875389408     10.8649859288577    0.147574819401445                    high   60   1
R13P3_BS       0                     5.13545281606475    0.386996904024768                    low    30   1

I would like to compare and get a p-value for my sample group (ex: yield-high and low) for each bacteria (Abyssicoccus albus, Acaryochloris marina, Acetilactobacillus jinshanensis) baed on Wilcoxon Signed Rank Test. Final goal is to determine which bacteria differs significantly in 'high' and 'low'.

 > dput(head(dat))
    structure(list(Abiotrophia = c(0, 3.21408045977011, 0.117816091954023, 
    0, 0.002873563218391, 0), Abyssicoccus = c(0.454828660436137, 
    0.013239875389408, 0, 0, 0, 0.007009345794393), Acaryochloris = c(8.71569632259294, 
    10.8649859288577, 5.13545281606475, 5.72940386089162, 0.888392623745432, 
    3.93946335292641), day = c(0L, 60L, 60L, 60L, 90L, 90L), yield = c("low", 
    "high", "high", "low", "high", "high"), season = c(1L, 1L, 
    1L, 1L, 1L, 1L)), row.names = c("R11P4_BS", "R13P1_BS", "R13P3_BS", 
"R13P6_BS", "R14P1_BS", "R14P3_BS"), class = "data.frame")

this is what i managed to do so far, i believe this shows the p-value for data in row no7

      wilcox.test(data[data$yield == "low", 7], 
                  data[data$yield == "high", 7], exact=FALSE)$p.val

[1] 0.09657031

and the code below gave me error:

sapply(2:ncol(data), 
  function(x) {
      wilcox.test(data[data$yield == "low", 7], 
                  data[data$yield == "high", 7], exact=FALSE)$p.val
  }
)

jared_mamrot · Answer 1 · 2023-04-05T05:36:39.163

When you say 'huge', if you mean you have many rows, perhaps this approach would suit:

library(tidyverse)

df <- structure(list(Abiotrophia = c(0, 3.21408045977011, 0.117816091954023, 
                                     0.1, 0.002873563218391, 0.001), Abyssicoccus = c(0.454828660436137, 
                                                                                0.013239875389408, 0, 0.1, 0.1, 0.007009345794393), Acaryochloris = c(8.71569632259294, 
                                                                                                                                                  10.8649859288577, 5.13545281606475, 5.72940386089162, 0.888392623745432, 
                                                                                                                                                  3.93946335292641), day = c(0L, 60L, 60L, 60L, 90L, 90L), yield = c("low", 
                                                                                                                                                                                                                     "high", "high", "low", "high", "low"), season = c(1L, 1L, 
                                                                                                                                                                                                                                                                        1L, 1L, 1L, 1L)), row.names = c("R11P4_BS", "R13P1_BS", "R13P3_BS", 
                                                                                                                                                                                                                                                                                                        "R13P6_BS", "R14P1_BS", "R14P3_BS"), class = "data.frame")

wilcox_pvalues <- function(dataset, species) {
  tmp <- dataset %>%
    select({{species}}, yield) %>%
    split(.$yield)
  stack(Map(function(x, y) wilcox.test(x, y, exact = FALSE)$p.value, tmp[[1]][1], tmp[[2]][1]))
}

wilcox_pvalues(df, Abiotrophia)
#>      values         ind
#> 1 0.1904303 Abiotrophia
wilcox_pvalues(df, Abyssicoccus)
#>      values          ind
#> 1 0.5065552 Abyssicoccus
wilcox_pvalues(df, Acaryochloris)
#>   values           ind
#> 1      1 Acaryochloris

^{Created on 2023-04-05 with reprex v2.0.2}

If you have lots of species, here is a different option:

library(tidyverse)

# pivot to 'long' format and drop columns
df2 <- df %>%
  pivot_longer(-c(day, yield, season)) %>%
  select(-c(day, season))
df2
#> # A tibble: 18 × 3
#>    yield name             value
#>    <chr> <chr>            <dbl>
#>  1 low   Abiotrophia    0      
#>  2 low   Abyssicoccus   0.455  
#>  3 low   Acaryochloris  8.72   
#>  4 high  Abiotrophia    3.21   
#>  5 high  Abyssicoccus   0.0132 
#>  6 high  Acaryochloris 10.9    
#>  7 high  Abiotrophia    0.118  
#>  8 high  Abyssicoccus   0      
#>  9 high  Acaryochloris  5.14   
#> 10 low   Abiotrophia    0.1    
#> 11 low   Abyssicoccus   0.1    
#> 12 low   Acaryochloris  5.73   
#> 13 high  Abiotrophia    0.00287
#> 14 high  Abyssicoccus   0.1    
#> 15 high  Acaryochloris  0.888  
#> 16 low   Abiotrophia    0.001  
#> 17 low   Abyssicoccus   0.00701
#> 18 low   Acaryochloris  3.94

# calculate wilcox p-values
df2 %>%
  group_by(name) %>%
  summarise(pval = wilcox.test(value~yield, paired=FALSE, exact = FALSE)$p.value)
#> # A tibble: 3 × 2
#>   name           pval
#>   <chr>         <dbl>
#> 1 Abiotrophia   0.190
#> 2 Abyssicoccus  0.507
#> 3 Acaryochloris 1

^{Created on 2023-04-05 with reprex v2.0.2}

Wilcoxon Signed Rank Test with huge data

1 Answers1